edumunozsala committed on
Commit 2b701bb
1 Parent(s): ed45863

Upload README.md

Add a README file

Files changed (1):
  1. README.md +93 -0

README.md ADDED

---
tags:
- llama-2
- gptq
- quantization
- code
model-index:
- name: Llama-2-7b-4bit-GPTQ-python-coder
  results: []
license: gpl-3.0
language:
- code
datasets:
- iamtarun/python_code_instructions_18k_alpaca
pipeline_tag: text-generation
library_name: transformers
---

# Llama 2 7B 4-bit GPTQ Python Coder 👩‍💻

This model is the **GPTQ quantization of my Llama 2 7B 4-bit Python Coder**. The base model is available [here](https://huggingface.co/edumunozsala/llama-2-7b-int4-python-code-20k).

The quantization parameters for the GPTQ algorithm are listed below; a code sketch using these values follows the list.
- Bits: 4
- Group size: 128
- Calibration dataset: C4
- Descending activation order (`desc_act`): False
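
As an illustration only (this is not the exact script used to produce this checkpoint), the parameters above map onto the `GPTQConfig` class in `transformers` roughly as follows. The base model id is the Python Coder linked above; running the quantization requires a GPU and the `optimum` and `auto-gptq` packages.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model_id = "edumunozsala/llama-2-7b-int4-python-code-20k"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# GPTQ settings mirroring the parameters listed above (illustrative, not the original script)
gptq_config = GPTQConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # group size 128
    dataset="c4",        # C4 calibration dataset
    desc_act=False,      # do not reorder columns by descending activation
    tokenizer=tokenizer,
)

# Quantization runs while loading the full-precision weights
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, quantization_config=gptq_config, device_map="auto"
)

# Save the quantized model and tokenizer locally
model.save_pretrained("llama-2-7b-int4-GPTQ-python-code-20k")
tokenizer.save_pretrained("llama-2-7b-int4-GPTQ-python-code-20k")
```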

## Model Description

[Llama 2 7B 4-bit Python Coder](https://huggingface.co/edumunozsala/llama-2-7b-int4-python-code-20k) is a fine-tuned version of the Llama 2 7B model, trained with QLoRA in 4-bit precision using the [PEFT](https://github.com/huggingface/peft) library and bitsandbytes.
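
For background, below is a minimal sketch of what a QLoRA setup with PEFT and bitsandbytes looks like. The base checkpoint name and every hyperparameter here are placeholders chosen for illustration, not the configuration actually used to train the Python Coder.

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit NF4 via bitsandbytes (placeholder settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated base checkpoint, assumed here for illustration
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a LoRA adapter on top of the frozen 4-bit weights (values are illustrative)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```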

## Quantization

A quick definition, extracted from a great Medium article by Benjamin Marie, ["GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2"](https://medium.com/towards-data-science/gptq-or-bitsandbytes-which-quantization-method-to-use-for-llms-examples-with-llama-2-f79bc03046dc) (for Medium subscribers only):

*"GPTQ (Frantar et al., 2023) was first applied to models ready to deploy. In other words, once the model is fully fine-tuned, GPTQ will be applied to reduce its size. GPTQ can lower the weight precision to 4-bit or 3-bit.
In practice, GPTQ is mainly used for 4-bit quantization. 3-bit has been shown very unstable (Dettmers and Zettlemoyer, 2023). It quantizes without loading the entire model into memory. Instead, GPTQ loads and quantizes the LLM module by module.
Quantization also requires a small sample of data for calibration which can take more than one hour on a consumer GPU."*
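
To make the size reduction concrete, one quick check is the memory footprint of the quantized checkpoint; the snippet below is only a sketch, and the exact figure depends on your environment.

```py
from transformers import AutoModelForCausalLM

# Load the 4-bit GPTQ checkpoint and report its in-memory size
model = AutoModelForCausalLM.from_pretrained(
    "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k", device_map="auto"
)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```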

### Example of usage

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading a GPTQ checkpoint through transformers requires the optimum and auto-gptq packages
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

instruction = "Write a Python function to display the first and last elements of a list."
input_text = ""

# Build the instruction prompt in the same format used for fine-tuning
prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input_text}

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.to(model.device)

with torch.inference_mode():
    outputs = model.generate(input_ids=input_ids, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)

print(f"Prompt:\n{prompt}\n")
# Strip the prompt from the decoded output to show only the generated code
print(f"Generated response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
```
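
Alternatively, the same checkpoint can be driven through the `pipeline` API; this is a common pattern rather than something from the original card, and the shortened prompt below is only for illustration (the full template above is preferred).

```py
from transformers import pipeline

model_id = "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k"

# Build a text-generation pipeline directly from the quantized checkpoint
generator = pipeline("text-generation", model=model_id, device_map="auto")

prompt = "### Instruction:\nWrite a Python function to reverse a string.\n\n### Response:\n"
output = generator(prompt, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)
print(output[0]["generated_text"][len(prompt):])
```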

### Citation

```
@misc{edumunozsala_2023,
  author    = {Eduardo Muñoz},
  title     = {llama-2-7b-int4-GPTQ-python-coder},
  year      = 2023,
  url       = {https://huggingface.co/edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k},
  publisher = {Hugging Face}
}
```