---
tags:
- llama-2
- gptq
- quantization
- code
model-index:
- name: Llama-2-7b-4bit-GPTQ-python-coder
  results: []
license: gpl-3.0
language:
- code
datasets:
- iamtarun/python_code_instructions_18k_alpaca
pipeline_tag: text-generation
library_name: transformers
---

# Llama 2 7B 4-bit GPTQ Python Coder 👩‍💻

This model is the **GPTQ quantization of my Llama 2 7B 4-bit Python Coder**. The base model is available [here](https://huggingface.co/edumunozsala/llama-2-7b-int4-python-code-20k).

The GPTQ quantization parameters are:
- 4-bit quantization
- Group size: 128
- Calibration dataset: C4
- Descending activation order (`desc_act`): False
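
The exact quantization script for this checkpoint is not published here; the snippet below is only a minimal sketch of how these settings map onto the `GPTQConfig` integration in 🤗 Transformers (running it also requires the `optimum` and `auto-gptq` packages). The base model id is taken from the link above, and the output directory name is illustrative.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model_id = "edumunozsala/llama-2-7b-int4-python-code-20k"
quantized_dir = "llama-2-7b-int4-GPTQ-python-code-20k"  # illustrative output path

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# GPTQ settings matching the parameters listed above
gptq_config = GPTQConfig(
    bits=4,           # 4-bit quantization
    group_size=128,   # group size 128
    dataset="c4",     # calibration samples drawn from C4
    desc_act=False,   # descending activation order disabled
    tokenizer=tokenizer,
)

# Passing the config runs calibration and quantization while the model is loaded
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

model.save_pretrained(quantized_dir)
tokenizer.save_pretrained(quantized_dir)
```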

## Model Description

[Llama 2 7B 4-bit Python Coder](https://huggingface.co/edumunozsala/llama-2-7b-int4-python-code-20k) is a fine-tuned version of the Llama 2 7B model, trained with QLoRA in 4-bit using the [PEFT](https://github.com/huggingface/peft) library and bitsandbytes.
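
The training details live on the base model's card. Purely as an illustration of what a QLoRA setup of that kind looks like (the base model id and all hyperparameters below are placeholders, not the values actually used), the 4-bit loading and LoRA wrapping would be along these lines:

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 loading via bitsandbytes (QLoRA-style); values are illustrative
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # assumed base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections; r and alpha are placeholder values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
```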

## Quantization

A quick definition, taken from a great Medium article by Benjamin Marie, ["GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2"](https://medium.com/towards-data-science/gptq-or-bitsandbytes-which-quantization-method-to-use-for-llms-examples-with-llama-2-f79bc03046dc) (available to Medium subscribers only):

*"GPTQ (Frantar et al., 2023) was first applied to models ready to deploy. In other words, once the model is fully fine-tuned, GPTQ will be applied to reduce its size. GPTQ can lower the weight precision to 4-bit or 3-bit.
In practice, GPTQ is mainly used for 4-bit quantization. 3-bit has been shown very unstable (Dettmers and Zettlemoyer, 2023). It quantizes without loading the entire model into memory. Instead, GPTQ loads and quantizes the LLM module by module.
Quantization also requires a small sample of data for calibration which can take more than one hour on a consumer GPU."*

### Example of usage

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k"

# Load the tokenizer and the GPTQ-quantized model
# (loading GPTQ checkpoints typically requires optimum and auto-gptq installed)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

instruction = "Write a Python function to display the first and last elements of a list."
input = ""

# Build the Alpaca-style prompt used during fine-tuning
prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input}

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()

# Generate a completion and strip the prompt from the decoded output
outputs = model.generate(input_ids=input_ids, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)

print(f"Prompt:\n{prompt}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
```
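
If you prefer the higher-level `pipeline` API, the same generation can also be run as below (a small sketch reusing the `model`, `tokenizer`, and `prompt` defined in the example above):

```py
from transformers import pipeline

# Wrap the already-loaded GPTQ model in a text-generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate with the same sampling settings and strip the prompt from the output
result = generator(prompt, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)
print(result[0]["generated_text"][len(prompt):])
```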

### Citation

```
@misc{edumunozsala_2023,
  author    = { {Eduardo Muñoz} },
  title     = { llama-2-7b-int4-GPTQ-python-coder },
  year      = 2023,
  url       = { https://huggingface.co/edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k },
  publisher = { Hugging Face }
}
```