abhinavnmagic committed
Commit: c74766e
Parent(s): faef99c
Update README.md
README.md CHANGED
@@ -29,8 +29,7 @@ This model was obtained by quantizing the weights of [Phi-3-mini-4k-instruct](ht
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 
 Only the weights of the linear operators within transformer blocks are quantized. Symmetric group-wise quantization is applied, in which a linear scaling per group maps the INT4 and floating-point representations of the quantized weights.
-[
-
+The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. Quantization is performed with a 1% damping factor, a group size of 128, and 512 sequences sampled from [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus).
 
 ## Deployment
 
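As an aside, the symmetric group-wise scheme described above amounts to one linear scale per group of weights. A minimal NumPy sketch of the mapping (illustrative only; not code from the model card or llm-compressor):

```python
import numpy as np

def quantize_group(w: np.ndarray):
    """Symmetric INT4 quantization of one weight group (e.g. 128 values)."""
    # One linear scale per group, chosen so the largest magnitude maps to 7
    # (symmetric INT4 uses [-8, 7]; -8 is unused under a symmetric scheme).
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # INT4 stored in int8
    return q, scale

def dequantize_group(q: np.ndarray, scale: float) -> np.ndarray:
    # The same linear scale maps the INT4 values back to floating point.
    return q.astype(np.float32) * scale

w = np.random.randn(128).astype(np.float32)  # one group of 128 weights
q, scale = quantize_group(w)
w_hat = dequantize_group(q, scale)
print(np.max(np.abs(w - w_hat)))  # round-trip error, bounded by scale / 2
```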
@@ -67,8 +66,7 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
 
 ### Use with transformers
 
-
-The following example contemplates how the model can be used using the `generate()` function.
+The following example shows how the model can be deployed with Transformers using the `generate()` function.
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
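Only the imports of the Transformers snippet appear in this hunk. A minimal sketch of the `generate()` flow it refers to; the repository id, prompt, and generation settings here are assumptions rather than values from the card:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed repository id; substitute the actual quantized checkpoint.
model_id = "neuralmagic/Phi-3-mini-128k-instruct-quantized.w4a16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is INT4 quantization?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
response = output[0][input_ids.shape[-1]:]  # keep only newly generated tokens
print(tokenizer.decode(response, skip_special_tokens=True))
```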
@@ -112,12 +110,12 @@ print(tokenizer.decode(response, skip_special_tokens=True))
 
 ## Creation
 
-This model was created by
-Although AutoGPTQ was used for this particular model, Neural Magic is transitioning to using [llm-compressor](https://github.com/vllm-project/llm-compressor) which supports several quantization schemes and models not supported by AutoGPTQ.
+This model was created using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as presented in the code snippet below.
 
 ```python
 from transformers import AutoTokenizer
-from
+from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
 from datasets import load_dataset
 import random
 
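The calibration setup the snippet relies on (`model_id`, `ds`, `examples`, `num_samples`, `max_seq_len`) lies outside this hunk. A plausible sketch under stated assumptions; only the 512-sequence Open-Platypus calibration set is confirmed by the card:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed identifiers; only the dataset and sample count are stated in the card.
model_id = "microsoft/Phi-3-mini-128k-instruct"
num_samples = 512
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)

ds = load_dataset("garage-bAInd/Open-Platypus", split="train")
ds = ds.shuffle(seed=42).select(range(num_samples))

examples = [
    # "instruction" is an assumed field name for Open-Platypus prompts.
    tokenizer(example["instruction"], max_length=max_seq_len, truncation=True)
    for example in ds
]
```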
@@ -141,26 +139,31 @@ examples = [
 ) for example in ds
 ]
 
-
-
-
-
-
-    damp_percent=0.1,
+recipe = GPTQModifier(
+    targets="Linear",
+    scheme="W4A16",
+    ignore=["lm_head"],
+    dampening_frac=0.1,
 )
 
-model =
+model = SparseAutoModelForCausalLM.from_pretrained(
     model_id,
-    quantize_config,
     device_map="auto",
+    trust_remote_code=True,
+)
+
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=max_seq_len,
+    num_calibration_samples=num_samples,
 )
 
-model.quantize(examples)
 model.save_pretrained("Phi-3-mini-128k-instruct-quantized.w4a16")
 ```
 
 
-
 ## Evaluation
 
 The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
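The command itself is cut off at the end of the diff and is left as-is above. Purely for orientation, a comparable evaluation can be driven from Python through the harness's `simple_evaluate` entry point (an illustrative sketch, not the card's actual command; the task list and arguments are assumptions):

```python
from lm_eval import simple_evaluate

# Illustrative only: the card's actual CLI invocation is not shown in this
# hunk. Task names and arguments are assumptions; the OpenLLM leaderboard
# also prescribes task-specific few-shot counts not reproduced here.
results = simple_evaluate(
    model="vllm",
    model_args="pretrained=Phi-3-mini-128k-instruct-quantized.w4a16,dtype=auto",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande", "gsm8k"],
    batch_size="auto",
)
print(results["results"])
```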