abhinavnmagic committed
Commit: c74766e
Parent: faef99c

Update README.md

Files changed (1): README.md (+20, -17)

README.md CHANGED
@@ -29,8 +29,7 @@ This model was obtained by quantizing the weights of [Phi-3-mini-4k-instruct](ht
  This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 25%.
 
  Only the weights of the linear operators within transformers blocks are quantized. Symmetric group-wise quantization is applied, in which a linear scaling per group maps the INT4 and floating point representations of the quantized weights.
- [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization with 10% damping factor, group-size as 128 and 512 sequences sampled from [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus).
-
+ The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. Quantization uses a 1% damping factor, a group size of 128, and 512 sequences sampled from [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus).
 
  ## Deployment
 
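The hunk above describes the scheme in prose: symmetric, group-wise INT4 quantization with one linear scale per group of 128 weights. As a minimal illustrative sketch of that arithmetic only (this is not llm-compressor's implementation and is not part of the diff; the clipping to ±7 and the NumPy layout are assumptions):

```python
# Illustrative only: symmetric group-wise INT4 quantization of a weight matrix,
# with one linear scale per group of 128 weights along each row.
import numpy as np

def quantize_symmetric_groupwise(w, group_size=128, bits=4):
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4; symmetric range clipped to [-7, 7] here
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax  # one scale per group
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(-1)

def dequantize(q, scales, group_size=128):
    rows, cols = q.shape
    groups = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (groups * scales[..., None]).reshape(rows, cols)

w = np.random.randn(8, 256).astype(np.float32)
q, scales = quantize_symmetric_groupwise(w)
w_hat = dequantize(q, scales)
print("max absolute reconstruction error:", np.abs(w - w_hat).max())
```

Dequantization multiplies each INT4 group by its stored scale, which is how the 4-bit storage is mapped back to an approximate floating point weight at load or compute time.
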
@@ -67,8 +66,7 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
 
  ### Use with transformers
 
- This model is supported by Transformers leveraging the integration with the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) data format.
- The following example contemplates how the model can be used using the `generate()` function.
+ The following example shows how the model can be deployed with Transformers using the `generate()` function.
 
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
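The body of the `generate()` example falls outside this diff's context window. For orientation, a self-contained sketch of how that usage typically looks with a chat model follows; the checkpoint id, prompt, and generation length are placeholders, not lines taken from the README.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical checkpoint id for illustration; substitute the actual repo id.
model_id = "neuralmagic/Phi-3-mini-128k-instruct-quantized.w4a16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat prompt and generate a completion.
messages = [{"role": "user", "content": "Explain what weight-only INT4 quantization is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
response = output_ids[0][input_ids.shape[-1]:]  # keep only the newly generated tokens
print(tokenizer.decode(response, skip_special_tokens=True))
```

The final `decode` call corresponds to the context line visible in the next hunk's header.
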
@@ -112,12 +110,12 @@ print(tokenizer.decode(response, skip_special_tokens=True))
 
  ## Creation
 
- This model was created by applying the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library as presented in the code snipet below.
- Although AutoGPTQ was used for this particular model, Neural Magic is transitioning to using [llm-compressor](https://github.com/vllm-project/llm-compressor) which supports several quantization schemes and models not supported by AutoGPTQ.
+ This model was created with the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as shown in the code snippet below.
 
  ```python
  from transformers import AutoTokenizer
- from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+ from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+ from llmcompressor.modifiers.quantization import GPTQModifier
  from datasets import load_dataset
  import random
 
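The calibration-data preparation that sits between this hunk and the next (sampling the 512 Open-Platypus sequences mentioned earlier and formatting them for calibration) is not visible in the diff. A rough sketch of what such a step usually looks like follows; the base checkpoint id, the `max_seq_len` value, the dataset field names, and the chat formatting are assumptions for illustration, while `model_id`, `ds`, `num_samples`, and `max_seq_len` are names the visible hunks already use.

```python
# Hypothetical calibration-data preparation; not the exact lines elided from the diff.
from transformers import AutoTokenizer   # repeated here so the sketch runs standalone
from datasets import load_dataset

model_id = "microsoft/Phi-3-mini-128k-instruct"  # assumption: base checkpoint being quantized
num_samples = 512                                # 512 calibration sequences, per the model card text
max_seq_len = 4096                               # assumption: not stated in the visible hunks

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Sample calibration prompts from Open-Platypus and render them as chat-formatted text.
ds = load_dataset("garage-bAInd/Open-Platypus", split="train")
ds = ds.shuffle(seed=42).select(range(num_samples))

def to_chat_text(example):
    # Assumption: Open-Platypus rows expose "instruction" and "output" fields.
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

ds = ds.map(to_chat_text)
```

A dataset prepared along these lines would then be what the `oneshot(..., dataset=ds, ...)` call in the next hunk consumes.
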
@@ -141,26 +139,31 @@ examples = [
  ) for example in ds
  ]
 
- quantize_config = BaseQuantizeConfig(
-     bits=4,
-     group_size=128,
-     desc_act=True,
-     model_file_base_name="model",
-     damp_percent=0.1,
+ recipe = GPTQModifier(
+     targets="Linear",
+     scheme="W4A16",
+     ignore=["lm_head"],
+     dampening_frac=0.1,
  )
 
- model = AutoGPTQForCausalLM.from_pretrained(
+ model = SparseAutoModelForCausalLM.from_pretrained(
      model_id,
-     quantize_config,
      device_map="auto",
+     trust_remote_code=True,
+ )
+
+ oneshot(
+     model=model,
+     dataset=ds,
+     recipe=recipe,
+     max_seq_length=max_seq_len,
+     num_calibration_samples=num_samples,
  )
 
- model.quantize(examples)
  model.save_pretrained("Phi-3-mini-128k-instruct-quantized.w4a16")
  ```
 
 
-
  ## Evaluation
 
  The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
 