|
---
datasets:
- wikitext
metrics:
- perplexity
---
|
**N**on-**u**niform **GPTQ** (NuGPTQ) combines [GPTQ](https://arxiv.org/abs/2210.17323), [SqueezeLLM](https://arxiv.org/abs/2306.07629), and [output scaling](https://stephenpanaro.com/blog/llm-quantization-for-iphone) into a competitive whole-tensor (no grouping) LLM compression method.
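For intuition, "non-uniform" means each weight tensor is mapped onto 16 arbitrary centroid values (a per-tensor codebook) rather than an evenly spaced grid. Below is a minimal, hypothetical sketch of that clustering step using plain k-means; the actual NuGPTQ procedure additionally applies GPTQ-style error compensation, SqueezeLLM-style sensitivity weighting, and output scaling, none of which are shown here.

```python
import torch

def kmeans_codebook(weight: torch.Tensor, n_centroids: int = 16, iters: int = 20) -> torch.Tensor:
    """Fit a per-tensor codebook with plain (unweighted) k-means. Illustrative only."""
    flat = weight.flatten().float()
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    centroids = torch.quantile(flat, torch.linspace(0, 1, n_centroids))
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        assignments = (flat.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
        # Move each centroid to the mean of the weights assigned to it.
        for c in range(n_centroids):
            mask = assignments == c
            if mask.any():
                centroids[c] = flat[mask].mean()
    return centroids

def fake_quantize(weight: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Snap every weight to its nearest centroid; the result stays in the original dtype."""
    flat = weight.flatten().float()
    assignments = (flat.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
    return centroids[assignments].reshape(weight.shape).to(weight.dtype)

w = torch.randn(1024, 1024, dtype=torch.float16)  # stand-in for one linear layer's weight
w_q = fake_quantize(w, kmeans_codebook(w))
assert w_q.unique().numel() <= 16                 # same property the check at the bottom of this card verifies
```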
|
|
|
Results for Llama-2-7b-hf: |
|
|Method |Wikitext PPL (↓)|Delta |
|--|--|--|
|float16 |8.7071 |0 |
|AWQ |8.9760 |0.2689|
|NuGPTQ (This)|9.2754 |0.5683|
|GPTQ† |9.4686 |0.7615|

<sub>† g128, desc_act=True</sub>
|
|
|
<details> |
|
<summary>perplexity reproduction steps</summary> |
|
|
|
```shell
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install optimum

huggingface-cli login

# Set batch size based on your GPU.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float16" \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=meta-llama/Llama-2-7b-hf,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|8.7071|± |N/A |
# | | |none | 0|byte_perplexity|1.4989|± |N/A |
# | | |none | 0|bits_per_byte |0.5839|± |N/A |

lm_eval --model hf \
    --model_args pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype="float16",use_safetensors=True,trust_remote_code=True \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype=float16,use_safetensors=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.2754|± |N/A |
# | | |none | 0|byte_perplexity|1.5167|± |N/A |
# | | |none | 0|bits_per_byte |0.6009|± |N/A |

pip install auto-gptq
lm_eval --model hf \
    --model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-128g-actorder_True \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-128g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.4686|± |N/A |
# | | |none | 0|byte_perplexity|1.5225|± |N/A |
# | | |none | 0|bits_per_byte |0.6065|± |N/A |

lm_eval --model hf \
    --model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-32g-actorder_True \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-32g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.3801|± |N/A |
# | | |none | 0|byte_perplexity|1.5199|± |N/A |
# | | |none | 0|bits_per_byte |0.6040|± |N/A |

pip install autoawq
lm_eval --model hf \
    --model_args pretrained=TheBloke/Llama-2-7B-AWQ,dtype="float16" \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=TheBloke/Llama-2-7B-AWQ,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|8.9760|± |N/A |
# | | |none | 0|byte_perplexity|1.5074|± |N/A |
# | | |none | 0|bits_per_byte |0.5921|± |N/A |
```
|
|
|
</details> |
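Because the quantization is simulated in float16, the checkpoint loads with the standard transformers API. A minimal, illustrative generation example (it assumes the repo ships a tokenizer, otherwise load one from meta-llama/Llama-2-7b-hf, and that `accelerate` is installed for `device_map="auto"`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "smpanaro/Llama-2-7b-NuGPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```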
|
|
|
|
|
The model is fake quantized, meaning each weight tensor contains at most 16 (2<sup>4</sup>) unique values but is stored in float16. This can be checked as follows:
|
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("smpanaro/Llama-2-7b-NuGPTQ", trust_remote_code=True)

linear_layers = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
count = 0
for key, tensor in model.state_dict().items():
    if "weight" not in key:
        continue
    if any(l in key for l in linear_layers):
        assert tensor.unique().shape[0] <= 16, f"{key} has more than 16 unique values"
        print("✓", end="", flush=True)
        count += 1

print()
# 32 model layers * 7 linear layers
print(f"{count} out of 224 linear layers have at most 16 unique values.")
```
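Keeping at most 16 unique values per tensor is what makes real compression a packing exercise: each weight can be replaced by a 4-bit index into a 16-entry float16 codebook. A rough sketch (not part of this repo) of how such a tensor could be packed and unpacked losslessly:

```python
import math
import torch

def pack_4bit(weight: torch.Tensor):
    """Store a fake-quantized tensor as a 16-entry codebook plus packed 4-bit indices."""
    codebook = weight.unique()  # sorted, <= 16 float16 values
    assert codebook.numel() <= 16
    # Weights match the codebook exactly, so searchsorted recovers each index.
    indices = torch.searchsorted(codebook.float(), weight.flatten().float()).to(torch.uint8)
    if indices.numel() % 2:  # pad to an even count so pairs fill whole bytes
        indices = torch.cat([indices, indices.new_zeros(1)])
    packed = (indices[0::2] << 4) | indices[1::2]  # two 4-bit indices per byte
    return codebook, packed

def unpack_4bit(codebook: torch.Tensor, packed: torch.Tensor, shape) -> torch.Tensor:
    indices = torch.stack([packed >> 4, packed & 0xF], dim=1).flatten()
    return codebook[indices[: math.prod(shape)].long()].reshape(shape)

# Round-trip one of the projection weights checked above (reuses `model` from that snippet).
w = model.state_dict()["model.layers.0.self_attn.q_proj.weight"]
codebook, packed = pack_4bit(w)
assert torch.equal(unpack_4bit(codebook, packed, w.shape), w)
```

For a 4096×4096 projection this is roughly 8 MiB of packed indices plus a 32-byte codebook, versus 32 MiB of raw float16 weights, i.e. about a 4x reduction for the quantized linear layers.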