	---
	datasets:
	- wikitext
	metrics:
	- perplexity
	---
	Non-uniform GPTQ (NuGPTQ) combines [GPTQ](https://arxiv.org/abs/2210.17323), [SqueezeLLM](https://arxiv.org/abs/2306.07629) and [output scaling](https://stephenpanaro.com/blog/llm-quantization-for-iphone) into a competitive whole-tensor (no grouping) LLM compression method.
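
To make "no grouping" concrete, here is a rough back-of-the-envelope sketch of storage cost per weight. The metadata layouts are assumptions for illustration only (one float16 scale plus one 4-bit zero point per group for the grouped case, one 16-entry float16 lookup table per tensor for the whole-tensor case). They are not taken from this model's actual storage format, and any per-channel scales are ignored.

```python
def effective_bits_per_weight(weight_bits, metadata_bits, weights_per_metadata):
    """Average storage cost per weight, including metadata shared by many weights."""
    return weight_bits + metadata_bits / weights_per_metadata

# Assumption: grouped 4-bit quantization keeps one float16 scale and one
# 4-bit zero point for every group of 128 weights.
grouped = effective_bits_per_weight(4, 16 + 4, 128)

# Assumption: a whole-tensor 4-bit scheme keeps a single 16-entry float16
# lookup table for an entire 4096 x 4096 weight matrix (example shape).
whole_tensor = effective_bits_per_weight(4, 16 * 16, 4096 * 4096)

print(f"g128:         {grouped:.4f} bits/weight")
print(f"whole tensor: {whole_tensor:.4f} bits/weight")
```

Under these assumptions the per-group metadata is what pushes grouped schemes above 4 bits per weight, while whole-tensor metadata is negligible.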

	Results for Llama-2-7b-hf:
	|Method       |Wikitext PPL (↓)|Delta |
	|---          |---             |---   |
	|float16      |8.7071          |0     |
	|AWQ          |8.9760          |0.2689|
	|NuGPTQ (This)|9.2754          |0.5683|
	|GPTQ†        |9.4686          |0.7615|
	<sub>† g128, desc_act=True</sub>
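
The Delta column is the word perplexity minus the float16 baseline. A small sketch that recomputes it from the numbers in the table, and also prints the relative increase (a derived figure, not one reported by the author):

```python
# Word perplexities copied from the results table.
baseline = 8.7071  # float16
results = {"AWQ": 8.9760, "NuGPTQ": 9.2754, "GPTQ g128": 9.4686}

deltas = {}
for name, ppl in results.items():
    deltas[name] = round(ppl - baseline, 4)
    relative = 100 * (ppl - baseline) / baseline
    print(f"{name:10s} delta={deltas[name]:.4f} ({relative:.2f}% above float16)")
```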

	<details>
	<summary>perplexity reproduction steps</summary>

	```shell
	git clone https://github.com/EleutherAI/lm-evaluation-harness
	cd lm-evaluation-harness
	pip install -e .
	pip install optimum

	huggingface-cli login

	# Set batch size based on your GPU.
	lm_eval --model hf \
	--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float16" \
	--tasks wikitext \
	--batch_size 1

	# hf (pretrained=meta-llama/Llama-2-7b-hf,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
	# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
	# |--------|------:|------|-----:|---------------|-----:|---|------|
	# |wikitext|      2|none  |     0|word_perplexity|8.7071|±  |N/A   |
	# |        |       |none  |     0|byte_perplexity|1.4989|±  |N/A   |
	# |        |       |none  |     0|bits_per_byte  |0.5839|±  |N/A   |

	lm_eval --model hf \
	--model_args pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype="float16",use_safetensors=True,trust_remote_code=True \
	--tasks wikitext \
	--batch_size 1

	# hf (pretrained=smpanaro/llama-2-7b-nugptq,dtype=float16,use_safetensors=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
	# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
	# |--------|------:|------|-----:|---------------|-----:|---|------|
	# |wikitext|      2|none  |     0|word_perplexity|9.2754|±  |N/A   |
	# |        |       |none  |     0|byte_perplexity|1.5167|±  |N/A   |
	# |        |       |none  |     0|bits_per_byte  |0.6009|±  |N/A   |

	pip install auto-gptq
	lm_eval --model hf \
	--model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-128g-actorder_True \
	--tasks wikitext \
	--batch_size 1

	# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-128g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
	# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
	# |--------|------:|------|-----:|---------------|-----:|---|------|
	# |wikitext|      2|none  |     0|word_perplexity|9.4686|±  |N/A   |
	# |        |       |none  |     0|byte_perplexity|1.5225|±  |N/A   |
	# |        |       |none  |     0|bits_per_byte  |0.6065|±  |N/A   |

	lm_eval --model hf \
	--model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-32g-actorder_True \
	--tasks wikitext \
	--batch_size 1

	# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-32g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
	# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
	# |--------|------:|------|-----:|---------------|-----:|---|------|
	# |wikitext|      2|none  |     0|word_perplexity|9.3801|±  |N/A   |
	# |        |       |none  |     0|byte_perplexity|1.5199|±  |N/A   |
	# |        |       |none  |     0|bits_per_byte  |0.6040|±  |N/A   |

	pip install autoawq
	lm_eval --model hf \
	--model_args pretrained=TheBloke/Llama-2-7B-AWQ,dtype="float16" \
	--tasks wikitext \
	--batch_size 1

	# hf (pretrained=thebloke/llama-2-7b-awq,dtype=float16), gen_kwargs: (none), limit: none, num_fewshot: none, batch_size: 1
	# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
	# |--------|------:|------|-----:|---------------|-----:|---|------|
	# |wikitext|      2|none  |     0|word_perplexity|8.9760|±  |N/A   |
	# |        |       |none  |     0|byte_perplexity|1.5074|±  |N/A   |
	# |        |       |none  |     0|bits_per_byte  |0.5921|±  |N/A   |
	```

	</details>
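
As a quick sanity check on the logged numbers in the reproduction steps, `bits_per_byte` should be the base-2 logarithm of `byte_perplexity`, since both are derived from the same per-byte log-likelihood. The sketch below only re-derives one logged column from the other; it does not run any model.

```python
import math

# (byte_perplexity, bits_per_byte) pairs copied from the lm_eval outputs.
reported = {
    "float16": (1.4989, 0.5839),
    "NuGPTQ": (1.5167, 0.6009),
    "GPTQ g128": (1.5225, 0.6065),
    "GPTQ g32": (1.5199, 0.6040),
    "AWQ": (1.5074, 0.5921),
}

recomputed = {}
for name, (byte_ppl, bits_per_byte) in reported.items():
    recomputed[name] = math.log2(byte_ppl)
    # The logged values are rounded to four decimals, so allow a small tolerance.
    assert abs(recomputed[name] - bits_per_byte) < 1e-3, name
    print(f"{name:10s} log2({byte_ppl}) = {recomputed[name]:.4f} (logged {bits_per_byte})")
```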


	The model is fake-quantized: each linear layer's weight tensor has at most 16 (2<sup>4</sup>) unique values, but those values are stored in float16. The uniqueness can be checked as follows:
	```python
	from transformers import AutoModelForCausalLM

	model = AutoModelForCausalLM.from_pretrained("smpanaro/Llama-2-7b-NuGPTQ", trust_remote_code=True)
	linear_layers = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
	count = 0
	for key, tensor in model.state_dict().items():
	    # Only inspect weight tensors, not biases or buffers.
	    if "weight" not in key:
	        continue
	    if any(name in key for name in linear_layers):
	        assert tensor.unique().shape[0] <= 16, f"{key} has more than 16 unique values"
	        print("✓", end="", flush=True)
	        count += 1

	print()
	# 32 model layers * 7 linear layers
	print(f"{count} out of 224 linear layers have at most 16 unique values.")
	```
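
For readers new to the term, the following is a minimal, dependency-free sketch of what fake quantization with a non-uniform codebook looks like. It is not the NuGPTQ algorithm: plain 1-D k-means is swapped in as the simplest way to pick 16 levels, and the random `weights` list is a hypothetical stand-in for a real weight tensor.

```python
import random

def kmeans_1d(values, k=16, iters=20):
    """Plain Lloyd's k-means on scalars; returns k centroids."""
    lo, hi = min(values), max(values)
    # Start from evenly spaced centroids across the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            sums[j] += v
            counts[j] += 1
        # Keep the old centroid if a cluster ended up empty.
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j] for j in range(k)]
    return centroids

def fake_quantize(values, centroids):
    """Return (indices, values snapped to their nearest centroid)."""
    indices = [min(range(len(centroids)), key=lambda c: abs(v - centroids[c])) for v in values]
    return indices, [centroids[i] for i in indices]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(2048)]  # hypothetical stand-in tensor
lut = kmeans_1d(weights)
indices, fake_quantized = fake_quantize(weights, lut)

# At most 16 distinct values remain, even though each is still stored as a full float.
assert len(set(fake_quantized)) <= 16
# Every index fits in 4 bits, so a packed format could store indices plus the lookup table.
assert all(0 <= i < 16 for i in indices)
print(f"{len(set(fake_quantized))} unique values, max index {max(indices)}")
```

The fake-quantized list plays the role of the float16 weights in this repository, while `indices` and `lut` show what a true 4-bit representation would need to keep.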