HXLee committed
Commit 0b01612
1 Parent(s): 9c1d5dc

update readme.md

Files changed (1)
  1. README.md +104 -3
README.md CHANGED
@@ -1,3 +1,104 @@
- ---
- license: apache-2.0
- ---
 
# AffineQuant Model Zoo

AffineQuant is a quantization method that applies a learnable affine transformation matrix to reshape the distributions of weights and activations, aligning them with the quantization function and thereby reducing quantization error. The affine matrix is optimized to minimize the mean squared error between the pre- and post-quantization feature maps, while the Gradual Mask (GM) method keeps the matrix strictly diagonally dominant during optimization, ensuring its invertibility and stable convergence. Experimental results show that AffineQuant outperforms existing quantization methods such as OmniQuant and SmoothQuant, with consistent improvements across quantization configurations and datasets.
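
Below is a minimal, illustrative PyTorch sketch of the core idea (not the official implementation; the gradual-mask schedule here is a deliberately simplified stand-in). For activations X, weight W, and affine matrix A, the pair (X A⁻¹, A W) computes the same output as (X, W), so A only changes how much error quantization introduces, and damping the off-diagonal entries keeps A diagonally dominant and therefore invertible.

```python
import torch

def gradual_mask(dim, alpha):
    # Simplified stand-in for the Gradual Mask: scale off-diagonal entries by
    # alpha in [0, 1] so the matrix stays strictly diagonally dominant early on.
    mask = torch.full((dim, dim), alpha)
    mask.fill_diagonal_(1.0)
    return mask

def fake_quant(t, n_bits=4):
    # Simple symmetric per-tensor fake quantization, for illustration only.
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max() / qmax
    return torch.round(t / scale).clamp(-qmax - 1, qmax) * scale

dim, out_dim = 8, 16
X, W = torch.randn(4, dim), torch.randn(dim, out_dim)

# Learnable affine matrix, initialized near the identity and masked.
A = torch.eye(dim) + 0.01 * torch.randn(dim, dim)
A = A * gradual_mask(dim, alpha=0.3)

# (X A^-1)(A W) == X W, so only the quantization error changes, not the function.
X_t = X @ torch.linalg.inv(A)
W_t = A @ W
mse = torch.mean((fake_quant(X_t) @ fake_quant(W_t) - X @ W) ** 2)
print(mse)
```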

Code: [https://github.com/bytedance/AffineQuant](https://github.com/bytedance/AffineQuant)

Paper: [https://arxiv.org/abs/2403.12544](https://arxiv.org/abs/2403.12544)

## How to use

This repository contains models quantized under various configurations. The covered model families are OPT and LLaMA-1/2.

### Fake Quantization Accuracy

To reproduce the accuracy reported in the paper, load the fake-quantized model with the ```--model``` parameter and set both bit-width parameters to 16 (```--wbits 16 --abits 16```) so that the quantization step is skipped. For example:

```
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /path/to/llama-13b-w2a16g128 --eval_ppl \
--output_dir ./log/llama-13b-w2a16g128 \
--wbits 16 --abits 16
```

Note that if your quantized model was trained with the ```--let``` parameter, you need to enable the bias of the layernorm layers and of specific linear layers in the transformers library so that the shift parameters can be loaded. For the LLaMA model, for instance, we make the following modifications in ```modeling_llama.py```:

1. Set ```bias=True``` for the q, k, v, o, up, and gate projection layers.

```
# In modeling_llama.py: enable bias on these projections so that the learned
# shift parameters from --let can be loaded.
self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=True)
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=True)
```

2. Enable the bias in RMSNorm. We directly replace the original RMSNorm with ```AffineLlamaRMSNorm``` from AffineQuant; a minimal sketch of such a bias-enabled RMSNorm is shown below.
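
The following is only an illustrative sketch of a bias-enabled RMSNorm (the actual ```AffineLlamaRMSNorm``` lives in the AffineQuant repository); the point is the extra learnable bias term added to the standard LLaMA RMSNorm:

```python
import torch
import torch.nn as nn

class BiasedRMSNorm(nn.Module):
    # Hypothetical stand-in for AffineLlamaRMSNorm: a vanilla RMSNorm plus a
    # learnable bias so the shift parameters can be absorbed.
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states + self.bias
```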

## Inference Overhead

To reproduce the accuracy reported in the paper, our weight-only quantization configurations place no restrictions on the affine matrices after layernorm. For weight-activation configurations such as 4/4-bit, we only update the diagonal elements of the affine matrices after layernorm. As a result, inference with the merged parameters incurs no additional overhead.
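
Since the affine matrix after layernorm is diagonal in the weight-activation setting, it can be folded offline into the layernorm scale and the following linear weight. A toy sketch of this merging (illustrative only, with made-up shapes):

```python
import torch

d_in, d_out = 8, 16
diag = torch.rand(d_in) + 0.5        # diagonal of the affine matrix
norm_w = torch.rand(d_in)            # layernorm / RMSNorm scale
W = torch.randn(d_out, d_in)         # weight of the following linear layer
x = torch.randn(2, d_in)             # normalized activation (toy input)

# Original computation: scale by the norm weight, then apply the linear layer.
y_orig = (x * norm_w) @ W.t()

# Merged computation: fold 1/diag into the norm weight and diag into the linear
# weight once, offline. At inference time the ops are exactly the same as before.
norm_w_merged = norm_w / diag
W_merged = W * diag
y_merged = (x * norm_w_merged) @ W_merged.t()

assert torch.allclose(y_orig, y_merged, atol=1e-5)
```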

## Benchmarks

We evaluate the 4/4-bit quantization performance of LLaMA-7B/13B/30B on six zero-shot datasets in the following table.

| Model, Method | PIQA($\uparrow$) | ARC-e($\uparrow$) | WinoGrande($\uparrow$) | BoolQ($\uparrow$) | ARC-c($\uparrow$) | HellaSwag($\uparrow$) | Avg.($\uparrow$) |
| ---------------------- | ---------------- | ----------------- | ---------------------- | ----------------- | ----------------- | --------------------- | ---------------- |
| LLaMA-7B, OmniQuant | 66.15 | 45.20 | 53.43 | 63.51 | 31.14 | 56.44 | 52.65 |
| LLaMA-7B, AffineQuant | 69.37 | 42.55 | 55.33 | 63.73 | 31.91 | 57.65 | 53.42 |
| LLaMA-13B, OmniQuant | 69.69 | 47.39 | 55.80 | 62.84 | 33.10 | 58.96 | 54.37 |
| LLaMA-13B, AffineQuant | 66.32 | 43.90 | 54.70 | 64.10 | 29.61 | 56.88 | 52.58 |
| LLaMA-30B, OmniQuant | 71.21 | 49.45 | 59.19 | 65.33 | 34.47 | 64.65 | 56.63 |
| LLaMA-30B, AffineQuant | 70.84 | 49.41 | 58.64 | 70.12 | 37.12 | 65.53 | 58.61 |

Meanwhile, we compare the WikiText2 and C4 perplexity (lower is better) of 4/4-bit quantized LLaMA-1 and LLaMA-2 models in the following table.

| Model | Method | WikiText2 | C4 |
| ---------- | ----------- | --------- | ----- |
| LLaMA-7B | OmniQuant | 11.26 | 14.51 |
| | AffineQuant | 10.28 | 13.64 |
| LLaMA-13B | OmniQuant | 10.87 | 13.78 |
| | AffineQuant | 10.32 | 13.44 |
| LLaMA-30B | OmniQuant | 10.33 | 12.49 |
| | AffineQuant | 9.35 | 11.58 |
| LLaMA2-7B | OmniQuant | 14.26 | 18.02 |
| | AffineQuant | 12.69 | 15.76 |
| LLaMA2-13B | OmniQuant | 12.30 | 14.55 |
| | AffineQuant | 11.45 | 13.97 |

## Related Projects

[SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://github.com/mit-han-lab/smoothquant)

[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://github.com/mit-han-lab/llm-awq)

[GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers](https://github.com/IST-DASLab/gptq)

[RPTQ: Reorder-Based Post-Training Quantization for Large Language Models](https://github.com/hahnyuan/RPTQ4LLM)

[OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models](https://github.com/OpenGVLab/OmniQuant)

[MLC LLM](https://github.com/mlc-ai/mlc-llm)

[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)

## Citation

```latex
@inproceedings{ma2024affinequant,
  title={AffineQuant: Affine Transformation Quantization for Large Language Models},
  author={Yuexiao Ma and Huixia Li and Xiawu Zheng and Feng Ling and Xuefeng Xiao and Rui Wang and Shilei Wen and Fei Chao and Rongrong Ji},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=of2rhALq8l}
}
```