# VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

## TL;DR

**Vector Post-Training Quantization (VPTQ)** is a novel Post-Training Quantization method that leverages **Vector Quantization** to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit).
VPTQ can compress 70B, and even 405B, models to 1-2 bits without retraining while maintaining high accuracy.

* Better accuracy at 1-2 bits
* Lightweight quantization algorithm: quantizing the 405B Llama-3.1 model takes only ~17 hours
* Agile quantization inference: low decode overhead, best throughput, and TTFT (time to first token)

## Details and [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extreme bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
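
To make the lookup-table idea concrete, here is a minimal NumPy sketch of plain vector quantization (illustrative only, not VPTQ's actual algorithm; a real quantizer would learn the codebook, e.g. via k-means, rather than use random centroids):

```python
import numpy as np

# Toy example: replace length-8 weight vectors with 8-bit indices into a codebook.
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 8))   # 1024 weight vectors of dimension 8
codebook = rng.standard_normal((256, 8))   # 256 centroids -> indices fit in uint8

# Assign each weight vector to its nearest centroid (squared Euclidean distance).
dists = ((weights[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = dists.argmin(axis=1).astype(np.uint8)   # this is what gets stored

# Dequantization at inference time is a simple table lookup.
dequantized = codebook[indices]

# 8 index bits per 8 weights = 1 bit per weight, plus the shared codebook.
print(indices.nbytes, "bytes of indices vs", weights.nbytes, "bytes of float64 weights")
```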

## Early Results from the Tech Report

VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve even better outcomes with properly tuned parameters, especially in terms of model accuracy and inference speed.

<img src="assets/vptq.png" width="500">

| Model       | bitwidth | W2↓  | C4↓  | AvgQA↑ | tok/s↑ | mem (GB)↓ | cost (h)↓ |
| ----------- | -------- | ---- | ---- | ------ | ------ | --------- | --------- |
| LLaMA-2 7B  | 2.02     | 6.13 | 8.07 | 58.2   | 39.9   | 2.28      | 2         |
|             | 2.26     | 5.95 | 7.87 | 59.4   | 35.7   | 2.48      | 3.1       |
| LLaMA-2 13B | 2.02     | 5.32 | 7.15 | 62.4   | 26.9   | 4.03      | 3.2       |
|             | 2.18     | 5.28 | 7.04 | 63.1   | 18.5   | 4.31      | 3.6       |
| LLaMA-2 70B | 2.07     | 3.93 | 5.72 | 68.6   | 9.7    | 19.54     | 19        |
|             | 2.11     | 3.92 | 5.71 | 68.7   | 9.7    | 20.01     | 19        |

W2 and C4 are perplexity on WikiText-2 and C4 (lower is better); AvgQA is average accuracy over QA benchmarks; tok/s is decoding throughput; mem is memory footprint; cost is quantization time in hours.
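
As a rough sanity check on the memory column (my own back-of-envelope arithmetic, not from the report), weight storage at the quoted effective bit-widths accounts for most of the reported footprint; the remainder is plausibly unquantized layers and runtime buffers:

```python
# params (billions), effective bits per weight, reported memory (GB)
for params_b, bits, reported_gb in [(7, 2.26, 2.48), (13, 2.18, 4.31), (70, 2.11, 20.01)]:
    weight_gb = params_b * bits / 8   # 1e9 params per billion cancels 1e9 bytes per GB
    print(f"{params_b}B @ {bits} bits ~ {weight_gb:.2f} GB of weights (reported {reported_gb} GB)")
```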

## Install and Evaluation

### Dependencies

- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
- accelerate >= 0.33.0
- datasets (latest)
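
The dependencies can be installed ahead of time with pip, for example (a sketch; a CUDA-enabled torch build may require a platform-specific index URL):

```bash
pip install "torch>=2.2.0" "transformers>=4.44.0" "accelerate>=0.33.0"
pip install --upgrade datasets
```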

### Installation

> Preparation steps that might be needed: set up the CUDA PATH.

```bash
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust to your environment
```

```bash
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
```
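
A quick, optional sanity check that the package imports correctly:

```bash
python -c "import vptq; print(vptq.__name__)"
```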

### Language Generation

To generate text using a pre-trained model, you can use the following command:

```bash
python -m vptq --model=LLaMa-2-7b-1.5bi-vptq --prompt="Do Not Go Gentle into That Good Night"
```

Launching a chatbot (note that you must use a chat model for this to work):

```bash
python -m vptq --model=LLaMa-2-7b-chat-1.5b-vptq --chat
```

Using the Python API:

```python
import vptq
import transformers

# Load the tokenizer and the VPTQ-quantized model.
tokenizer = transformers.AutoTokenizer.from_pretrained("LLaMa-2-7b-1.5bi-vptq")
m = vptq.AutoModelForCausalLM.from_pretrained("LLaMa-2-7b-1.5bi-vptq", device_map='auto')

inputs = tokenizer("Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)  # 2 is Llama-2's eos token id
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

### Gradio app example

An environment variable controls whether a public share link is created: `export SHARE_LINK=1`

```bash
python -m vptq.app
```

## Road Map

- [ ] Merge the quantization algorithm into the public repository.
- [ ] Submit the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp).
- [ ] Improve the implementation of the inference kernel.
- [ ] **TBC**

## Project main members

* Yifei Liu (@lyf-00)
* Jicheng Wen (@wejoncy)
* Yang Wang (@YangWang92)

## Acknowledgement

* We thank **James Hensman** for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
* We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

## Publication

EMNLP 2024 Main Conference

```bibtex
@inproceedings{vptq,
  title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
  author={Yifei Liu and
          Jicheng Wen and
          Yang Wang and
          Shengyu Ye and
          Li Lyna Zhang and
          Ting Cao and
          Cheng Li and
          Mao Yang},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```

## Limitations of VPTQ

* ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before using it in production.
* ⚠️ This repository provides only the model quantization algorithm. The open-source community may publish models based on the technical report and quantization algorithm, but this project cannot guarantee the performance of those models.
* ⚠️ We have not tested all potential applications and domains, and we cannot guarantee VPTQ's accuracy and effectiveness across other tasks or scenarios.
* ⚠️ Our tests are all based on English text; other languages are not covered by the current testing.

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.