OpenSourceRonin committed
Commit 48ee084 · verified · Parent(s): 6fad698

Update README.md

Files changed (1): README.md (+145 -1)

pinned: false
---

# VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

## TL;DR

**Vector Post-Training Quantization (VPTQ)** is a novel Post-Training Quantization method that leverages **Vector Quantization** to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit).
VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.

* Better accuracy at 1-2 bits (see the bit-width sketch below)
* Lightweight quantization algorithm: only ~17 hours to quantize the 405B Llama-3.1 model
* Agile quantized inference: low decode overhead, high throughput, and low time to first token (TTFT)
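
A rough illustration of why vector quantization can dip below 2 bits per weight. The parameters here (`v`, `k`) are made-up toy values, not VPTQ's actual configuration: each group of `v` weights is replaced by a log2(k)-bit index into a `k`-entry codebook, so the per-weight cost is log2(k)/v plus the amortized codebook itself.

```python
import math

def effective_bits_per_weight(v: int, k: int, num_weights: int) -> float:
    """Amortized storage cost per weight for plain vector quantization."""
    index_bits = math.log2(k) / v                # one log2(k)-bit index per v weights
    codebook_bits = k * v * 16 / num_weights     # fp16 codebook, shared by all weights
    return index_bits + codebook_bits

# Toy values: vectors of 8 weights, a 4096-entry codebook, one 4096x4096 layer.
print(effective_bits_per_weight(v=8, k=4096, num_weights=4096 * 4096))
# ~1.53 bits/weight -- below 2 bits while the centroids stay full precision
```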

## Details and [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables, as sketched below.
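
A minimal sketch of that lookup-table idea, using plain k-means as the codebook learner; this is an illustration only, not VPTQ's actual quantization algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in weight matrix

v, k = 8, 256                 # toy vector length and codebook size
vectors = W.reshape(-1, v)    # group weights into length-v vectors

# Initialize the codebook from random vectors, then run a few k-means rounds.
codebook = vectors[rng.choice(len(vectors), k, replace=False)].copy()
for _ in range(10):
    # Squared Euclidean distance of every vector to every centroid.
    d = (vectors**2).sum(1, keepdims=True) - 2 * vectors @ codebook.T + (codebook**2).sum(1)
    idx = d.argmin(1)         # each vector is stored as one small index
    for c in range(k):
        members = vectors[idx == c]
        if len(members):
            codebook[c] = members.mean(0)

# Dequantization is just a table lookup per index.
W_hat = codebook[idx].reshape(W.shape)
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```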

## Early Results from Tech Report

VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed.

<img src="assets/vptq.png" width="500">

| Model       | Bitwidth | WikiText-2 PPL↓ | C4 PPL↓ | AvgQA↑ | tok/s↑ | mem (GB) | Quant cost (h)↓ |
| ----------- | -------- | --------------- | ------- | ------ | ------ | -------- | --------------- |
| LLaMA-2 7B  | 2.02     | 6.13            | 8.07    | 58.2   | 39.9   | 2.28     | 2               |
|             | 2.26     | 5.95            | 7.87    | 59.4   | 35.7   | 2.48     | 3.1             |
| LLaMA-2 13B | 2.02     | 5.32            | 7.15    | 62.4   | 26.9   | 4.03     | 3.2             |
|             | 2.18     | 5.28            | 7.04    | 63.1   | 18.5   | 4.31     | 3.6             |
| LLaMA-2 70B | 2.07     | 3.93            | 5.72    | 68.6   | 9.7    | 19.54    | 19              |
|             | 2.11     | 3.92            | 5.71    | 68.7   | 9.7    | 20.01    | 19              |
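
A quick sanity check on the mem (GB) column, under stated assumptions (approximate parameter counts; index storage only): the quantized indices alone need about params × bits / 8 bytes, and the reported figures sit somewhat above that floor, plausibly due to codebooks, fp16 embeddings, and runtime buffers.

```python
# Assumed parameter counts -- illustrative, not the exact model sizes.
for name, params, bits, reported_gb in [
    ("LLaMA-2 7B",  6.7e9, 2.02,  2.28),
    ("LLaMA-2 70B", 69e9,  2.07, 19.54),
]:
    floor_gib = params * bits / 8 / 2**30   # index storage alone
    print(f"{name}: ~{floor_gib:.2f} GiB floor vs {reported_gb} GB reported")
```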

## Install and Evaluation

### Dependencies

The following are required; a pip one-liner covering them is sketched after the list.

- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
- accelerate >= 0.33.0
- datasets (latest)
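
One way to satisfy these version floors, assuming a CUDA-enabled torch wheel is appropriate for your system:

```bash
pip install "torch>=2.2.0" "transformers>=4.44.0" "accelerate>=0.33.0" datasets
```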

### Installation

> Preparation step that might be needed: set up the CUDA PATH.

```bash
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust to your environment
```

```bash
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
```
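
To confirm the toolchain and the install, a quick check (the `nvcc` line assumes the CUDA toolkit is on the PATH set above):

```bash
nvcc --version           # CUDA compiler should now resolve
python -c "import vptq"  # succeeds once the package and its extension are installed
```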

### Language Generation

To generate text using a pre-trained quantized model, you can use the following command:

```bash
python -m vptq --model=LLaMa-2-7b-1.5bi-vptq --prompt="Do Not Go Gentle into That Good Night"
```

Launching a chatbot (note that you must use a chat model for this to work):

```bash
python -m vptq --model=LLaMa-2-7b-chat-1.5b-vptq --chat
```

Using the Python API:

```python
import vptq
import transformers

# Load the tokenizer and the VPTQ-quantized model.
tokenizer = transformers.AutoTokenizer.from_pretrained("LLaMa-2-7b-1.5bi-vptq")
m = vptq.AutoModelForCausalLM.from_pretrained("LLaMa-2-7b-1.5bi-vptq", device_map='auto')

# Tokenize the prompt, generate up to 100 new tokens, and decode the result.
inputs = tokenizer("Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
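
To reproduce a number comparable to the mem (GB) column above, one option is PyTorch's allocator statistics; the calls below are standard PyTorch, while `m` and `inputs` come from the snippet above:

```python
import torch

torch.cuda.reset_peak_memory_stats()
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
peak_gib = torch.cuda.max_memory_allocated() / 2**30  # peak allocated bytes -> GiB
print(f"peak CUDA memory: {peak_gib:.2f} GiB")
```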

### Gradio app example

An environment variable controls whether a public share link is created:
`export SHARE_LINK=1`

```bash
python -m vptq.app
```

## Road Map

- [ ] Merge the quantization algorithm into the public repository.
- [ ] Submit the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp).
- [ ] Improve the implementation of the inference kernel.
- [ ] **TBC**

## Project main members

* Yifei Liu (@lyf-00)
* Jicheng Wen (@wejoncy)
* Yang Wang (@YangWang92)

## Acknowledgement

* We thank **James Hensman** for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
* We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

## Publication

EMNLP 2024 Main

```bibtex
@inproceedings{vptq,
  title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
  author={Yifei Liu and Jicheng Wen and Yang Wang and Shengyu Ye and Li Lyna Zhang and Ting Cao and Cheng Li and Mao Yang},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024},
}
```

## Limitations of VPTQ

* ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before using it in production.
* ⚠️ This repository provides only the model quantization algorithm. The open-source community may publish models based on the technical report and quantization algorithm, but this project cannot guarantee the performance of those models.
* ⚠️ VPTQ has not been tested on all potential applications and domains, and we cannot guarantee its accuracy and effectiveness across other tasks or scenarios.
* ⚠️ Our tests are all based on English texts; other languages are not included in the current testing.

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.