OpenSourceRonin committed
Commit 3b6eb87 • 1 Parent(s): 48ee084
Update README.md
README.md CHANGED
@@ -18,6 +18,18 @@ VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and m
* Lightweight Quantization Algorithm: costs only ~17 hours to quantize the 405B Llama-3.1
* Agile Quantization Inference: low decode overhead, best throughput, and TTFT (time to first token)

+[**arXiv**](https://arxiv.org/abs/2409.17066): https://arxiv.org/abs/2409.17066
+
+[**Models from Community**](https://huggingface.co/VPTQ-community): https://huggingface.co/VPTQ-community
+
+[**Github**](https://github.com/microsoft/vptq): https://github.com/microsoft/vptq
+
+Prompt example: Llama 3.1 70B on RTX 4090 (24 GB @ 2-bit)
+![image/gif](https://cdn-uploads.huggingface.co/production/uploads/66a73179315d9b5c32e06967/lTfvSARTs9YfCkpEe3Sxc.gif)
+
+Chat example: Llama 3.1 70B on RTX 4090 (24 GB @ 2-bit)
+![image/gif](https://cdn-uploads.huggingface.co/production/uploads/66a73179315d9b5c32e06967/QZeqC_EhZVwozEV_WcFtV.gif)
+
## Details and [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to the limits of numerical representation, traditional scalar-based weight quantization struggles to reach such extreme low-bit levels. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing weight vectors into indices using lookup tables.
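
To make the lookup-table idea concrete, the sketch below shows a bare-bones vector-quantization round trip: a weight matrix is split into short vectors, each vector is replaced by the index of its nearest codebook centroid, and the weights are reconstructed by a table lookup. This is only an illustration of the idea, not the VPTQ algorithm; the shapes, codebook size, and codebook construction are arbitrary assumptions.

```python
# Illustrative vector-quantization round trip (NOT the VPTQ algorithm):
# store one small integer index per group of weights, reconstruct by lookup.
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, v = 128, 256, 8      # assumed weight shape and vector length
k = 256                           # assumed codebook size -> uint8 indices

W = rng.standard_normal((d_out, d_in)).astype(np.float32)
vectors = W.reshape(-1, v)        # (d_out * d_in // v, v) short weight vectors

# Toy codebook: a random sample of the vectors themselves. A real quantizer
# (k-means, or VPTQ's own optimization) would fit centroids to minimize error.
codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]

# Quantize: nearest-centroid index for every vector -- this is all that is stored.
dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = dists.argmin(axis=1).astype(np.uint8)

# Dequantize: a pure lookup-table read, as described in the paragraph above.
W_hat = codebook[indices].reshape(d_out, d_in)

bits_per_weight = 8 / v           # one 8-bit index covers v = 8 weights
print(f"{bits_per_weight:.0f} bit/weight, MSE = {np.mean((W - W_hat) ** 2):.4f}")
```

An 8-bit index shared across 8 weights already lands at 1 bit per weight on paper; the hard part, which the tech report above addresses, is choosing the vectors, codebooks, and indices so that model accuracy survives at such compression ratios.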
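
For reproducing demos like the prompt and chat examples above, a minimal Python sketch follows. It assumes the `vptq` package is installed and uses a model ID in the style of the VPTQ-community page; the exact checkpoint name, the `vptq.AutoModelForCausalLM` entry point, and the generation settings should be verified against the GitHub README before use.

```python
# Hedged sketch: run a prompt against a VPTQ-compressed Llama 3.1 70B checkpoint.
# The model ID below follows the VPTQ-community naming pattern and is an assumption;
# check https://huggingface.co/VPTQ-community for the exact 2-bit variant.
import transformers
import vptq

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain: Do not go gentle into that good night"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The GitHub README also documents a command-line entry point for the same prompt and chat demos.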