OpenSourceRonin committed on
Commit
3b6eb87
1 Parent(s): 48ee084

Update README.md

Files changed (1)
  1. README.md +12 -0
README.md CHANGED
@@ -18,6 +18,18 @@ VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and m
  * Lightweight Quantization Algorithm: only cost ~17 hours to quantize 405B Llama-3.1
  * Agile Quantization Inference: low decode overhead, best throughput, and TTFT

+ [**arXiv**](https://arxiv.org/abs/2409.17066): https://arxiv.org/abs/2409.17066
+
+ [**Models from Community**](https://huggingface.co/VPTQ-community): https://huggingface.co/VPTQ-community
+
+ [**GitHub**](https://github.com/microsoft/vptq): https://github.com/microsoft/vptq
+
+ Prompt example: Llama 3.1 70B on RTX4090 (24 GB@2bit)
+ ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/66a73179315d9b5c32e06967/lTfvSARTs9YfCkpEe3Sxc.gif)
+
+ Chat example: Llama 3.1 70B on RTX4090 (24 GB@2bit)
+ ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/66a73179315d9b5c32e06967/QZeqC_EhZVwozEV_WcFtV.gif)
+
  ## Details and [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

  Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
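
The paragraph above describes VQ as compressing groups of weights into indices that address a lookup table. Below is a minimal NumPy sketch of that idea, not VPTQ's actual algorithm: all sizes (a 256x256 weight matrix, length-8 vectors, a 256-entry codebook) are illustrative assumptions, and the codebook here is chosen randomly where a real quantizer such as VPTQ optimizes it (e.g., with k-means, second-order information, and residual codebooks).

```python
import numpy as np

# Toy vector quantization of a weight matrix: a sketch of the lookup-table idea,
# not the VPTQ implementation. Sizes below are illustrative assumptions only.
rng = np.random.default_rng(0)
out_dim, in_dim, vec_len, k = 256, 256, 8, 256   # weight shape, vector length, codebook size

W = rng.standard_normal((out_dim, in_dim)).astype(np.float32)
vectors = W.reshape(-1, vec_len)                  # (8192, 8): group weights into length-8 vectors

# "Codebook" (lookup table): here just k randomly chosen vectors;
# a real quantizer would optimize these centroids.
codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]   # (256, 8)

# Assign each weight vector to its nearest centroid -> the stored indices.
dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # (8192, 256)
indices = dists.argmin(axis=1).astype(np.uint8)   # one 8-bit index per 8 weights

# Inference-time dequantization is just a table lookup.
W_hat = codebook[indices].reshape(out_dim, in_dim)

orig_bytes = W.size * 2                           # as if stored in fp16
quant_bytes = indices.size + codebook.size * 2    # uint8 indices + fp16 codebook
print(f"~{orig_bytes / quant_bytes:.1f}x smaller, MSE {np.mean((W - W_hat) ** 2):.4f}")
```

Storing one 8-bit index per 8 weights is what pushes the effective bit-width toward the 1-2 bit range quoted above, at the cost of a small shared codebook; the reconstruction error is what the actual quantization algorithm works to minimize.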