OpenSourceRonin committed
Commit 3b6eb87 • 1 Parent(s): 48ee084
Update README.md
README.md CHANGED
@@ -18,6 +18,18 @@ VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and m
* Lightweight Quantization Algorithm: costs only ~17 hours to quantize the 405B Llama-3.1
* Agile Quantization Inference: low decode overhead, best throughput, and TTFT (time to first token)

+[**arXiv**](https://arxiv.org/abs/2409.17066): https://arxiv.org/abs/2409.17066
+
+[**Models from Community**](https://huggingface.co/VPTQ-community): https://huggingface.co/VPTQ-community
+
+[**Github**](https://github.com/microsoft/vptq): https://github.com/microsoft/vptq
+
+Prompt example: Llama 3.1 70B on RTX 4090 (24 GB @ 2-bit)
+![image/gif](https://cdn-uploads.huggingface.co/production/uploads/66a73179315d9b5c32e06967/lTfvSARTs9YfCkpEe3Sxc.gif)
+
+Chat example: Llama 3.1 70B on RTX 4090 (24 GB @ 2-bit)
+![image/gif](https://cdn-uploads.huggingface.co/production/uploads/66a73179315d9b5c32e06967/QZeqC_EhZVwozEV_WcFtV.gif)
+
## Details and [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to the limits of numerical representation, traditional scalar-based weight quantization struggles to reach such extreme low-bit levels. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing weight vectors into indices using lookup tables.
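
To make the lookup-table idea concrete, the sketch below shows a bare-bones vector-quantization round trip: a weight matrix is split into short vectors, each vector is replaced by the index of its nearest codebook centroid, and the weights are reconstructed by a table lookup. This is only an illustration of the idea, not the VPTQ algorithm; the shapes, codebook size, and codebook construction are arbitrary assumptions.

```python
# Illustrative vector-quantization round trip (NOT the VPTQ algorithm):
# store one small integer index per group of weights, reconstruct by lookup.
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, v = 128, 256, 8      # assumed weight shape and vector length
k = 256                           # assumed codebook size -> uint8 indices

W = rng.standard_normal((d_out, d_in)).astype(np.float32)
vectors = W.reshape(-1, v)        # (d_out * d_in // v, v) short weight vectors

# Toy codebook: a random sample of the vectors themselves. A real quantizer
# (k-means, or VPTQ's own optimization) would fit centroids to minimize error.
codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]

# Quantize: nearest-centroid index for every vector -- this is all that is stored.
dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = dists.argmin(axis=1).astype(np.uint8)

# Dequantize: a pure lookup-table read, as described in the paragraph above.
W_hat = codebook[indices].reshape(d_out, d_in)

bits_per_weight = 8 / v           # one 8-bit index covers v = 8 weights
print(f"{bits_per_weight:.0f} bit/weight, MSE = {np.mean((W - W_hat) ** 2):.4f}")
```

An 8-bit index shared across 8 weights already lands at 1 bit per weight on paper; the hard part, which the tech report above addresses, is choosing the vectors, codebooks, and indices so that model accuracy survives at such compression ratios.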
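
For reproducing demos like the prompt and chat examples above, a minimal Python sketch follows. It assumes the `vptq` package is installed and uses a model ID in the style of the VPTQ-community page; the exact checkpoint name, the `vptq.AutoModelForCausalLM` entry point, and the generation settings should be verified against the GitHub README before use.

```python
# Hedged sketch: run a prompt against a VPTQ-compressed Llama 3.1 70B checkpoint.
# The model ID below follows the VPTQ-community naming pattern and is an assumption;
# check https://huggingface.co/VPTQ-community for the exact 2-bit variant.
import transformers
import vptq

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain: Do not go gentle into that good night"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The GitHub README also documents a command-line entry point for the same prompt and chat demos.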