|
---
license: other
---

**SqueezeLLM** is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.
|
|
|
**TLDR:** Deploying LLMs is difficult due to their large memory footprint. This can be addressed with reduced-precision quantization, but naive quantization methods hurt model performance. We address this with a new Dense-and-Sparse Quantization method. Dense-and-Sparse decomposition splits each weight matrix into two components: a dense component that can be heavily quantized without degrading model performance, and a sparse component that preserves the sensitive and outlier entries of the weight matrix in full precision. With this approach, we are able to serve larger models with a smaller memory footprint, the same latency, and yet higher accuracy and quality.
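Below is a minimal sketch of the dense-and-sparse decomposition in PyTorch. It is illustrative only: the magnitude-based percentile criterion and the `outlier_pct` parameter are assumptions for this example (the paper also promotes *sensitive* values, not just large-magnitude outliers, into the sparse component), and this is not the repository's actual implementation.

```python
import torch

def dense_and_sparse_split(W: torch.Tensor, outlier_pct: float = 0.45):
    """Split a weight matrix into a dense part (to be aggressively
    quantized) and a sparse part holding full-precision outliers.

    `outlier_pct` is the percentage of entries kept sparse -- an
    illustrative knob; the sparse fraction is kept well under 1%.
    """
    # Keep the top `outlier_pct` percent of entries by magnitude.
    k = max(1, int(W.numel() * outlier_pct / 100))
    threshold = W.abs().flatten().topk(k).values.min()
    outlier_mask = W.abs() >= threshold

    sparse = (W * outlier_mask).to_sparse()  # stays in full precision
    dense = W * ~outlier_mask                # goes through low-bit quantization
    return dense, sparse

# At inference, W @ x is recovered as dense_q @ x + sparse @ x,
# where dense_q is the quantized dense component.
```

Note that this particular checkpoint is dense-only (0% sparsity, see below), so its sparse component is empty.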
|
For more details, please check out our [paper](https://arxiv.org/pdf/2306.07629.pdf).
|
|
|
|
|
## Model description |
|
|
|
3-bit XGen-7B instruction-tuned model (i.e., a model finetuned on public-domain instructional data) with an 8K sequence length, quantized using SqueezeLLM. More details on the quantization method can be found in the [paper](https://arxiv.org/pdf/2306.07629.pdf), and a more detailed description of the base model can be found on its [model page](https://huggingface.co/Salesforce/xgen-7b-8k-inst). A sketch of the underlying lookup-table quantization follows the summary below.
|
|
|
|
|
* **Base Model:** [XGen-7B-8K-Inst](https://huggingface.co/Salesforce/xgen-7b-8k-inst) (by Salesforce AI Research) |
|
* **Bitwidth:** 3-bit |
|
* **Sparsity Level:** 0% (dense-only) |
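Since this checkpoint is dense-only, all of its compression comes from the dense component's non-uniform 3-bit quantization: the weights in each channel are clustered into 8 centroids by sensitivity-weighted k-means, and each weight is stored as a 3-bit index into that per-channel lookup table. The sketch below illustrates the idea with plain, unweighted k-means to stay self-contained; the function names and the centroid initialization are assumptions for this example, not the SqueezeLLM API.

```python
import torch

def fit_3bit_lut(w: torch.Tensor, n_iters: int = 10):
    """Toy per-channel 3-bit non-uniform quantization: fit 8 centroids
    with plain k-means (SqueezeLLM weights the k-means objective by a
    Fisher-information-based sensitivity measure, omitted here), then
    store each weight as a 3-bit index into the centroid table."""
    centroids = torch.linspace(w.min().item(), w.max().item(), 8)
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every weight.
        codes = (w[:, None] - centroids[None, :]).abs().argmin(dim=1)
        # Update step: move each centroid to the mean of its cluster.
        for c in range(8):
            if (codes == c).any():
                centroids[c] = w[codes == c].mean()
    return codes.to(torch.uint8), centroids  # 3-bit codes + lookup table

def dequantize(codes: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    # Dequantization is just a table lookup: code -> centroid value.
    return centroids[codes.long()]
```

In the actual serving kernels, the packed 3-bit codes are looked up inside the low-bit matrix-vector multiplication, so the full-precision weight matrix is never materialized.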
|
|
|
## Links |
|
|
|
* **Paper**: [https://arxiv.org/pdf/2306.07629.pdf](https://arxiv.org/pdf/2306.07629.pdf) |
|
* **Code**: [https://github.com/SqueezeAILab/SqueezeLLM](https://github.com/SqueezeAILab/SqueezeLLM) |
|
|
|
|
|
|
|