---
license: other
---

**SqueezeLLM** is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.

**TLDR:** Deploying LLMs is difficult because of their large memory footprint. Reduced-precision quantization addresses this, but naive quantization hurts model performance. We address this with a new Dense-and-Sparse Quantization method, which splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive and outlier weights. With this approach, we are able to serve larger models with a smaller memory footprint, the same latency, and higher accuracy and quality. For more details, please check out our [paper](https://arxiv.org/pdf/2306.07629.pdf).

## Model description

3-bit XGen-7B instruction-tuned model (i.e., fine-tuned on public-domain instructional data) with 8K sequence length, quantized using SqueezeLLM. More details on the quantization method can be found in the [paper](https://arxiv.org/pdf/2306.07629.pdf). A more detailed description of the base model can be found on its [model card](https://huggingface.co/Salesforce/xgen-7b-8k-inst).

* **Base Model:** [XGen-7B-8K-Inst](https://huggingface.co/Salesforce/xgen-7b-8k-inst) (by Salesforce AI Research)
* **Bitwidth:** 3-bit
* **Sparsity Level:** 0% (dense-only)

## Links

* **Paper**: [https://arxiv.org/pdf/2306.07629.pdf](https://arxiv.org/pdf/2306.07629.pdf)
* **Code**: [https://github.com/SqueezeAILab/SqueezeLLM](https://github.com/SqueezeAILab/SqueezeLLM)
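## Dense-and-Sparse sketch

To make the Dense-and-Sparse idea from the TLDR concrete, here is a minimal NumPy sketch of splitting a weight matrix into a heavily quantized dense part and a small full-precision sparse part. It is an illustration under stated assumptions, not the SqueezeLLM implementation: the function names, the `outlier_fraction` parameter, and the magnitude-based outlier selection are hypothetical, and simple per-row uniform rounding stands in for the quantizer described in the paper.

```python
import numpy as np


def dense_and_sparse_split(weights, outlier_fraction=0.005):
    """Move the largest-magnitude entries into a sparse (rows, cols, values) triple.

    Hypothetical helper: picking outliers purely by magnitude is an assumption.
    """
    w = np.asarray(weights, dtype=np.float32)
    k = int(round(outlier_fraction * w.size))
    dense = w.copy()
    if k == 0:
        # Dense-only case: nothing is moved to the sparse part.
        empty = np.empty(0, dtype=np.int64)
        return dense, (empty, empty, np.empty(0, dtype=np.float32))
    # Indices of the k entries with the largest absolute value.
    flat_idx = np.argpartition(np.abs(w).ravel(), -k)[-k:]
    rows, cols = np.unravel_index(flat_idx, w.shape)
    values = w[rows, cols].copy()
    dense[rows, cols] = 0.0  # outliers no longer stretch the dense range
    return dense, (rows, cols, values)


def quantize_uniform(dense, bits=3):
    """Per-row uniform quantization (a stand-in, not the paper's quantizer)."""
    levels = 2 ** bits - 1
    lo = dense.min(axis=1, keepdims=True)
    hi = dense.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((dense - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale.astype(np.float32), lo.astype(np.float32)


def reconstruct(codes, scale, lo, sparse):
    """Dequantize the dense part, then overwrite outlier slots at full precision."""
    rows, cols, values = sparse
    w_hat = codes.astype(np.float32) * scale + lo
    w_hat[rows, cols] = values
    return w_hat


def mse(a, b):
    return float(np.mean((a - b) ** 2))


# Synthetic weights: small Gaussian values plus a few injected outliers.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
w[rng.integers(0, 64, 8), rng.integers(0, 64, 8)] = 1.0

# Dense-only 3-bit quantization (no sparse part).
d0, s0 = dense_and_sparse_split(w, outlier_fraction=0.0)
err_dense = mse(w, reconstruct(*quantize_uniform(d0), s0))

# Dense-and-Sparse: keep 0.5% of entries in full precision.
d1, s1 = dense_and_sparse_split(w, outlier_fraction=0.005)
err_ds = mse(w, reconstruct(*quantize_uniform(d1), s1))

print("dense-only MSE:      ", err_dense)
print("dense-and-sparse MSE:", err_ds)
```

On this synthetic matrix, removing the outliers narrows each row's value range, so the same 3 bits cover the remaining weights with a finer step. Setting `outlier_fraction=0.0` corresponds to the dense-only configuration listed under **Sparsity Level** for this checkpoint.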