README.md · squeeze-ai-lab/sq-opt-13b-w4-s50 at effab4be600c6fefaea7bf97f07c8436b2255e11

SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.

TLDR: Deploying LLMs is difficult due to their large memory footprint. Reduced-precision quantization addresses this, but naive quantization hurts model performance. We address this with a new Dense-and-Sparse Quantization method, which splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive and outlier values of the weight matrix. With this approach, we are able to serve larger models with a smaller memory footprint, the same latency, and yet higher accuracy and quality. For more details, please check out our paper.
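To make the dense/sparse split concrete, here is a minimal NumPy sketch. It is not the SqueezeLLM implementation: the function name `dense_and_sparse_split` is hypothetical, outliers are picked purely by magnitude, and the dense part uses plain uniform quantization as a stand-in, since this README does not describe the actual quantizer or how sensitive values are selected (see the paper for those).

```python
import numpy as np

def dense_and_sparse_split(weights, sparsity=0.005, bits=4):
    """Illustrative split of a weight matrix into a quantized dense part
    and a full-precision sparse part (assumption: magnitude-based outliers,
    uniform quantization; not the method from the paper)."""
    w = np.asarray(weights, dtype=np.float32)

    # Number of entries kept in full precision (e.g. 0.5% of all weights).
    k = max(1, int(round(sparsity * w.size)))

    # Mark the k largest-magnitude entries as the sparse component.
    top_idx = np.argpartition(np.abs(w).ravel(), -k)[-k:]
    mask = np.zeros(w.size, dtype=bool)
    mask[top_idx] = True
    mask = mask.reshape(w.shape)

    sparse = np.where(mask, w, 0.0).astype(np.float32)
    dense = np.where(mask, 0.0, w).astype(np.float32)

    # Uniformly quantize the dense remainder to 2**bits levels.
    levels = 2 ** bits - 1
    lo, hi = float(dense.min()), float(dense.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((dense - lo) / scale).astype(np.uint8)

    # Dequantize; outlier positions contribute only through `sparse`.
    dense_dequant = np.where(mask, 0.0, codes * scale + lo).astype(np.float32)
    return codes, dense_dequant, sparse, mask

# Toy matrix with one large outlier.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
w[0, 0] = 50.0

codes, dense_dequant, sparse, mask = dense_and_sparse_split(w)
w_hat = dense_dequant + sparse  # reconstruction = dense + sparse
print("entries kept in full precision:", int(mask.sum()))
print("max reconstruction error:", float(np.abs(w - w_hat).max()))
```

Because the outlier is moved into the sparse component, the quantization range of the dense part stays narrow, which is why the reconstruction error in this toy example stays small even at 4 bits.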

Model description

4-bit quantized OPT 13B model using SqueezeLLM. More details can be found in the paper.

Base Model: OPT 13B
Bitwidth: 4-bit
Sparsity Level: 0.5%
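
As a rough back-of-the-envelope check on what these settings mean for weight storage, the sketch below estimates the footprint of the dense and sparse components. The parameter count (a nominal 13e9), the sparse storage layout (a 16-bit value plus a 32-bit index per nonzero), and the helper name `estimate_weight_memory_gb` are all assumptions made for illustration; the estimate ignores lookup tables, activations, and the KV cache, and it is not a measured number for this checkpoint.

```python
def estimate_weight_memory_gb(n_params, bits=4, sparsity=0.005,
                              sparse_value_bits=16, sparse_index_bits=32):
    """Rough weight-storage estimate in GB (1 GB = 1e9 bytes).

    Assumptions: every parameter gets a `bits`-wide code in the dense part,
    and each sparse entry stores one value plus one index.
    """
    dense_bytes = n_params * bits / 8
    n_sparse = n_params * sparsity
    sparse_bytes = n_sparse * (sparse_value_bits + sparse_index_bits) / 8
    return dense_bytes / 1e9, sparse_bytes / 1e9

N_PARAMS = 13e9  # nominal size of OPT 13B (assumption, not an exact count)

dense_gb, sparse_gb = estimate_weight_memory_gb(N_PARAMS)
fp16_gb = N_PARAMS * 16 / 8 / 1e9  # same weights stored in 16-bit floats

print(f"dense (4-bit):      {dense_gb:.2f} GB")
print(f"sparse (0.5%):      {sparse_gb:.2f} GB")
print(f"fp16 for reference: {fp16_gb:.2f} GB")
```

Under these assumptions, the sparse component adds only a small fraction on top of the 4-bit dense weights.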


license: other