squeeze-ai-lab/sq-llama-65b-w4-s5


squeeze-ai-lab committed on Jul 2, 2023

Commit 5e3f75b (1 parent: 0c6fff4)

Update README.md

Files changed (1)

README.md +24 -0


@@ -1,3 +1,27 @@
 ---
 license: other
 ---

+**SqueezeLLM** is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.
+**TLDR:** Deploying LLMs is difficult because of their large memory footprint. Reduced-precision quantization addresses this,
+but naive quantization degrades model quality. We address this with a new Dense-and-Sparse Quantization method.
+Dense-and-Sparse splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance,
+and a sparse component that preserves the sensitive and outlier values of the weight matrix. With this approach,
+we are able to serve larger models with a smaller memory footprint, the same latency, and higher accuracy and quality.
+For more details, please see our [paper](https://arxiv.org/pdf/2306.07629.pdf).
+## Model description
+LLaMA 65B quantized to 4 bits with SqueezeLLM. More details can be found in the [paper](https://arxiv.org/pdf/2306.07629.pdf).
+* **Base Model:** [LLaMA 65B](https://arxiv.org/abs/2302.13971)
+* **Bitwidth:** 4-bit
+* **Sparsity Level:** 0.05%
+## Links
+* **Paper**: [https://arxiv.org/pdf/2306.07629.pdf](https://arxiv.org/pdf/2306.07629.pdf)
+* **Code**: [https://github.com/SqueezeAILab/SqueezeLLM](https://github.com/SqueezeAILab/SqueezeLLM)
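
To put the bitwidth and sparsity level listed in the model description into perspective, here is a rough back-of-the-envelope estimate of weight storage. It is only a sketch under stated assumptions: a nominal 65 billion parameters, sparse values kept in fp16 with one 32-bit index each, and no accounting for quantization lookup tables, unquantized layers, or activations. The function name `estimate_footprint_gb` and the storage layout are illustrative and are not taken from the SqueezeLLM code.

```python
def estimate_footprint_gb(num_params, bits, sparsity, value_bytes=2, index_bytes=4):
    """Estimate storage (in GB, 1 GB = 1e9 bytes) for a dense-and-sparse model.

    Assumptions (not from the model card): every parameter keeps a `bits`-wide
    code in the dense part, and each extracted sparse value costs one fp16
    value plus one 32-bit index.
    """
    dense_bytes = num_params * bits / 8
    sparse_bytes = num_params * sparsity * (value_bytes + index_bytes)
    return dense_bytes / 1e9, sparse_bytes / 1e9


NUM_PARAMS = 65e9   # nominal parameter count for LLaMA 65B
SPARSITY = 0.0005   # 0.05% sparsity level, as listed in the model description

dense_gb, sparse_gb = estimate_footprint_gb(NUM_PARAMS, bits=4, sparsity=SPARSITY)
fp16_gb = NUM_PARAMS * 2 / 1e9  # same weights stored unquantized in fp16

print(f"dense 4-bit part: {dense_gb:.1f} GB")
print(f"sparse part:      {sparse_gb:.3f} GB")
print(f"fp16 baseline:    {fp16_gb:.1f} GB")
```

Under these assumptions the dense 4-bit part comes to about 32.5 GB and the sparse part to about 0.2 GB, compared with roughly 130 GB for the same weights in fp16, so the sparse component adds well under 1% to the quantized footprint.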
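
The dense-and-sparse split described in the README text can be illustrated with a minimal sketch. This is not the SqueezeLLM implementation: according to the paper, the method uses sensitivity-based non-uniform quantization and also retains sensitive values, whereas the toy version below picks outliers purely by magnitude and quantizes the dense part with a plain uniform grid. The helper names `dense_and_sparse_split`, `reconstruct`, and `max_error` are hypothetical.

```python
def dense_and_sparse_split(weights, sparsity, bits=4):
    """Split a flat list of weights into quantized dense codes and exact sparse outliers."""
    n_outliers = round(len(weights) * sparsity)
    # Simplification: treat the largest-magnitude weights as the outliers.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outlier_idx = set(order[:n_outliers])
    sparse = {i: weights[i] for i in outlier_idx}  # kept at full precision

    # Uniform quantization grid fitted only to the remaining (dense) weights.
    dense_vals = [w for i, w in enumerate(weights) if i not in outlier_idx]
    lo, hi = min(dense_vals), max(dense_vals)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Outlier positions get a placeholder code of 0; the sparse part overrides them.
    codes = [0 if i in outlier_idx else round((w - lo) / scale)
             for i, w in enumerate(weights)]
    return codes, lo, scale, sparse


def reconstruct(codes, lo, scale, sparse):
    """Dequantize the dense codes, then overwrite outlier positions with exact values."""
    out = [lo + c * scale for c in codes]
    for i, w in sparse.items():
        out[i] = w
    return out


def max_error(original, approx):
    return max(abs(a - b) for a, b in zip(original, approx))


weights = [0.1, -0.2, 0.05, 8.0, -0.15, 0.3, 0.0, -0.1]  # one large outlier

# Naive 4-bit quantization: the outlier stretches the grid over the whole range.
naive_err = max_error(weights, reconstruct(*dense_and_sparse_split(weights, sparsity=0.0)))

# Dense-and-sparse: pull out 1 of 8 weights, quantize the rest on a tight grid.
codes, lo, scale, sparse = dense_and_sparse_split(weights, sparsity=0.125)
split_err = max_error(weights, reconstruct(codes, lo, scale, sparse))

print(f"naive max error: {naive_err:.4f}")
print(f"split max error: {split_err:.4f}")
```

In this toy example, removing the single outlier shrinks the range the 4-bit grid has to cover, so the remaining weights are reconstructed with a much smaller error. The 12.5% sparsity used here is only so that one of eight values is extracted; the model in this repository uses 0.05%.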