squeeze-ai-lab/dbrx-instruct-a2-s1

squeeze-ai-lab committed on Apr 19

Commit a6292a8 • 1 Parent(s): 1c09791

Update README

Files changed (1)

README.md +29 -1

@@ -1,3 +1,31 @@
 ---
 license: mit
----

+**KVQuant** is a methodology for efficient KV cache quantization that incorporates several innovations to achieve accurate low-precision quantization,
+thereby enabling efficient long context length inference.
+**TLDR:** KVQuant addresses the memory bottleneck with long context length inference by quantizing the KV cache to low precision.
+KVQuant achieves high accuracy with low-precision KV cache quantization by considering several consistent patterns observed in cached KV values across different LLMs,
+and by developing methods to exploit these patterns, including:
+- **Per-channel, Pre-RoPE** Key quantization to better match the outlier channels in Keys
+- Non-Uniform Quantization (**NUQ**) to better represent the non-uniform activations
+- **Dense-and-Sparse Quantization** to mitigate the impact of numerical outliers on quantization
+- **Q-Norm** to mitigate distribution shift at ultra-low precisions (e.g., 2-bit)
+- **Attention-Sink Aware Quantization** to avoid quantizing the first token, which is disproportionately sensitive to quantization error
+For more details, please check out our [paper](https://arxiv.org/abs/2401.18079.pdf).
+## Model description
+Quantizer file for running DBRX with 2-bit KV cache using KVQuant.
+* **Base Model:** [DBRX-Instruct](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm)
+* **Bitwidth:** 2-bit
+* **Sparsity Level:** 1%
+## Links
+* **Paper**: [https://arxiv.org/abs/2401.18079.pdf](https://arxiv.org/abs/2401.18079.pdf)
+* **Code**: [https://github.com/SqueezeAILab/KVQuant](https://github.com/SqueezeAILab/KVQuant)
 ---
 license: mit
+---
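
The README text added in this commit names dense-and-sparse quantization with a 2-bit width and a 1% sparsity level. As a rough illustration of that idea only, the sketch below keeps the largest-magnitude 1% of values in full precision (the sparse part) and quantizes the rest to 2 bits (the dense part). It is a minimal sketch under stated assumptions: the function names are hypothetical and not part of the KVQuant codebase, and plain uniform quantization is swapped in for the non-uniform (NUQ) scheme the README describes.

```python
import math


def dense_and_sparse_quantize(values, bits=2, outlier_fraction=0.01):
    """Split `values` into sparse full-precision outliers and dense low-bit codes.

    Hypothetical helper for illustration; not the KVQuant implementation.
    Uses uniform quantization, whereas KVQuant describes a non-uniform scheme.
    """
    n = len(values)
    num_outliers = math.ceil(n * outlier_fraction)
    if num_outliers >= n:
        raise ValueError("outlier_fraction leaves no values to quantize")

    # Sparse part: the largest-magnitude entries stay in full precision.
    by_magnitude = sorted(range(n), key=lambda i: abs(values[i]), reverse=True)
    sparse = {i: values[i] for i in by_magnitude[:num_outliers]}

    # Dense part: the quantization range is computed without the outliers.
    dense = [v for i, v in enumerate(values) if i not in sparse]
    lo, hi = min(dense), max(dense)
    levels = (1 << bits) - 1  # 3 steps, i.e. 4 codes, for 2-bit
    scale = (hi - lo) / levels if hi > lo else 1.0

    # Outlier positions get a placeholder code of 0; they are restored from `sparse`.
    codes = [0 if i in sparse else round((v - lo) / scale)
             for i, v in enumerate(values)]
    return codes, lo, scale, sparse


def dequantize(codes, lo, scale, sparse):
    """Rebuild values from dense codes, overriding outlier positions exactly."""
    return [sparse.get(i, lo + c * scale) for i, c in enumerate(codes)]


def max_abs_error(original, reconstructed):
    return max(abs(a - b) for a, b in zip(original, reconstructed))


if __name__ == "__main__":
    # 99 small values in [-0.5, 0.5) plus one large outlier (synthetic data).
    data = [((i * 37) % 100) / 100 - 0.5 for i in range(99)] + [40.0]
    for fraction in (0.0, 0.01):
        packed = dense_and_sparse_quantize(data, bits=2, outlier_fraction=fraction)
        error = max_abs_error(data, dequantize(*packed))
        print(f"outlier_fraction={fraction}: max abs error = {error:.3f}")
```

Removing the single outlier before computing the range lets the four 2-bit levels cover only the bulk of the values, which is why the reconstruction error on this synthetic data drops once the 1% sparse part is enabled.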