Update README.md
README.md
CHANGED
@@ -6,6 +6,16 @@ tags:
---

+Q. Why quantize to 8-bit instead of 4-bit?
+A. In theory, an 8-bit quantized model should provide slightly better perplexity than a 4-bit quantized version (the difference may not be noticeable; to be evaluated). If your available GPU VRAM is over 15 GB, you may want to try this out.
+Note that quantizing to 8-bit is not the same as loading the model in 8-bit precision. Loading the model in 8-bit precision (--load-in-8bit) definitely comes with a non-linear quality (perplexity) degradation.
+
+Refs:
+- https://github.com/ggerganov/llama.cpp/pull/951
+- https://news.ycombinator.com/item?id=35148542
+- https://arxiv.org/abs/2105.03536
+- https://github.com/IST-DASLab/gptq
+
**This model is an 8-bit quantization of Vicuna 13B.**
- 13B parameters
- Group size: 128
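The distinction drawn in the note above, offline 8-bit GPTQ quantization versus loading an fp16 checkpoint in 8-bit precision at runtime, can be illustrated with a short sketch. The snippet below is illustrative only and is not taken from this repository: the repo ids are placeholders, and loading a GPTQ checkpoint is assumed to go through a GPTQ-aware loader such as AutoGPTQ rather than plain `transformers`.

```python
# Illustrative sketch: contrasts runtime 8-bit loading with loading a checkpoint
# that was quantized offline with GPTQ (8-bit, group size 128, as listed above).
# Repo ids are placeholders, not this model's actual ids.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path 1: what --load-in-8bit does. The checkpoint on disk stays fp16;
# bitsandbytes converts the weights to int8 at load time. This is the path
# the note above warns about (non-linear perplexity degradation).
fp16_repo = "someone/vicuna-13b-fp16"  # placeholder
model_runtime_8bit = AutoModelForCausalLM.from_pretrained(
    fp16_repo,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(fp16_repo)

# Path 2: a checkpoint quantized offline with GPTQ. Such a checkpoint needs a
# GPTQ-aware loader; AutoGPTQ is one option (assumed API, check its docs):
# from auto_gptq import AutoGPTQForCausalLM
# model_gptq_8bit = AutoGPTQForCausalLM.from_quantized(
#     "someone/vicuna-13b-gptq-8bit-128g",  # placeholder
#     device="cuda:0",
# )
```

The perplexity comparison the answer refers to (8-bit GPTQ versus 4-bit GPTQ) is between two offline-quantized checkpoints, not between GPTQ quantization and runtime int8 loading.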