Update README.md
README.md
CHANGED
@@ -6,6 +6,16 @@ tags:
---

+Q. Why quantize to 8-bit instead of 4-bit?
+A. In theory, an 8-bit quantized model should provide slightly better perplexity than a 4-bit quantized version (the difference may not be noticeable; to be evaluated). If your available GPU VRAM is over 15 GB, you may want to try this out.
+Note that quantizing to 8-bit is not the same as loading the model in 8-bit precision. Loading the model in 8-bit precision (--load-in-8bit) definitely comes with a non-linear quality (perplexity) degradation.
+
+Refs:
+- https://github.com/ggerganov/llama.cpp/pull/951
+- https://news.ycombinator.com/item?id=35148542
+- https://arxiv.org/abs/2105.03536
+- https://github.com/IST-DASLab/gptq
+
**This model is an 8-bit quantization of Vicuna 13B.**
- 13B parameters
- Group size: 128
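The distinction drawn in the note above, offline 8-bit GPTQ quantization versus loading an fp16 checkpoint in 8-bit precision at runtime, can be illustrated with a short sketch. The snippet below is illustrative only and is not taken from this repository: the repo ids are placeholders, and loading a GPTQ checkpoint is assumed to go through a GPTQ-aware loader such as AutoGPTQ rather than plain `transformers`.

```python
# Illustrative sketch: contrasts runtime 8-bit loading with loading a checkpoint
# that was quantized offline with GPTQ (8-bit, group size 128, as listed above).
# Repo ids are placeholders, not this model's actual ids.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path 1: what --load-in-8bit does. The checkpoint on disk stays fp16;
# bitsandbytes converts the weights to int8 at load time. This is the path
# the note above warns about (non-linear perplexity degradation).
fp16_repo = "someone/vicuna-13b-fp16"  # placeholder
model_runtime_8bit = AutoModelForCausalLM.from_pretrained(
    fp16_repo,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(fp16_repo)

# Path 2: a checkpoint quantized offline with GPTQ. Such a checkpoint needs a
# GPTQ-aware loader; AutoGPTQ is one option (assumed API, check its docs):
# from auto_gptq import AutoGPTQForCausalLM
# model_gptq_8bit = AutoGPTQForCausalLM.from_quantized(
#     "someone/vicuna-13b-gptq-8bit-128g",  # placeholder
#     device="cuda:0",
# )
```

The perplexity comparison the answer refers to (8-bit GPTQ versus 4-bit GPTQ) is between two offline-quantized checkpoints, not between GPTQ quantization and runtime int8 loading.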