Update REAME.md
REAME.md
CHANGED
@@ -1,3 +1,33 @@
1 |
|
2 |
## Source Code
|
3 |
|
|
|
1 |
+
# Quantize Llama 2 models using GGUF and llama.cpp
|
2 |
+
> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)
|
3 |
+
|
4 |
+
|
5 |
+
## Usage
|
6 |
+
|
7 |
+
* `MODEL_ID`: The ID of the model to quantize (e.g., `mlabonne/EvolCodeLlama-7b`).
|
8 |
+
* `QUANTIZATION_METHOD`: The quantization method to use.
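As a rough sketch of how these two parameters might be consumed, the snippet below validates `QUANTIZATION_METHOD` against the method names listed in this README and derives an output file path from `MODEL_ID`. The `plan_output` helper, the `KNOWN_METHODS` set, and the `<model>/<model>.<METHOD>.gguf` naming scheme are assumptions made for illustration; they are not taken from the source code.

```python
from pathlib import Path

# Method names listed in this README.
KNOWN_METHODS = {
    "q2_k", "q3_k_l", "q3_k_m", "q3_k_s", "q4_0", "q4_1", "q4_k_m",
    "q4_k_s", "q5_0", "q5_1", "q5_k_m", "q5_k_s", "q6_k", "q8_0",
}


def plan_output(model_id: str, quantization_method: str) -> Path:
    """Return a hypothetical output path for one quantized model file."""
    method = quantization_method.lower()
    if method not in KNOWN_METHODS:
        raise ValueError(f"Unknown quantization method: {quantization_method!r}")
    # Use the part after the last "/" as the model name.
    model_name = model_id.split("/")[-1]
    # Assumed naming scheme: <model>/<model>.<METHOD>.gguf
    return Path(model_name) / f"{model_name.lower()}.{method.upper()}.gguf"


MODEL_ID = "mlabonne/EvolCodeLlama-7b"
QUANTIZATION_METHOD = "q5_k_m"
print(plan_output(MODEL_ID, QUANTIZATION_METHOD).as_posix())
```

Rejecting unknown method names before any download or conversion starts avoids spending time on a run that would fail at the quantization step.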
|
9 |
+
|
10 |
+
## Quantization methods
|
11 |
+
|
12 |
+
The names of the quantization methods follow the convention "q" + the number of bits + the variant used (detailed below). The following list covers the available quantization methods and their corresponding use cases, based on model cards made by [TheBloke](https://huggingface.co/TheBloke/):
|
13 |
+
|
14 |
+
* `q2_k`: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
|
15 |
+
* `q3_k_l`: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, Q3_K for the other tensors.
|
16 |
+
* `q3_k_m`: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, Q3_K for the other tensors.
|
17 |
+
* `q3_k_s`: Uses Q3_K for all tensors.
|
18 |
+
* `q4_0`: Original quant method, 4-bit.
|
19 |
+
* `q4_1`: Higher accuracy than q4_0 but not as high as q5_0. However, it has faster inference than the q5 models.
|
20 |
+
* `q4_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, Q4_K for the other tensors.
|
21 |
+
* `q4_k_s`: Uses Q4_K for all tensors.
|
22 |
+
* `q5_0`: Higher accuracy, higher resource usage and slower inference.
|
23 |
+
* `q5_1`: Even higher accuracy, with even higher resource usage and slower inference.
|
24 |
+
* `q5_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, Q5_K for the other tensors.
|
25 |
+
* `q5_k_s`: Uses Q5_K for all tensors.
|
26 |
+
* `q6_k`: Uses Q6_K (6-bit quantization) for all tensors.
|
27 |
+
* `q8_0`: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
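The naming convention described above ("q" + number of bits + variant) can be illustrated with a small parser. This is a minimal sketch: the `QuantName` type, the `parse_method` function, and the regular expression are hypothetical helpers written for this README, and they only cover the name shapes that appear in the list above.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    bits: int            # number of bits, e.g. 4 for "q4_k_m"
    variant: str         # "0", "1", or "k"
    size: Optional[str]  # "s", "m", "l" for k-quants, otherwise None


# "q" + bits + "_" + variant, with an optional "_s" / "_m" / "_l" suffix.
_PATTERN = re.compile(r"q(\d+)_(0|1|k)(?:_([sml]))?", re.IGNORECASE)


def parse_method(name: str) -> QuantName:
    """Split a quantization method name into its components."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Not a recognized method name: {name!r}")
    bits, variant, size = match.groups()
    variant = variant.lower()
    # Size suffixes only appear on k-quant names in the list above.
    if size is not None and variant != "k":
        raise ValueError(f"Size suffix is only valid for k-quants: {name!r}")
    return QuantName(int(bits), variant, size.lower() if size else None)


print(parse_method("q4_k_m"))
print(parse_method("q8_0"))
```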
|
28 |
+
|
29 |
+
As a rule of thumb, **I recommend using Q5_K_M** as it preserves most of the model's performance. Alternatively, you can use Q4_K_M if you want to save some memory. In general, K_M versions are better than K_S versions. I cannot recommend Q2_K or Q3_* versions, as they drastically decrease model performance.
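The rule of thumb above could be encoded as a pair of small helpers, for example to pick a default method or to warn about a discouraged one. This is only a sketch of the recommendation as stated in this README; the function names `recommend_method` and `is_discouraged` are hypothetical.

```python
def recommend_method(save_memory: bool = False) -> str:
    """Return the suggested default: q5_k_m, or q4_k_m to save some memory."""
    return "q4_k_m" if save_memory else "q5_k_m"


def is_discouraged(method: str) -> bool:
    """Return True for the Q2_K and Q3_* methods, which are not recommended."""
    name = method.lower()
    return name == "q2_k" or name.startswith("q3_")


print(recommend_method())
print(recommend_method(save_memory=True))
print(is_discouraged("q3_k_s"))
```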
|
30 |
+
|
31 |
|
32 |
## Source Code
|
33 |
|