---
license: other
---

# Model Card for llama-30b-hf-53q_4bit-128g_WVU

## Model Description

`llama-30b-hf-53q_4bit-128g_WVU` is a model based on the LLaMA architecture with 30 billion parameters.
The first 53 decoder layers have been quantized with the [`gptq`](https://github.com/qwopqwop200/GPTQ-for-LLaMa) method, using 4-bit precision and a group size of 128.
The last 7 decoder layers (1/8 of the decoder layers) and `lm_head` were then fine-tuned for 1 epoch on the [wizard_vicuna_70k_unfiltered dataset](https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered).
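
For illustration, here is a minimal PyTorch sketch of the layer split described above, assuming the standard `transformers` `LlamaForCausalLM` layout with 60 decoder layers. It shows only which parameters are trainable; in the sketch all layers are ordinary fp16 modules, whereas the actual model keeps the first 53 layers as GPTQ 4-bit layers. The checkpoint name is an assumption, not part of this release.

```python
import torch
from transformers import LlamaForCausalLM

# LLaMA-30B has 60 decoder layers: layers 0-52 stay quantized and frozen,
# layers 53-59 and lm_head are fine-tuned.
model = LlamaForCausalLM.from_pretrained(
    "huggyllama/llama-30b",       # assumption: any LLaMA-30B checkpoint in HF format
    torch_dtype=torch.float16,
)

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last 7 decoder layers.
for layer in model.model.layers[53:]:
    for param in layer.parameters():
        param.requires_grad = True

# Unfreeze the output head.
for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e9:.2f}B")
```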

## Note

Quantization effectively reduces memory usage; however, it also perturbs the original parameters (quantization error).
Additionally, fine-tuning only the last few layers lowers the memory requirements for training, but it could lead to minor performance degradation.
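
As a rough back-of-envelope illustration of the memory savings (weight storage only, ignoring group scales/zero-points, activations, and the unquantized layers):

```python
# Approximate weight storage for a 30B-parameter model.
params = 30e9
fp16_gib = params * 2 / 2**30    # 2 bytes per weight  -> ~56 GiB
int4_gib = params * 0.5 / 2**30  # 4 bits per weight   -> ~14 GiB
print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```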

Several alternatives exist for fine-tuning and quantizing the LLaMA models. The specific method used here, quantizing the first several layers
and then fine-tuning the last few, is designed to compensate for errors introduced during quantization (which can sometimes result in unexpected answers):
it lets the last few layers be fine-tuned with both the quantization error and the dataset taken into account.

It is worth mentioning that other methods may yield superior performance. For instance:
1. Fine-tuning the entire model for `X` epochs
2. Quantizing the first `K` layers
3. Fine-tuning the remaining layers for `Y` epochs

Nonetheless, since fine-tuning the entire model requires considerable resources (for example, roughly 4 GPUs with 80 GB of VRAM are required even for the 7B LLaMA),
this model omits the first step of the method described above, and it works.

## Using the Model

To load the model, a custom `LlamaForCausalLM` class is required.
You can find the quantized LLaMA code [here](https://github.com/LearnItAnyway/quantized_llama).
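
A minimal usage sketch is shown below. It assumes the custom class from the linked `quantized_llama` repository exposes a `from_pretrained`-style interface; the import path and class behavior are assumptions, so check that repository for the actual API.

```python
from transformers import AutoTokenizer
# Assumption: the custom class lives in the quantized_llama repo and mirrors
# the transformers LlamaForCausalLM interface; the exact import may differ.
from quantized_llama import LlamaForCausalLM

model_id = "LearnItAnyway/llama-30b-hf-53q_4bit-128g_WVU"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id).cuda()  # assumption: 4-bit layers handled internally

prompt = "Explain what GPTQ quantization does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```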

## References

1. Meta - LLaMA
2. [WizardLM](https://github.com/nlpxucan/WizardLM)
3. [GPTQ for LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
4. [Wizard Vicuna Unfiltered Dataset](https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered)
5. Various other unlisted works, research, and projects.