---
license: other
---
# Model Card for llama-30b-hf-53q_4bit-128g_WVU

## Model Description

`llama-30b-hf-53q_4bit-128g_WVU` is a model based on the Llama architecture with 30 billion parameters.
In this model, the first 53 decoder layers have been quantized with the [`gptq`](https://github.com/qwopqwop200/GPTQ-for-LLaMa) method,
using 4-bit precision and a group size of 128.
The last 7 decoder layers (roughly 1/8 of the decoder layers) and `lm_head` have then been fine-tuned on the [wizard_vicuna_70k_unfiltered dataset](https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered) for 1 epoch.
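
As an illustration of this layer split, the sketch below shows how the trainable parameters could be selected. It is a minimal sketch, assuming a standard Hugging Face `LlamaForCausalLM` layout (60 decoder layers for the 30B model); the path and hyperparameters are placeholders, not the exact training code used for this model.

```python
import torch
from transformers import LlamaForCausalLM

# Hypothetical sketch: freeze everything, then unfreeze only the last 7
# decoder layers and lm_head for fine-tuning.
model = LlamaForCausalLM.from_pretrained("path/to/llama-30b-hf")  # placeholder path

for param in model.parameters():
    param.requires_grad = False  # keep the (quantized) early layers fixed

num_layers = model.config.num_hidden_layers  # 60 for LLaMA 30B
for layer in model.model.layers[num_layers - 7:]:
    for param in layer.parameters():
        param.requires_grad = True  # fine-tune the last 7 decoder layers

for param in model.lm_head.parameters():
    param.requires_grad = True  # fine-tune the output head as well

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # placeholder learning rate
```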

## Note

Quantization effectively reduces memory usage; however, it may introduce small deviations in the parameters.
Additionally, fine-tuning only the last few layers lowers memory requirements for training but could lead to minor performance degradation.

Several alternatives exist for fine-tuning and quantizing Llama models. The specific method used here (quantizing the first layers,
then fine-tuning the last few layers) is designed to compensate for errors introduced during quantization, which can sometimes produce unexpected answers,
and lets the last few layers adapt to both the quantization error and the dataset.

It is worth mentioning that other methods may yield superior performance. For instance:
1. Fine-tuning the entire model for `X` epochs
2. Quantizing the first `K` layers
3. Fine-tuning the remaining layers for `Y` epochs

Nonetheless, as fine-tuning the entire model requires considerable resources (for example, 4 GPUs with 80GB of VRAM each are needed even for the 7B LLaMA),
this model omits the first step of the method described above, and it still works.

## Using the Model

To load the model, a custom `LlamaForCausalLM` is required.
You can find the `quantized_llama` code [here](https://github.com/LearnItAnyway/quantized_llama).
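
A minimal loading sketch is shown below. It assumes the custom `LlamaForCausalLM` from the `quantized_llama` repository exposes the usual `from_pretrained` interface; the import path, model id, and prompt format are assumptions to check against that repository and this model card.

```python
from transformers import LlamaTokenizer
# Assumption: the custom class is importable from the quantized_llama package;
# check the repository linked above for the actual module path and arguments.
from quantized_llama import LlamaForCausalLM

model_id = "LearnItAnyway/llama-30b-hf-53q_4bit-128g_WVU"  # hypothetical hub id
tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "USER: What is quantization?\nASSISTANT:"  # placeholder prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```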

## References

1. Meta - LLaMA
2. [WizardLM](https://github.com/nlpxucan/WizardLM)
3. [GPTQ for LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
4. [Wizard Vicuna Unfiltered Dataset](https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered)
5. Various other unlisted works, research, and projects.