Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


phi-2-4bit-64rank - bnb 4bits
- Model creator: https://huggingface.co/LoftQ/
- Original model: https://huggingface.co/LoftQ/phi-2-4bit-64rank/




Original model description:
---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- quantization
- lora
---
# LoftQ Initialization

| [Paper](https://arxiv.org/abs/2310.08659) | [Code](https://github.com/yxli2123/LoftQ) | [PEFT Example](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning) |

LoftQ (LoRA-fine-tuning-aware Quantization) provides a quantized backbone Q and LoRA adapters A and B, given a full-precision pre-trained weight W.
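
Concretely (a sketch of the initialization objective from the paper, using the notation above), LoftQ alternates quantization and low-rank approximation so that the quantized backbone plus the adapter product stays close to the pre-trained weight:

```latex
% LoftQ initialization (sketch): jointly approximate W by a quantized
% backbone Q plus a rank-r adapter product A B^T, in Frobenius norm.
\min_{Q,\,A,\,B} \; \bigl\| W - Q - A B^{\top} \bigr\|_F^2
% Solved by alternating: Q <- quantize(W - A B^T),
% then (A, B) <- rank-r SVD of (W - Q).
```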

This model, `phi-2-4bit-64rank`, is obtained from [phi-2](https://huggingface.co/microsoft/phi-2).
The quantized backbone is stored at the root of `LoftQ/phi-2-4bit-64rank`, and the LoRA adapters are in the subfolder `loftq_init`.

## Model Info
### Backbone
- Stored format: `torch.float16`
- Size on disk: ~5.5 GiB
- Loaded format: bitsandbytes NF4
- Size loaded on GPU: ~1.4 GiB
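
If you want to confirm the loaded footprint on your own hardware, `transformers` exposes a helper for it. A minimal check, assuming `base_model` has already been loaded in NF4 as in the Usage section below:

```python
# Sanity-check the 4-bit memory footprint of the loaded backbone.
# `base_model` is the quantized model from the Usage example below.
footprint_gib = base_model.get_memory_footprint() / (1024 ** 3)
print(f"Backbone footprint: {footprint_gib:.2f} GiB")  # expect roughly 1.4 GiB
```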

### LoRA adapters
- rank: 64
- lora_alpha: 16
- target_modules: ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
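
For reference, these hyperparameters correspond to a PEFT `LoraConfig` roughly like the sketch below. You normally do not need to construct it yourself, since the adapters shipped in `loftq_init` already carry their configuration; the dropout value here is an assumption.

```python
from peft import LoraConfig, TaskType

# Illustrative LoraConfig matching the adapter hyperparameters listed above.
# The actual config ships with the adapters in subfolder "loftq_init".
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
    lora_dropout=0.0,  # assumption; not stated in this card
)
```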

## Usage

**Training** Here is an example of loading this model and preparing it for LoRA fine-tuning.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/phi-2-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    torch_dtype=torch.float32,  # you may change this for other models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float32,  # float32 is tested and verified
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)

# Do training with peft_model ...
```
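
The example above stops at loading. As one possible continuation, here is a minimal fine-tuning sketch with the Hugging Face `Trainer` on GSM8K. The prompt format, hyperparameters, and output paths are illustrative assumptions, not the settings used for the results reported below.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# If tokenizer files are not present in this repo, load them from microsoft/phi-2 instead.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # phi-2 has no pad token by default

def tokenize(example):
    # Concatenate question and answer into a single causal-LM training string.
    text = example["question"] + "\n" + example["answer"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

train_dataset = load_dataset("gsm8k", "main", split="train").map(
    tokenize, remove_columns=["question", "answer"]
)

trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="phi-2-loftq-gsm8k",    # illustrative path
        per_device_train_batch_size=4,     # adjust to your GPU memory
        gradient_accumulation_steps=4,
        learning_rate=3e-4,
        num_train_epochs=1,
        logging_steps=10,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
peft_model.save_pretrained("phi-2-loftq-gsm8k-adapters")  # saves the LoRA adapters only
```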

## Experiment Results
We conducted supervised fine-tuning experiments on [GSM8K](https://huggingface.co/datasets/gsm8k).

| Model   | Bits | Rank | LoRA Initial           | GSM8K     |
| --------| ---- | ---- | ---------------------- | --------- |
| Phi-2   | 16   | -    | Full model fine-tuning | 66.8±1.2  |
| Phi-2   | 16   | 64   | Gaussian + 0 (LoRA)    | 64.8±0.5  |
| Phi-2   | 4    | 64   | Gaussian + 0 (QLoRA)   | 60.2±0.6  |
| Phi-2   | 4    | 64   | LoftQ                  | 64.1±0.7  |



**Inference** Here is example code for inference after the model has been fine-tuned on [GSM8K](https://huggingface.co/datasets/gsm8k); the fine-tuned adapters are loaded from the `gsm8k` subfolder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/phi-2-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    torch_dtype=torch.float32,  # you may change this for other models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float32,  # float32 is tested and verified
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="gsm8k",
    is_trainable=False,  # inference only, so the adapters stay frozen
)

# Do inference with peft_model ...
```
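
Once the adapters are loaded, generation works through the usual `generate` API. A short sketch; the prompt format and decoding settings are assumptions, so adapt them to how the `gsm8k` adapters were trained:

```python
import torch
from transformers import AutoTokenizer

# If tokenizer files are not present in this repo, load them from microsoft/phi-2 instead.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

prompt = "Question: A pen costs 3 dollars and a notebook costs 5 dollars. How much do 2 pens and 3 notebooks cost?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(peft_model.device)

peft_model.eval()
with torch.no_grad():
    output_ids = peft_model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
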
See the full code in our [GitHub repo](https://github.com/yxli2123/LoftQ).


## Citation

```bibtex
@article{li2023loftq,
  title={LoftQ: LoRA-fine-tuning-aware quantization for large language models},
  author={Li, Yixiao and Yu, Yifan and Liang, Chen and He, Pengcheng and Karampatziakis, Nikos and Chen, Weizhu and Zhao, Tuo},
  journal={arXiv preprint arXiv:2310.08659},
  year={2023}
}
```