Quantization made by Richard Erkhov.
[Github](https://github.com/RichardErkhov)
[Discord](https://discord.gg/pvy7H8DZMG)
[Request more models](https://github.com/RichardErkhov/quant_request)
phi-2-4bit-64rank - bnb 4bits
- Model creator: https://huggingface.co/LoftQ/
- Original model: https://huggingface.co/LoftQ/phi-2-4bit-64rank/
Original model description:
---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- quantization
- lora
---
# LoftQ Initialization
| [Paper](https://arxiv.org/abs/2310.08659) | [Code](https://github.com/yxli2123/LoftQ) | [PEFT Example](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning) |
LoftQ (LoRA-fine-tuning-aware Quantization) provides a quantized backbone Q and LoRA adapters A and B, given a full-precision pre-trained weight W.
This model, `phi-2-4bit-64rank`, is obtained from [phi-2](https://huggingface.co/microsoft/phi-2).
The backbone is stored under `LoftQ/phi-2-4bit-64rank`, and the LoRA adapters are stored in the same repository under `subfolder='loftq_init'`.
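Conceptually, LoftQ chooses the quantized backbone and the low-rank factors jointly so that their sum approximates the original full-precision weight. A minimal statement of the objective (Frobenius norm, following the paper):

```latex
\min_{Q,\, A,\, B} \;\left\lVert W - Q - A B^{\top} \right\rVert_F^2
```

Here `W` is the pre-trained weight, `Q` its low-bit quantized counterpart, and `A`, `B` the rank-`r` LoRA factors.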
## Model Info
### Backbone
- Stored format: `torch.float16`
- Size: ~ 5.5 GiB
- Loaded format: bitsandbytes nf4
- Size loaded on GPU: ~1.4 GiB
### LoRA adapters
- rank: 64
- lora_alpha: 16
- target_modules: ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
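For reference, a `peft` `LoraConfig` matching the settings above would look roughly like the following. This is only a sketch: the adapters shipped in `loftq_init` already encode this configuration, so in practice you load them directly as shown in Usage below.

```python
from peft import LoraConfig, TaskType

# Sketch of a LoraConfig mirroring the shipped adapter settings.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,              # LoRA rank
    lora_alpha=16,     # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
    lora_dropout=0.0,  # assumption: no dropout is specified in this card
)
```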
## Usage
**Training** Here is an example of loading this model and preparing it for LoRA fine-tuning.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
MODEL_ID = "LoftQ/phi-2-4bit-64rank"
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,  # you may change it for different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float32,  # float32 is tested and verified
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
```
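As a minimal sketch of what the fine-tuning step might look like with the `transformers` `Trainer`: the dataset formatting, output path, and hyperparameters below are illustrative assumptions, not the exact recipe used for the reported results.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # phi-2 has no pad token by default

# Illustrative: format GSM8K examples as "question\nanswer" strings and tokenize.
def tokenize(example):
    text = example["question"] + "\n" + example["answer"]
    return tokenizer(text, truncation=True, max_length=512)

train_data = load_dataset("gsm8k", "main", split="train").map(
    tokenize, remove_columns=["question", "answer"]
)

trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="phi2-loftq-gsm8k",   # hypothetical output path
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=3e-4,              # illustrative hyperparameters
        num_train_epochs=1,
        fp16=False,                      # compute dtype above is float32
        logging_steps=10,
    ),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```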
## Experiment Results
We conducted supervised fine-tuning experiments on [GSM8K](https://huggingface.co/datasets/gsm8k).
| Model | Bits | Rank | LoRA Initial | GSM8K |
| --------| ---- | ---- | ---------------------- | --------- |
| Phi-2 | 16 | - | Full model fine-tuning | 66.8±1.2 |
| Phi-2 | 16 | 64 | Gaussian + 0 (LoRA) | 64.8±0.5 |
| Phi-2 | 4 | 64 | Gaussian + 0 (QLoRA) | 60.2±0.6 |
| Phi-2 | 4 | 64 | LoftQ | 64.1±0.7 |
**Inference** Here is example code for inference after the model has been fine-tuned on [GSM8K](https://huggingface.co/datasets/gsm8k).
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
MODEL_ID = "LoftQ/phi-2-4bit-64rank"
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,  # you may change it for different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float32,  # float32 is tested and verified
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="gsm8k",  # adapters fine-tuned on GSM8K; loaded for inference
)
# Do inference with peft_model ...
```
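A minimal sketch of running generation with the loaded model; the prompt (a GSM8K-style question) and decoding settings are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Illustrative GSM8K-style prompt.
prompt = ("Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether "
          "in April and May?\n")
inputs = tokenizer(prompt, return_tensors="pt").to(peft_model.device)

peft_model.eval()
with torch.no_grad():
    output_ids = peft_model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```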
See the full code at our [GitHub repo](https://github.com/yxli2123/LoftQ).
## Citation
```bibtex
@article{li2023loftq,
title={Loftq: Lora-fine-tuning-aware quantization for large language models},
author={Li, Yixiao and Yu, Yifan and Liang, Chen and He, Pengcheng and Karampatziakis, Nikos and Chen, Weizhu and Zhao, Tuo},
journal={arXiv preprint arXiv:2310.08659},
year={2023}
}
```