LoftQ Initialization

LoftQ (LoRA-fine-tuning-aware Quantization) provides a quantized backbone Q and LoRA adapters A and B, given a full-precision pre-trained weight W.
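
The relationship between W, Q, A, and B can be written out as a short sketch, following the formulation in the LoftQ paper. The symbols q_N (an N-bit quantization function), r (the adapter rank), T (the number of alternating steps), and SVD_r (the best rank-r approximation obtained from a truncated SVD) are notation assumed here for illustration; they do not appear elsewhere in this card.

```latex
\begin{align}
  &\min_{Q,\,A,\,B} \; \bigl\lVert W - Q - A B^{\top} \bigr\rVert_F \\
  % Alternating steps, starting from A_0 = 0 and B_0 = 0:
  &Q_t = q_N\!\bigl(W - A_{t-1} B_{t-1}^{\top}\bigr), \\
  &A_t B_t^{\top} = \operatorname{SVD}_r\!\bigl(W - Q_t\bigr),
  \qquad t = 1, \dots, T
\end{align}
```

Under this reading, quantizing W directly and starting with A B^T = 0 corresponds to the "Gaussian + 0" initialization listed in the experiment table, while LoftQ chooses A and B so that Q + A B^T stays close to W.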

This model, Llama-2-13b-hf-4bit-64rank, is obtained from LLAMA-2-13b. The backbone is stored under LoftQ/Llama-2-13b-hf-4bit-64rank, and the LoRA adapters are stored in the loftq_init subfolder (load them with subfolder='loftq_init').

Model Info

Backbone

  • Stored format: torch.bfloat16
  • Size: ~26 GiB
  • Loaded format: bitsandbytes nf4
  • Size loaded on GPU: ~6.5 GiB
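
The two sizes above can be sanity-checked with a rough back-of-the-envelope calculation. This is only a sketch: it assumes roughly 13 billion parameters, counts decimal gigabytes (10^9 bytes), and ignores quantization constants and any layers that bitsandbytes leaves unquantized. The helper name `estimate_weight_size_gb` is hypothetical.

```python
def estimate_weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Rough weight footprint in decimal gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 13e9  # assumption: approximate parameter count of a 13B model

stored_gb = estimate_weight_size_gb(NUM_PARAMS, 16)  # bfloat16 on disk
loaded_gb = estimate_weight_size_gb(NUM_PARAMS, 4)   # nf4 on GPU

print(f"stored (16-bit): ~{stored_gb:.1f} GB")
print(f"loaded (4-bit):  ~{loaded_gb:.1f} GB")
```

The 4x ratio between the two figures follows directly from going from 16 bits to 4 bits per weight.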

LoRA adapters

  • rank: 64
  • lora_alpha: 64
  • target_modules: ["down_proj", "up_proj", "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"]
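
To give a feel for how large these adapters are, the sketch below counts the LoRA parameters implied by the settings above. The layer dimensions (hidden size 5120, intermediate size 13824, 40 decoder layers) are assumptions taken from the public Llama-2-13b configuration, not from this card, and the helper `lora_param_count` is hypothetical.

```python
# Assumed Llama-2-13b dimensions (from the public model config, not this card).
HIDDEN_SIZE = 5120
INTERMEDIATE_SIZE = 13824
NUM_LAYERS = 40

# Adapter settings listed in this card.
RANK = 64
LORA_ALPHA = 64

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "v_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "gate_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "up_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "down_proj": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair holds one (in x rank) and one (rank x out) matrix."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / RANK  # LoRA scales the adapter output by lora_alpha / rank

print(f"adapter parameters per layer: {per_layer:,}")
print(f"adapter parameters in total:  {total:,}")
print(f"LoRA scaling factor:          {scaling}")
```

With rank and lora_alpha both set to 64, the scaling factor is 1, so the adapter product is added to the backbone output without rescaling.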

Usage

Training

Here's an example of loading this model and preparing it for LoRA fine-tuning.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Llama-2-13b-hf-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    torch_dtype=torch.bfloat16,  # you may change it with different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)

# Do training with peft_model ...

Experiment Results

We have conducted supervised fine-tuning experiments on GSM8K and WikiText-2.

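| Model | Bits | Rank | LoRA Initialization | GSM8K | WikiText-2 |
| --- | --- | --- | --- | --- | --- |
| LLAMA-2-13b | 16 | 64 | Gaussian + 0 | 45.3 | 5.12 |
| LLAMA-2-13b | 4 | 64 | Gaussian + 0 (QLoRA) | 39.9 | 5.22 |
| LLAMA-2-13b | 4 | 64 | LoftQ | 45.0 | 5.16 |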
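
One way to read these numbers is to ask how much of the gap between the 4-bit QLoRA baseline and the 16-bit model LoftQ closes. The sketch below does that arithmetic on the values from the table; the helper `gap_closed` is hypothetical, and the ratio it computes is an illustrative summary rather than a metric reported by the authors.

```python
def gap_closed(full_precision: float, qlora: float, loftq: float) -> float:
    """Fraction of the QLoRA-to-16-bit gap recovered by LoftQ.

    The ratio is sign-agnostic, so it works whether higher or lower is better.
    """
    return (qlora - loftq) / (qlora - full_precision)


# Values copied from the table above.
gsm8k = gap_closed(full_precision=45.3, qlora=39.9, loftq=45.0)
wikitext2 = gap_closed(full_precision=5.12, qlora=5.22, loftq=5.16)

print(f"GSM8K gap closed:      {gsm8k:.0%}")
print(f"WikiText-2 gap closed: {wikitext2:.0%}")
```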

Inference

Here is example code for inference after the model has been fine-tuned on GSM8K.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Llama-2-13b-hf-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    torch_dtype=torch.bfloat16,  # you may change it with different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="gsm8k",
    is_trainable=False,  # adapters stay frozen for inference
)

# Do inference with peft_model ...

See the full code in our GitHub repo.

Citation

@article{li2023loftq,
  title={LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models},
  author={Li, Yixiao and Yu, Yifan and Liang, Chen and He, Pengcheng and Karampatziakis, Nikos and Chen, Weizhu and Zhao, Tuo},
  journal={arXiv preprint arXiv:2310.08659},
  year={2023}
}