4-bit quantization - 5.02 GB of GPU memory used for inference:

** See the same fine-tuning applied to GPT-J-6B: https://huggingface.co/nlpulse/gpt-j-6b-english_quotes

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 65%   74C    P2   169W / 170W |   5028MiB / 12288MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
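
The 5,028 MiB above is the per-process figure reported by nvidia-smi. A quick way to cross-check from inside Python, as a sketch using PyTorch's CUDA allocator counters (these exclude the CUDA context overhead that nvidia-smi counts, so they read somewhat lower):

import torch

# Run the inference snippet from the "Inference" section below first,
# then read PyTorch's peak allocator statistics for GPU 0.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**30:.2f} GiB")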

Fine-tuning

3 epochs, all dataset samples (split=train), 939 steps
1 x NVIDIA GeForce RTX 3060 12GB GPU - peak GPU memory: 6.85 GB
Duration: 1h54min
A sketch of the training setup follows the resource snapshot below.

$ nvidia-smi && free -h
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
|100%   87C    P2   168W / 170W |   6854MiB / 12288MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
               total        used        free      shared  buff/cache   available
Mem:            77Gi        13Gi       1.1Gi       116Mi        63Gi        63Gi
Swap:           37Gi       3.8Gi        34Gi
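
A minimal sketch of a QLoRA training setup consistent with this card. The base checkpoint name, the Abirate/english_quotes dataset id, and every LoRA/trainer hyperparameter below are assumptions for illustration, not values taken from this card; the actual code is in the Scripts link below.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-2-7b-chat-hf"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token

# same 4-bit NF4 + double-quantization config as the inference snippet below
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=quant_config,
    device_map={"": 0}, use_auth_token=True)
model = prepare_model_for_kbit_training(model)

# LoRA hyperparameters: illustrative defaults, not this card's actual values
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# dataset id assumed from the model name; train on the "quote" field
data = load_dataset("Abirate/english_quotes", split="train")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

trainer = Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(
        per_device_train_batch_size=8,  # assumed
        num_train_epochs=3,             # 3 epochs, as reported above
        learning_rate=2e-4,             # assumed
        output_dir="outputs",
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

If the dataset is indeed Abirate/english_quotes (~2,500 quotes), a batch size of 8 over 3 epochs lands close to the reported 939 steps; the linked scripts have the authoritative hyperparameters.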

Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

model_path = "nlpulse/llama2-7b-chat-english_quotes"

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token

# quantization config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# load the 4-bit quantized base model, then apply the PEFT LoRA adapter
config = PeftConfig.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
        quantization_config=quant_config, device_map={"":0}, use_auth_token=True)
model = PeftModel.from_pretrained(model, model_path)

# inference
device = "cuda"
text_list = ["Ask not what your country", "Be the change that", "You only live once, but", "I'm selfish, impatient and"]
for text in text_list:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=60)
    print('>> ', text, " => ", tokenizer.decode(outputs[0], skip_special_tokens=True))

Requirements

pip install -U bitsandbytes
pip install -U git+https://github.com/huggingface/transformers.git 
pip install -U git+https://github.com/huggingface/peft.git
pip install -U accelerate
pip install -U datasets
pip install -U scipy

Scripts

https://github.com/nlpulse-io/sample_codes/tree/main/fine-tuning/peft_quantization_4bits/llama2-7b-chat

References

QLoRa: Fine-Tune a Large Language Model on Your GPU

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

Training procedure

The following bitsandbytes quantization config was used during training (the same settings are shown as code after the list):

  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: True
  • bnb_4bit_compute_dtype: bfloat16
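
The same settings expressed as a BitsAndBytesConfig, as a direct transcription of the list above (the int8-specific fields are simply the defaults recorded there):

import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=False,
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None,
    llm_int8_enable_fp32_cpu_offload=False,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)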

Framework versions

  • PEFT 0.4.0.dev0