Transformers
With Transformers it’s very easy to load any model in 4-bit or 8-bit precision, quantizing it on the fly with bitsandbytes primitives.
Please review the bitsandbytes section in the Transformers docs.
Details about the BitsAndBytesConfig can be found here.
Beware: bf16 is the optimal compute data type!
If your hardware supports it, bf16 is the optimal compute dtype. The default is float32 for backward compatibility and numerical stability. float16 often leads to numerical instabilities, but bfloat16 provides the benefits of both worlds: numerical stability equivalent to float32 combined with the memory footprint and significant computation speedup of a 16-bit data type. Therefore, be sure to check if your hardware supports bf16 and configure it using the bnb_4bit_compute_dtype parameter in BitsAndBytesConfig:
import torch
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
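As a minimal sketch of how this config is then used, pass it to from_pretrained, which quantizes the weights on the fly (the model id below is only a placeholder; any causal LM from the Hub works the same way):
from transformers import AutoModelForCausalLM
# Pass the quantization config when loading; the weights are quantized on the fly.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model id; substitute your own
    quantization_config=quantization_config,
    device_map="auto",
)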
PEFT
With PEFT, you can use QLoRA out of the box with LoraConfig and a 4-bit base model.
Please review the bitsandbytes section in the PEFT docs.
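A minimal sketch of that workflow, assuming a 4-bit base model loaded as shown in the Transformers section above (the LoRA rank, alpha, and target modules below are illustrative values, not recommendations):
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# `model` is the 4-bit quantized base model loaded with BitsAndBytesConfig above.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                  # illustrative LoRA rank
    lora_alpha=32,                         # illustrative scaling factor
    target_modules=["q_proj", "v_proj"],   # depends on the model architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()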
Accelerate
Bitsandbytes is also easily usable from within Accelerate, where you can quantize any PyTorch model simply by passing a quantization config, e.g.:
import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from mingpt.model import GPT

# Build an (empty) minGPT model skeleton without allocating real weights.
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

with init_empty_weights():
    empty_model = GPT(model_config)

bnb_quantization_config = BnbQuantizationConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # optional
    bnb_4bit_use_double_quant=True,         # optional
    bnb_4bit_quant_type="nf4",              # optional
)

quantized_model = load_and_quantize_model(
    empty_model,
    weights_location=weights_location,  # path to the saved model weights, defined elsewhere
    bnb_quantization_config=bnb_quantization_config,
    device_map="auto",
)
For further details, e.g. model saving, CPU offloading and fine-tuning, please review the bitsandbytes section in the Accelerate docs.
PyTorch Lightning and Lightning Fabric
Bitsandbytes is available from within both
- PyTorch Lightning, a deep learning framework for professional AI researchers and machine learning engineers who need maximal flexibility without sacrificing performance at scale;
- and Lightning Fabric, a fast and lightweight way to scale PyTorch models without boilerplate.
Please review the bitsandbytes section in the PyTorch Lightning docs.
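As a rough sketch of how this looks in Lightning Fabric, assuming a recent Lightning release that ships the BitsandbytesPrecision plugin (the import path, plugin arguments, and MyModel class below are assumptions; check the linked docs for the exact API and supported modes):
import torch
from lightning.fabric import Fabric
from lightning.fabric.plugins import BitsandbytesPrecision
# Quantize Linear layers to NF4 and compute in bfloat16 (assumed plugin signature).
precision = BitsandbytesPrecision(mode="nf4", dtype=torch.bfloat16)
fabric = Fabric(plugins=precision)
with fabric.init_module():
    model = MyModel()  # hypothetical model class
model = fabric.setup(model)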
Lit-GPT
Bitsandbytes is integrated into Lit-GPT, a hackable implementation of state-of-the-art open-source large language models, based on Lightning Fabric, where it can be used for quantization during training, finetuning, and inference.
Please review the bitsandbytes section in the Lit-GPT quantization docs.
Trainer for the optimizers
You can use any of the 8-bit and/or paged optimizers by simply passing them to the transformers.Trainer class on initialization. All bnb optimizers are supported by passing the correct string to TrainingArguments’ optim attribute, e.g. paged_adamw_32bit.
See the official API docs for reference.
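A minimal sketch, assuming a model and training dataset are already prepared (model and train_dataset below are placeholders):
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="outputs",
    optim="paged_adamw_32bit",  # any bnb optimizer string, e.g. "adamw_bnb_8bit", "paged_adamw_8bit"
    per_device_train_batch_size=4,
    num_train_epochs=1,
)
trainer = Trainer(
    model=model,                  # placeholder: your (possibly quantized) model
    args=training_args,
    train_dataset=train_dataset,  # placeholder dataset
)
trainer.train()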