Transformers
With Transformers it’s very easy to load any model in 4-bit or 8-bit precision, quantizing it on the fly with bitsandbytes primitives.
Please review the bitsandbytes section in the Transformers docs.
Details about the BitsAndBytesConfig can be found here.
Beware: bf16 is the optimal compute data type!
If your hardware supports it, bf16 is the optimal compute dtype. The default is float32 for backward compatibility and numerical stability. float16 often leads to numerical instabilities, but bfloat16 provides the benefits of both worlds: numerical stability equivalent to float32 combined with the memory footprint and significant computation speedup of a 16-bit data type. Therefore, be sure to check if your hardware supports bf16 and configure it using the bnb_4bit_compute_dtype parameter in BitsAndBytesConfig:
import torch
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
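As a minimal sketch of how this config is then used, pass it to from_pretrained, which quantizes the weights on the fly (the model id below is only a placeholder; any causal LM from the Hub works the same way):
from transformers import AutoModelForCausalLM
# Pass the quantization config when loading; the weights are quantized on the fly.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model id; substitute your own
    quantization_config=quantization_config,
    device_map="auto",
)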
PEFT
With PEFT, you can use QLoRA out of the box with LoraConfig and a 4-bit base model.
Please review the bitsandbytes section in the PEFT docs.
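A minimal sketch of that workflow, assuming a 4-bit base model loaded as shown in the Transformers section above (the LoRA rank, alpha, and target modules below are illustrative values, not recommendations):
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# `model` is the 4-bit quantized base model loaded with BitsAndBytesConfig above.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                  # illustrative LoRA rank
    lora_alpha=32,                         # illustrative scaling factor
    target_modules=["q_proj", "v_proj"],   # depends on the model architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()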
Accelerate
Bitsandbytes is also easily usable from within Accelerate, where you can quantize any PyTorch model simply by passing a quantization config, e.g.:
import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from mingpt.model import GPT

# Build an (empty) minGPT model skeleton without allocating real weights.
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

with init_empty_weights():
    empty_model = GPT(model_config)

bnb_quantization_config = BnbQuantizationConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # optional
    bnb_4bit_use_double_quant=True,         # optional
    bnb_4bit_quant_type="nf4",              # optional
)

quantized_model = load_and_quantize_model(
    empty_model,
    weights_location=weights_location,  # path to the saved model weights, defined elsewhere
    bnb_quantization_config=bnb_quantization_config,
    device_map="auto",
)
For further details, e.g. model saving, CPU offloading and fine-tuning, please review the bitsandbytes section in the Accelerate docs.
PyTorch Lightning and Lightning Fabric
Bitsandbytes is available from within both
- PyTorch Lightning, a deep learning framework for professional AI researchers and machine learning engineers who need maximal flexibility without sacrificing performance at scale;
- and Lightning Fabric, a fast and lightweight way to scale PyTorch models without boilerplate.
Please review the bitsandbytes section in the PyTorch Lightning docs.
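As a rough sketch of how this looks in Lightning Fabric, assuming a recent Lightning release that ships the BitsandbytesPrecision plugin (the import path, plugin arguments, and MyModel class below are assumptions; check the linked docs for the exact API and supported modes):
import torch
from lightning.fabric import Fabric
from lightning.fabric.plugins import BitsandbytesPrecision
# Quantize Linear layers to NF4 and compute in bfloat16 (assumed plugin signature).
precision = BitsandbytesPrecision(mode="nf4", dtype=torch.bfloat16)
fabric = Fabric(plugins=precision)
with fabric.init_module():
    model = MyModel()  # hypothetical model class
model = fabric.setup(model)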
Lit-GPT
Bitsandbytes is integrated into Lit-GPT, a hackable implementation of state-of-the-art open-source large language models, based on Lightning Fabric, where it can be used for quantization during training, finetuning, and inference.
Please review the bitsandbytes section in the Lit-GPT quantization docs.
Trainer for the optimizers
You can use any of the 8-bit and/or paged optimizers by simply passing them to the transformers.Trainer class on initialization. All bnb optimizers are supported by passing the correct string to TrainingArguments’ optim attribute, e.g. paged_adamw_32bit.
See the official API docs for reference.
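A minimal sketch, assuming a model and training dataset are already prepared (model and train_dataset below are placeholders):
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="outputs",
    optim="paged_adamw_32bit",  # any bnb optimizer string, e.g. "adamw_bnb_8bit", "paged_adamw_8bit"
    per_device_train_batch_size=4,
    num_train_epochs=1,
)
trainer = Trainer(
    model=model,                  # placeholder: your (possibly quantized) model
    args=training_args,
    train_dataset=train_dataset,  # placeholder dataset
)
trainer.train()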