Bitsandbytes documentation

4-bit quantization

You are viewing the documentation for the main version, which requires installation from source. For a regular pip install, check out the latest stable version (v0.45.0).

QLoRA is a finetuning method that quantizes a model to 4 bits, adds a set of low-rank adaptation (LoRA) weights to the model, and tunes those weights by backpropagating through the frozen quantized weights. This method also introduces a new data type, 4-bit NormalFloat (LinearNF4), in addition to the standard Float4 data type (LinearFP4). LinearNF4 is a quantization data type designed for normally distributed data and can improve performance over LinearFP4.

Linear4bit

class bitsandbytes.nn.Linear4bit

( input_features output_features bias = True compute_dtype = None compress_statistics = True quant_type = 'fp4' quant_storage = torch.uint8 device = None )

This class is the base module for the 4-bit quantization algorithm presented in QLoRA. QLoRA 4-bit linear layers use blockwise k-bit quantization under the hood, with the possibility of selecting between quantization data types such as FP4 and NF4.
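To make the idea of blockwise quantization concrete, here is a minimal pure-Python sketch. It is not the bitsandbytes implementation: the function names and the evenly spaced 16-level codebook are assumptions made for illustration, and the real FP4 and NF4 codebooks use different, non-uniform levels. Each block of values is scaled by its own absolute maximum, and every scaled value is replaced by the index of the nearest codebook entry.

```python
from typing import List, Sequence, Tuple

# Hypothetical 16-entry (4-bit) codebook with evenly spaced levels in [-1, 1].
# Assumption for illustration only; FP4 and NF4 use non-uniform levels.
CODEBOOK: List[float] = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_blockwise(
    values: Sequence[float], blocksize: int = 64
) -> Tuple[List[int], List[float]]:
    """Return one 4-bit code per value and one absmax scale per block."""
    codes: List[int] = []
    absmax: List[float] = []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        # Per-block scale; fall back to 1.0 so an all-zero block does not divide by zero.
        scale = max(abs(v) for v in block) or 1.0
        absmax.append(scale)
        for v in block:
            normalized = v / scale  # now in [-1, 1]
            # Index of the nearest codebook level.
            codes.append(min(range(16), key=lambda i: abs(CODEBOOK[i] - normalized)))
    return codes, absmax


def dequantize_blockwise(
    codes: Sequence[int], absmax: Sequence[float], blocksize: int = 64
) -> List[float]:
    """Look up each code and rescale it by the absmax of its block."""
    return [CODEBOOK[c] * absmax[i // blocksize] for i, c in enumerate(codes)]
```

Because each block carries its own scale, a single large value only affects the precision of the other values in its block rather than the whole tensor.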

To quantize a linear layer, first load the original fp16 / bf16 weights into the Linear4bit module, then call quantized_module.to("cuda") to quantize those weights.

Example:

import torch
import torch.nn as nn

import bitsandbytes as bnb
from bitsandbytes.nn import Linear4bit  # `bnb` is only an alias and cannot be used as a module path

fp16_model = nn.Sequential(
    nn.Linear(64, 64),
    nn.Linear(64, 64)
)

quantized_model = nn.Sequential(
    Linear4bit(64, 64),
    Linear4bit(64, 64)
)

quantized_model.load_state_dict(fp16_model.state_dict())
quantized_model = quantized_model.to(0) # Quantization happens here

__init__

( input_features output_features bias = True compute_dtype = None compress_statistics = True quant_type = 'fp4' quant_storage = torch.uint8 device = None )

Parameters

  • input_features (int) — Number of input features of the linear layer.
  • output_features (int) — Number of output features of the linear layer.
  • bias (bool, defaults to True) — Whether the linear class uses the bias term as well.

Initialize Linear4bit class.

LinearFP4

class bitsandbytes.nn.LinearFP4

Implements the FP4 data type.

__init__

LinearNF4

class bitsandbytes.nn.LinearNF4

( input_features output_features bias = True compute_dtype = None compress_statistics = True quant_storage = torch.uint8 device = None )

Implements the NF4 data type.

Constructs a quantization data type where each bin has equal area under a standard normal distribution N(0, 1) that is normalized into the range [-1, 1].
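The equal-area idea can be sketched with the Python standard library. This is a simplified construction under stated assumptions, not the library's create_normal_map: the function name equal_area_codebook is hypothetical, and it takes the midpoint quantile of each of 16 equal-probability bins of N(0, 1) and rescales the result into [-1, 1]. The library's implementation differs in detail; for example, the published NF4 levels include an exact zero, which this symmetric sketch does not.

```python
from statistics import NormalDist
from typing import List


def equal_area_codebook(levels: int = 16) -> List[float]:
    """Quantile-based levels for N(0, 1), normalized into [-1, 1].

    Illustrative only: a simplified stand-in for an NF4-style codebook.
    """
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # Split the distribution into `levels` bins of equal probability mass
    # and take the quantile at the midpoint of each bin.
    probabilities = [(i + 0.5) / levels for i in range(levels)]
    quantiles = [std_normal.inv_cdf(p) for p in probabilities]
    # Rescale so the outermost levels land on -1 and 1.
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


codebook = equal_area_codebook()
```

The resulting levels are packed more densely near zero, where a normal distribution concentrates most of its mass, and more sparsely toward the tails.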

For more information read the paper: QLoRA: Efficient Finetuning of Quantized LLMs (https://arxiv.org/abs/2305.14314)

Implementation of the NF4 data type in bitsandbytes can be found in the create_normal_map function in the functional.py file: https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/functional.py#L236.

__init__

( input_features output_features bias = True compute_dtype = None compress_statistics = True quant_storage = torch.uint8 device = None )

Parameters

  • input_features (int) — Number of input features of the linear layer.
  • output_features (int) — Number of output features of the linear layer.
  • bias (bool, defaults to True) — Whether the linear class uses the bias term as well.

Params4bit

class bitsandbytes.nn.Params4bit

( data: typing.Optional[torch.Tensor] = None requires_grad = False quant_state: typing.Optional[bitsandbytes.functional.QuantState] = None blocksize: int = 64 compress_statistics: bool = True quant_type: str = 'fp4' quant_storage: dtype = torch.uint8 module: typing.Optional[ForwardRef('Linear4bit')] = None bnb_quantized: bool = False )
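As a rough guide to what blocksize, compress_statistics, and quant_storage imply for memory, the arithmetic below follows the figures given in the QLoRA paper. It is a back-of-the-envelope sketch, not a description of the library's exact memory layout: the helper names are hypothetical, and it assumes two 4-bit codes per uint8, one 32-bit scale per block, and, when statistics are compressed, 8-bit scales with one 32-bit second-level constant per 256 blocks.

```python
def packed_storage_bytes(num_params: int) -> int:
    """Bytes needed for 4-bit codes, assuming two codes are packed per uint8."""
    return (num_params + 1) // 2


def statistics_bits_per_param(
    blocksize: int = 64,
    compress_statistics: bool = True,
    stats_blocksize: int = 256,  # assumed second-level block size, per the QLoRA paper
) -> float:
    """Overhead of the per-block scaling constants, in bits per parameter."""
    if not compress_statistics:
        # One 32-bit scale shared by every `blocksize` parameters.
        return 32 / blocksize
    # 8-bit scales, plus one 32-bit constant per `stats_blocksize` blocks.
    return 8 / blocksize + 32 / (blocksize * stats_blocksize)


# A 64 x 64 weight matrix, as in the Linear4bit example.
weight_bytes = packed_storage_bytes(64 * 64)
overhead_plain = statistics_bits_per_param(64, compress_statistics=False)
overhead_compressed = statistics_bits_per_param(64, compress_statistics=True)
```

Under these assumptions, compressing the statistics lowers the overhead from 0.5 bits to roughly 0.127 bits per parameter at a block size of 64.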

__init__

( *args **kwargs )

Initialize self. See help(type(self)) for accurate signature.
