4-bit quantization
QLoRA is a finetuning method that quantizes a model to 4 bits, adds a set of low-rank adaptation (LoRA) weights to the model, and tunes them by backpropagating through the quantized weights. This method also introduces a new data type, 4-bit NormalFloat (LinearNF4), in addition to the standard Float4 data type (LinearFP4). LinearNF4 is a quantization data type designed for normally distributed data and can improve performance over FP4.
Linear4bit
class bitsandbytes.nn.Linear4bit
( input_features, output_features, bias = True, compute_dtype = None, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8, device = None )
This class is the base module for the 4-bit quantization algorithm presented in QLoRA. QLoRA 4-bit linear layers use blockwise k-bit quantization under the hood, with the possibility of selecting between quantization data types such as FP4 and NF4 (via quant_type) and choosing the dtype used for computation (via compute_dtype).
To quantize a linear layer, first load the original fp16 / bf16 weights into the Linear4bit module, then call quantized_module.to("cuda") to quantize the fp16 / bf16 weights.
Example:
import torch
import torch.nn as nn

import bitsandbytes as bnb
from bitsandbytes.nn import Linear4bit

# Reference model holding the original weights
fp16_model = nn.Sequential(
    nn.Linear(64, 64),
    nn.Linear(64, 64)
)

# Same architecture, built from 4-bit linear layers
quantized_model = nn.Sequential(
    Linear4bit(64, 64),
    Linear4bit(64, 64)
)

# Copy the original weights in, then move to the GPU
quantized_model.load_state_dict(fp16_model.state_dict())
quantized_model = quantized_model.to(0)  # Quantization happens here
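The blockwise quantization mentioned above can be illustrated with a small pure-Python sketch. This is not the bitsandbytes implementation (which runs as GPU kernels and uses the FP4 or NF4 codebooks); it assumes a hypothetical uniform 16-level codebook and hypothetical helper names, and only shows the general idea: split the values into blocks, scale each block by its absolute maximum, and store the index of the nearest codebook entry.

```python
from bisect import bisect_left


def make_uniform_codebook(levels=16):
    # Hypothetical stand-in codebook: `levels` evenly spaced values in [-1, 1].
    # FP4 and NF4 use non-uniform codebooks instead.
    return [-1.0 + 2.0 * i / (levels - 1) for i in range(levels)]


def nearest_index(codebook, x):
    # Index of the codebook entry closest to x (codebook must be sorted).
    pos = bisect_left(codebook, x)
    if pos == 0:
        return 0
    if pos == len(codebook):
        return len(codebook) - 1
    before, after = codebook[pos - 1], codebook[pos]
    return pos if (after - x) < (x - before) else pos - 1


def quantize_blockwise(values, codebook, blocksize=64):
    # Returns one 4-bit code per value plus one absmax scale per block.
    codes, absmax = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        absmax.append(scale)
        codes.extend(nearest_index(codebook, v / scale) for v in block)
    return codes, absmax


def dequantize_blockwise(codes, absmax, codebook, blocksize=64):
    # Look up each code and rescale it by the absmax of its block.
    return [codebook[c] * absmax[i // blocksize] for i, c in enumerate(codes)]


codebook = make_uniform_codebook()
weights = [0.5, -2.0, 1.0, 0.0, 10.0, -5.0, 2.5, 0.1]
codes, absmax = quantize_blockwise(weights, codebook, blocksize=4)
restored = dequantize_blockwise(codes, absmax, codebook, blocksize=4)
```

Because each block carries its own scale, the outlier 10.0 in the second block does not reduce the resolution available to the first block.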
__init__
( input_features, output_features, bias = True, compute_dtype = None, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8, device = None )
Initialize Linear4bit class.
LinearFP4
class bitsandbytes.nn.LinearFP4
Implements the FP4 data type.
__init__
LinearNF4
class bitsandbytes.nn.LinearNF4
( input_features, output_features, bias = True, compute_dtype = None, compress_statistics = True, quant_storage = torch.uint8, device = None )
Implements the NF4 data type.
Constructs a quantization data type where each bin has equal area under a standard normal distribution N(0, 1) that is normalized into the range [-1, 1].
For more information read the paper: QLoRA: Efficient Finetuning of Quantized LLMs (https://arxiv.org/abs/2305.14314)
The implementation of the NF4 data type in bitsandbytes can be found in the create_normal_map function in the functional.py file: https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/functional.py#L236.
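As a rough illustration of the "equal area" idea described above, the sketch below builds a 16-value codebook from the quantiles of N(0, 1) and normalizes it into [-1, 1]. It is a simplified, symmetric construction written for this page with a hypothetical function name, not a copy of create_normal_map: the NF4 codebook described in the QLoRA paper is built asymmetrically so that zero is represented exactly, which this sketch does not do.

```python
from statistics import NormalDist


def equal_area_codebook(levels=16):
    # Hypothetical helper: take the probability midpoint of each of `levels`
    # equal-probability bins of N(0, 1), then rescale so that the largest
    # magnitude becomes 1.
    normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [normal.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    max_abs = max(abs(q) for q in quantiles)
    return [q / max_abs for q in quantiles]


codebook = equal_area_codebook()
for value in codebook:
    print(f"{value:+.4f}")
```

The resulting values are packed more densely around zero than in the tails, which is where most of the mass of normally distributed weights lies.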
__init__
( input_features, output_features, bias = True, compute_dtype = None, compress_statistics = True, quant_storage = torch.uint8, device = None )
Params4bit
class bitsandbytes.nn.Params4bit
( data: typing.Optional[torch.Tensor] = None, requires_grad = False, quant_state: typing.Optional[bitsandbytes.functional.QuantState] = None, blocksize: int = 64, compress_statistics: bool = True, quant_type: str = 'fp4', quant_storage: dtype = torch.uint8, module: typing.Optional[ForwardRef('Linear4bit')] = None, bnb_quantized: bool = False )
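To give a feel for what blocksize, compress_statistics, and quant_storage mean for memory, the following back-of-the-envelope sketch estimates storage cost. The function names are hypothetical, and the numbers are assumptions taken from the QLoRA paper rather than read from this class: one 32-bit absmax constant per block, and, when statistics are compressed, 8-bit constants with a second level of 32-bit constants over groups of 256 blocks.

```python
def packed_bytes(num_params):
    # Two 4-bit codes fit in one uint8 storage element.
    return (num_params + 1) // 2


def bits_per_parameter(blocksize=64, compress_statistics=True,
                       absmax_bits=32, stats_blocksize=256, compressed_bits=8):
    # Assumed accounting (QLoRA paper): 4 bits for the code itself plus the
    # per-block quantization constants amortized over the block.
    if compress_statistics:
        overhead = (compressed_bits / blocksize
                    + absmax_bits / (blocksize * stats_blocksize))
    else:
        overhead = absmax_bits / blocksize
    return 4 + overhead


# A 64 x 64 weight matrix, as in the Linear4bit example
print(packed_bytes(64 * 64))
print(bits_per_parameter(compress_statistics=False))
print(bits_per_parameter(compress_statistics=True))
```

Under these assumptions, compressing the statistics lowers the overhead from 0.5 bits to roughly 0.127 bits per parameter at the default block size of 64.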