Transformers documentation

AQLM

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v4.41.3).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

AQLM

Try AQLM on Google Colab!

Additive Quantization of Language Models (AQLM) is a Large Language Models compression method. It quantizes multiple weights together and take advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.

Inference support for AQLM is realised in the aqlm library. Make sure to install it to run the models (note aqlm works only with python>=3.10):

pip install aqlm[gpu,cpu]

The library provides efficient kernels for both GPU and CPU inference and training.

The instructions on how to quantize models yourself, as well as all the relevant code can be found in the corresponding GitHub repository. To run AQLM models simply load a model that has been quantized with AQLM:

from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    torch_dtype="auto", 
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")

PEFT

Starting with version aqlm 1.0.2, AQLM supports Parameter-Efficient Fine-Tuning in a form of LoRA integrated into the PEFT library.

AQLM configurations

AQLM quantization setups vary mainly on the number of codebooks used as well as codebook sizes in bits. The most popular setups, as well as inference kernels they support are:

Kernel Number of codebooks Codebook size, bits Notation Accuracy Speedup Fast GPU inference Fast CPU inference
Triton K N KxN - Up to ~0.7x ✅ ❌
CUDA 1 16 1x16 Best Up to ~1.3x ✅ ❌
CUDA 2 8 2x8 OK Up to ~3.0x ✅ ❌
Numba K 8 Kx8 Good Up to ~4.0x ❌ ✅
< > Update on GitHub