Transformers documentation

HIGGS

HIGGS is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-optimal quantization grids to achieve lower quantization error and state-of-the-art performance. You can find more information in the paper at arxiv.org/abs/2411.17525.

Runtime support for HIGGS is implemented through FLUTE and its accompanying library.
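
To make the two ingredients named above more concrete, here is a minimal pure-Python sketch of the general idea: rotate a group of weights with an orthonormal Hadamard transform, round the result to the nearest point of a small grid, then undo the rotation. It is an illustration only, not the HIGGS or FLUTE implementation. The function names, the example weights, and the 2-bit scalar grid (approximate Lloyd-Max levels for a unit-variance Gaussian) are all assumptions chosen for the example; the grids, group sizes, and scaling that HIGGS actually uses are described in the paper.

```python
import math


def hadamard_transform(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal and self-inverse
    return [v * scale for v in out]


def quantize_to_grid(values, grid):
    """Round each value to its nearest grid point (the MSE-minimizing choice for a fixed grid)."""
    return [min(grid, key=lambda g: abs(v - g)) for v in values]


# Assumption: a 2-bit scalar grid, approximate Lloyd-Max levels for a unit Gaussian.
GRID = [-1.510, -0.4528, 0.4528, 1.510]

# Hypothetical weight group containing one outlier.
weights = [0.1, -0.2, 0.05, 4.0, -0.1, 0.3, 0.0, -0.05]

rotated = hadamard_transform(weights)  # spreads the outlier across all entries
rms = math.sqrt(sum(v * v for v in rotated) / len(rotated))  # per-group scale
codes = quantize_to_grid([v / rms for v in rotated], GRID)  # grid values, 2 bits each
restored = hadamard_transform([c * rms for c in codes])  # undo scaling and rotation

error = sum((w - r) ** 2 for w, r in zip(weights, restored))
print(f"squared reconstruction error: {error:.4f}")
```

Because the normalized transform is orthogonal, the squared error introduced in the rotated space equals the squared error on the original weights, which is why choosing the grid to minimize MSE after the rotation is meaningful.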

Quantization Example

from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

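# Tokenize a prompt and move it to the model's device.
inputs = tokenizer("Hi,", return_tensors="pt").to(model.device)

# temperature and top_p only take effect when sampling is enabled.
output_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.5,
    top_p=0.80,
)

print(tokenizer.decode(output_ids[0]))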

Pre-quantized models

Some pre-quantized models can be found in the official collection on the Hugging Face Hub.

Current Limitations

Architectures

Currently, FLUTE, and HIGGS by extension, only support Llama 3 and 3.1 models with 8B, 70B and 405B parameters, as well as Gemma-2 9B and 27B. We’re working on supporting a wider range of models, and on supporting arbitrary models by modifying the FLUTE compilation procedure.

torch.compile

HIGGS is fully compatible with torch.compile. With model.forward compiled, as described here, it reaches the following throughput on an RTX 4090 for Llama-3.1-8B-Instruct (forward passes/sec):

| Batch size | BF16 (with torch.compile) | HIGGS 4-bit (no torch.compile) | HIGGS 4-bit (with torch.compile) |
|---|---|---|---|
| 1 | 59 | 41 | 124 |
| 4 | 57 | 42 | 123 |
| 16 | 56 | 41 | 120 |
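
The table reports raw throughput rather than ratios, so a short script helps read the speedups off it. The sketch below only redoes the arithmetic on the numbers above; the dictionary keys and the helper name are made up for the example. On these figures, compiled HIGGS comes out at roughly 2.1x the compiled BF16 throughput and about 3x the uncompiled HIGGS throughput at every batch size, while uncompiled HIGGS is slower than compiled BF16.

```python
# Throughput in forward passes/sec, copied from the table above.
throughput = {
    1: {"bf16_compiled": 59, "higgs": 41, "higgs_compiled": 124},
    4: {"bf16_compiled": 57, "higgs": 42, "higgs_compiled": 123},
    16: {"bf16_compiled": 56, "higgs": 41, "higgs_compiled": 120},
}


def speedups(row):
    """Ratios of compiled HIGGS throughput to the two other configurations."""
    return {
        "vs_bf16_compiled": row["higgs_compiled"] / row["bf16_compiled"],
        "vs_higgs_uncompiled": row["higgs_compiled"] / row["higgs"],
    }


ratios = {batch: speedups(row) for batch, row in throughput.items()}

for batch, r in ratios.items():
    print(
        f"batch {batch:>2}: {r['vs_bf16_compiled']:.2f}x over compiled BF16, "
        f"{r['vs_higgs_uncompiled']:.2f}x over uncompiled HIGGS"
    )
```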

Quantized training

Currently, HIGGS doesn’t support quantized training (and backward passes in general). We’re working on adding support for it.
