HIGGS
HIGGS is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-optimal quantization grids to achieve lower quantization error and state-of-the-art performance. You can find more information in the paper at arxiv.org/abs/2411.17525.
Runtime support for HIGGS is implemented through the FLUTE library.
Quantization Example
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
tokenizer.decode(model.generate(
    **tokenizer("Hi,", return_tensors="pt").to(model.device),
    do_sample=True,  # required for temperature and top_p to take effect
    temperature=0.5,
    top_p=0.80,
)[0])
Pre-quantized models
Some pre-quantized models can be found in the official collection on the Hugging Face Hub.
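As a minimal sketch, loading a pre-quantized checkpoint works like loading any other model, since the quantization settings are stored in the checkpoint's config; the repository id below is a placeholder, so substitute one from the collection above.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- replace it with an actual HIGGS checkpoint from the collection above.
repo_id = "<organization>/<model-name>-HIGGS"

# No HiggsConfig is needed here: the quantization config is read from the checkpoint itself.
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)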
Current Limitations
Architectures
Currently, FLUTE, and HIGGS by extension, only support the Llama 3.1 and 3.0 models with 8B, 70B, and 405B parameters, as well as Gemma-2 with 9B and 27B parameters. We're working on supporting a more diverse set of models, as well as on allowing arbitrary models by modifying the FLUTE compilation procedure.
torch.compile
HIGGS is fully compatible with torch.compile. Compiling model.forward, as described here, yields the following speedups on an RTX 4090 for Llama-3.1-8B-Instruct (forward passes per second); a minimal usage sketch follows the table:
| Batch Size | BF16 (With torch.compile) | HIGGS 4bit (No torch.compile) | HIGGS 4bit (With torch.compile) |
|---|---|---|---|
| 1 | 59 | 41 | 124 |
| 4 | 57 | 42 | 123 |
| 16 | 56 | 41 | 120 |
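Below is a minimal sketch of enabling torch.compile for a HIGGS-quantized model; the checkpoint name and generation settings are illustrative rather than prescribed by the library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

# Quantize the model with HIGGS as in the example above (checkpoint name is illustrative).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Compile only the forward pass; the first few calls are slower while compilation runs.
model.forward = torch.compile(model.forward)

inputs = tokenizer("Hi,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))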
Quantized training
Currently, HIGGS doesn’t support quantized training (and backward passes in general). We’re working on adding support for it.