HIGGS
HIGGS is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-optimal quantization grids to achieve lower quantization error and state-of-the-art performance. You can find more information in the paper at arxiv.org/abs/2411.17525.
Runtime support for HIGGS is implemented through the FLUTE library.
Quantization Example
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
tokenizer.decode(model.generate(
    **tokenizer("Hi,", return_tensors="pt").to(model.device),
    do_sample=True,  # required for temperature and top_p to take effect
    temperature=0.5,
    top_p=0.80,
)[0])
Pre-quantized models
Some pre-quantized models can be found in the official collection on the Hugging Face Hub.
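As a minimal sketch, loading a pre-quantized checkpoint works like loading any other model, since the quantization settings are stored in the checkpoint's config; the repository id below is a placeholder, so substitute one from the collection above.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- replace it with an actual HIGGS checkpoint from the collection above.
repo_id = "<organization>/<model-name>-HIGGS"

# No HiggsConfig is needed here: the quantization config is read from the checkpoint itself.
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)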
Current Limitations
Architectures
Currently, FLUTE, and HIGGS by extension, only support the Llama 3.1 and 3.0 models with 8B, 70B, and 405B parameters, as well as Gemma-2 with 9B and 27B parameters. We're working on supporting a more diverse set of models, as well as on allowing arbitrary models by modifying the FLUTE compilation procedure.
torch.compile
HIGGS is fully compatible with torch.compile. Compiling model.forward, as described here, yields the following speedups on an RTX 4090 for Llama-3.1-8B-Instruct (forward passes per second); a minimal usage sketch follows the table:
| Batch Size | BF16 (With torch.compile) | HIGGS 4bit (No torch.compile) | HIGGS 4bit (With torch.compile) |
|---|---|---|---|
| 1 | 59 | 41 | 124 |
| 4 | 57 | 42 | 123 |
| 16 | 56 | 41 | 120 |
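Below is a minimal sketch of enabling torch.compile for a HIGGS-quantized model; the checkpoint name and generation settings are illustrative rather than prescribed by the library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

# Quantize the model with HIGGS as in the example above (checkpoint name is illustrative).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Compile only the forward pass; the first few calls are slower while compilation runs.
model.forward = torch.compile(model.forward)

inputs = tokenizer("Hi,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))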
Quantized training
Currently, HIGGS doesn’t support quantized training (and backward passes in general). We’re working on adding support for it.