Brevitas is an AMD library for neural network quantization. 🤗 Optimum-AMD integrates with Brevitas to make it easier to quantize Transformers models through Brevitas.
This integration also makes it possible to export models quantized through Brevitas to ONNX.
For a refresher on quantization, please have a look at this documentation.
Please refer to `BrevitasQuantizer` and `BrevitasQuantizationConfig` for all available options.
## Supported models

Currently, only the following architectures are tested and supported:

- Llama
- OPT
## Dynamic quantization

With dynamic quantization, the quantization parameters of the activations are computed on the fly during inference, so no calibration dataset is required:

```python
from optimum.amd import BrevitasQuantizationConfig, BrevitasQuantizer

# Prepare the quantizer, specifying its configuration and loading the model.
qconfig = BrevitasQuantizationConfig(
    is_static=False,
    apply_gptq=False,
    apply_weight_equalization=False,
    activations_equalization=False,
    weights_symmetric=True,
    activations_symmetric=False,
)
quantizer = BrevitasQuantizer.from_pretrained("facebook/opt-125m")

model = quantizer.quantize(qconfig)
```
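The quantized model remains a regular Transformers model, so you can sanity-check it with the usual generation API. Below is a minimal sketch, assuming the model still exposes the standard `generate()` interface of `transformers` causal LMs; the prompt and generation parameters are illustrative only:

```python
from transformers import AutoTokenizer

# Minimal sanity check (illustrative): the quantized model is assumed to
# still expose the standard transformers generate() API.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```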
## Static quantization

With static quantization, the quantization parameters of the activations are precomputed ahead of time from a calibration dataset:

```python
from optimum.amd import BrevitasQuantizationConfig, BrevitasQuantizer
from optimum.amd.brevitas.data_utils import get_dataset_for_model
from transformers import AutoTokenizer

# Prepare the quantizer, specifying its configuration and loading the model.
qconfig = BrevitasQuantizationConfig(
    is_static=True,
    apply_gptq=False,
    apply_weight_equalization=True,
    activations_equalization=False,
    weights_symmetric=True,
    activations_symmetric=False,
)
quantizer = BrevitasQuantizer.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Load the data for calibration and evaluation.
calibration_dataset = get_dataset_for_model(
    "facebook/opt-125m",
    qconfig=qconfig,
    dataset_name="wikitext2",
    tokenizer=tokenizer,
    nsamples=128,
    seqlen=512,
    split="train",
)

model = quantizer.quantize(qconfig, calibration_dataset)
```
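After calibration, you may want a quick quality check. The sketch below computes perplexity over a few held-out samples; it assumes `get_dataset_for_model` accepts `split="validation"` and returns a list of dicts containing an `input_ids` tensor, which you should verify against your installed version:

```python
import torch

# Hedged evaluation sketch: perplexity over a small held-out set.
# Assumes each sample is a dict with an "input_ids" tensor.
validation_dataset = get_dataset_for_model(
    "facebook/opt-125m",
    qconfig=qconfig,
    dataset_name="wikitext2",
    tokenizer=tokenizer,
    nsamples=16,
    seqlen=512,
    split="validation",
)

nlls = []
with torch.no_grad():
    for sample in validation_dataset:
        input_ids = sample["input_ids"]
        # Passing input_ids as labels yields the standard causal-LM loss.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        nlls.append(loss)
print(f"Perplexity: {torch.exp(torch.stack(nlls).mean()).item():.2f}")
```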
## Export Brevitas models to ONNX

Brevitas-quantized models can be exported to ONNX using Optimum:

```python
from optimum.amd.brevitas.export import onnx_export_from_quantized_model

# Export to ONNX through optimum.exporters.
onnx_export_from_quantized_model(model, "llm_quantized_onnx")
```
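To verify the export, you can reload the model with ONNX Runtime through `optimum.onnxruntime`. This is a hedged sketch: it assumes your ONNX Runtime build supports the quantized operators in the exported graph:

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Hedged sketch: reload the exported ONNX model and run a short
# generation, assuming ONNX Runtime supports the exported operators.
ort_model = ORTModelForCausalLM.from_pretrained("llm_quantized_onnx")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = ort_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```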
## Complete example
A complete example is available at https://github.com/huggingface/optimum-amd/tree/main/examples/quantization/brevitas.