Brevitas is an AMD library for neural network quantization. 🤗 Optimum-AMD integrates with Brevitas to make it easier to quantize Transformers models through Brevitas.
This integration also makes it possible to export models quantized through Brevitas to ONNX.
For a refresher on quantization, please have a look at this documentation.
Please refer to `BrevitasQuantizer` and `BrevitasQuantizationConfig` for all available options.
## Supported models

Currently, only the following architectures are tested and supported:

- Llama
- OPT
## Dynamic quantization

With dynamic quantization, the quantization parameters of the activations are computed on the fly during inference, so no calibration dataset is required:

```python
from optimum.amd import BrevitasQuantizationConfig, BrevitasQuantizer

# Prepare the quantizer, specifying its configuration and loading the model.
qconfig = BrevitasQuantizationConfig(
    is_static=False,
    apply_gptq=False,
    apply_weight_equalization=False,
    activations_equalization=False,
    weights_symmetric=True,
    activations_symmetric=False,
)
quantizer = BrevitasQuantizer.from_pretrained("facebook/opt-125m")

model = quantizer.quantize(qconfig)
```
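The quantized model remains a regular Transformers model, so you can sanity-check it with the usual generation API. Below is a minimal sketch, assuming the model still exposes the standard `generate()` interface of `transformers` causal LMs; the prompt and generation parameters are illustrative only:

```python
from transformers import AutoTokenizer

# Minimal sanity check (illustrative): the quantized model is assumed to
# still expose the standard transformers generate() API.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```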
## Static quantization

With static quantization, the quantization parameters of the activations are precomputed ahead of time from a calibration dataset:

```python
from optimum.amd import BrevitasQuantizationConfig, BrevitasQuantizer
from optimum.amd.brevitas.data_utils import get_dataset_for_model
from transformers import AutoTokenizer

# Prepare the quantizer, specifying its configuration and loading the model.
qconfig = BrevitasQuantizationConfig(
    is_static=True,
    apply_gptq=False,
    apply_weight_equalization=True,
    activations_equalization=False,
    weights_symmetric=True,
    activations_symmetric=False,
)
quantizer = BrevitasQuantizer.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Load the data for calibration and evaluation.
calibration_dataset = get_dataset_for_model(
    "facebook/opt-125m",
    qconfig=qconfig,
    dataset_name="wikitext2",
    tokenizer=tokenizer,
    nsamples=128,
    seqlen=512,
    split="train",
)

model = quantizer.quantize(qconfig, calibration_dataset)
```
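After calibration, you may want a quick quality check. The sketch below computes perplexity over a few held-out samples; it assumes `get_dataset_for_model` accepts `split="validation"` and returns a list of dicts containing an `input_ids` tensor, which you should verify against your installed version:

```python
import torch

# Hedged evaluation sketch: perplexity over a small held-out set.
# Assumes each sample is a dict with an "input_ids" tensor.
validation_dataset = get_dataset_for_model(
    "facebook/opt-125m",
    qconfig=qconfig,
    dataset_name="wikitext2",
    tokenizer=tokenizer,
    nsamples=16,
    seqlen=512,
    split="validation",
)

nlls = []
with torch.no_grad():
    for sample in validation_dataset:
        input_ids = sample["input_ids"]
        # Passing input_ids as labels yields the standard causal-LM loss.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        nlls.append(loss)
print(f"Perplexity: {torch.exp(torch.stack(nlls).mean()).item():.2f}")
```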
## Export Brevitas models to ONNX

Brevitas-quantized models can be exported to ONNX using Optimum:

```python
from optimum.amd.brevitas.export import onnx_export_from_quantized_model

# Export to ONNX through optimum.exporters.
onnx_export_from_quantized_model(model, "llm_quantized_onnx")
```
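To verify the export, you can reload the model with ONNX Runtime through `optimum.onnxruntime`. This is a hedged sketch: it assumes your ONNX Runtime build supports the quantized operators in the exported graph:

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Hedged sketch: reload the exported ONNX model and run a short
# generation, assuming ONNX Runtime supports the exported operators.
ort_model = ORTModelForCausalLM.from_pretrained("llm_quantized_onnx")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = ort_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```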
## Complete example
A complete example is available at https://github.com/huggingface/optimum-amd/tree/main/examples/quantization/brevitas.