Quantization
🤗 Optimum provides an optimum.onnxruntime package that enables you to apply quantization to many models hosted on the Hugging Face Hub, using the ONNX Runtime quantization tool.
The quantization process is abstracted via the ORTConfig and the ORTQuantizer classes: the former specifies how quantization should be done, while the latter actually performs it. You can read the conceptual guide on quantization to learn about the main concepts you will be using when performing quantization with the ORTQuantizer.
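In practice the workflow always has two steps: build a quantization configuration describing how to quantize, then hand it to a quantizer that applies it. A minimal sketch of the pattern the examples below follow (the examples in this guide build the configuration with the AutoQuantizationConfig factory):
>>> from optimum.onnxruntime import ORTQuantizer
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
# 1. Describe *how* to quantize (here: dynamic quantization targeting AVX512-VNNI CPUs)
>>> qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# 2. Apply that configuration to an ONNX model
>>> quantizer = ORTQuantizer.from_pretrained("path/to/model")
>>> quantizer.quantize(save_dir="path/to/output/model", quantization_config=qconfig)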
Creating an ORTQuantizer
The ORTQuantizer class is used to quantize your ONNX model. It can be initialized using the from_pretrained() method, which supports different checkpoint formats.
- Using an already initialized ORTModelForXXX class.
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
# Load an ONNX model from the Hub
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(
...     "optimum/distilbert-base-uncased-finetuned-sst-2-english"
... )
# Create a quantizer from an ORTModelForXXX
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)
# Configuration
>>> ...
# Quantize the model
>>> quantizer.quantize(...)
- Using a local ONNX model from a directory.
>>> from optimum.onnxruntime import ORTQuantizer
# This assumes a model.onnx exists in path/to/model
>>> quantizer = ORTQuantizer.from_pretrained("path/to/model")
# Configuration
>>> ...
# Quantize the model
>>> quantizer.quantize(...)
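If the directory contains several ONNX files, the file_name argument lets you pick the one to quantize (the Seq2Seq example below relies on this):
>>> from optimum.onnxruntime import ORTQuantizer
# Select a specific ONNX file when several live in the same directory
>>> quantizer = ORTQuantizer.from_pretrained("path/to/model", file_name="model.onnx")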
Apply Dynamic Quantization
The ORTQuantizer class can be used to dynamically quantize your ONNX model. Below is an easy end-to-end example of how to dynamically quantize distilbert-base-uncased-finetuned-sst-2-english.
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load PyTorch model and convert to ONNX
>>> onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
# Create quantizer
>>> quantizer = ORTQuantizer.from_pretrained(onnx_model)
# Define the quantization strategy by creating the appropriate configuration
>>> dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# Quantize the model
>>> model_quantized_path = quantizer.quantize(
...     save_dir="path/to/output/model",
...     quantization_config=dqconfig,
... )
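Once saved, the quantized model can be loaded back with the same ORTModelForXXX classes and used like any other model, for instance through a transformers pipeline. A minimal sketch reusing model_id from the example above; the file name model_quantized.onnx assumes the quantizer's default "<name>_quantized.onnx" naming convention:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
# Load the quantized model back (file name assumed to follow the default
# "<name>_quantized.onnx" convention used by the quantizer)
>>> quantized_model = ORTModelForSequenceClassification.from_pretrained(
...     "path/to/output/model", file_name="model_quantized.onnx"
... )
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
>>> classifier("I love the quantized model!")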
Apply Static Quantization
The ORTQuantizer class can be used to statically quantize your ONNX model. Below is an easy end-to-end example of how to statically quantize distilbert-base-uncased-finetuned-sst-2-english.
>>> from functools import partial
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the PyTorch model, convert it to ONNX, create the quantizer and set up the config
>>> onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> quantizer = ORTQuantizer.from_pretrained(onnx_model)
>>> qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)
# Create the calibration dataset
>>> def preprocess_fn(ex, tokenizer):
...     return tokenizer(ex["sentence"])
>>> calibration_dataset = quantizer.get_calibration_dataset(
...     "glue",
...     dataset_config_name="sst2",
...     preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
...     num_samples=50,
...     dataset_split="train",
... )
# Create the calibration configuration containing the parameters related to calibration.
>>> calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
# Perform the calibration step: compute the activation quantization ranges
>>> ranges = quantizer.fit(
...     dataset=calibration_dataset,
...     calibration_config=calibration_config,
...     operators_to_quantize=qconfig.operators_to_quantize,
... )
# Apply static quantization on the model
>>> model_quantized_path = quantizer.quantize(
...     save_dir="path/to/output/model",
...     calibration_tensors_range=ranges,
...     quantization_config=qconfig,
... )
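A quick way to sanity-check the result is to compare on-disk sizes; INT8 weights are typically around four times smaller than their FP32 counterparts. A minimal sketch reusing onnx_model from above, assuming model_save_dir is a pathlib.Path and the quantized file keeps the default name:
>>> import os
# Compare the size of the original and quantized ONNX files
>>> original_size = os.path.getsize(onnx_model.model_save_dir / "model.onnx")
>>> quantized_size = os.path.getsize("path/to/output/model/model_quantized.onnx")
>>> print(f"Original: {original_size / 1e6:.1f} MB, quantized: {quantized_size / 1e6:.1f} MB")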
Quantize Seq2Seq models
The ORTQuantizer class currently doesn't support multi-file models, such as ORTModelForSeq2SeqLM. If you want to quantize a Seq2Seq model, you have to quantize each of the model's components individually.
Currently, only dynamic quantization is supported for Seq2Seq models.
- Load the Seq2Seq model as an ORTModelForSeq2SeqLM.
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSeq2SeqLM
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
# Load the Seq2Seq model and get the directory holding its ONNX files
>>> model_id = "optimum/t5-small"
>>> onnx_model = ORTModelForSeq2SeqLM.from_pretrained(model_id)
>>> model_dir = onnx_model.model_save_dir
- Define a quantizer for the encoder, the decoder, and the decoder with past key values.
# Create encoder quantizer
>>> encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")
# Create decoder quantizer
>>> decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")
# Create decoder with past key values quantizer
>>> decoder_wp_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_with_past_model.onnx")
# Create the list of quantizers
>>> quantizers = [encoder_quantizer, decoder_quantizer, decoder_wp_quantizer]
- Quantize all model components.
# Define the quantization strategy by creating the appropriate configuration
>>> dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# Quantize the models individually
>>> [q.quantize(save_dir=".", quantization_config=dqconfig) for q in quantizers]
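The three quantized files can then be loaded back into a single ORTModelForSeq2SeqLM. A hedged sketch: the encoder_file_name, decoder_file_name and decoder_with_past_file_name arguments and the "_quantized" suffix reflect the library's default naming and may differ between versions:
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM
# Reassemble the quantized components into a single model
# (file names assume the default "<name>_quantized.onnx" convention)
>>> quantized_model = ORTModelForSeq2SeqLM.from_pretrained(
...     ".",
...     encoder_file_name="encoder_model_quantized.onnx",
...     decoder_file_name="decoder_model_quantized.onnx",
...     decoder_with_past_file_name="decoder_with_past_model_quantized.onnx",
... )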