Quickstart

At its core, 🤗 Optimum uses configuration objects to define parameters for optimization on different accelerators. These objects are then used to instantiate dedicated optimizers, quantizers, and pruners. For example, here’s how you can apply dynamic quantization with ONNX Runtime:

>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> from optimum.onnxruntime import ORTQuantizer

>>> # The model we wish to quantize
>>> model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
>>> # The type of quantization to apply
>>> qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
>>> quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="sequence-classification")
>>> # Quantize the model!
>>> quantizer.export(
...     onnx_model_path="model.onnx",
...     onnx_quantized_model_output_path="model-quantized.onnx",
...     quantization_config=qconfig,
... )

In this example, we’ve quantized a model from the Hugging Face Hub, but it could also be a path to a local model directory. The feature argument in the from_pretrained() method corresponds to the type of task that we wish to quantize the model for. The result from applying the export() method is a model-quantized.onnx file that can be used to run inference. Here’s an example of how to load an ONNX Runtime model and generate predictions with it:

>>> from functools import partial
>>> from datasets import Dataset
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModel

>>> # Load quantized model
>>> ort_model = ORTModel("model-quantized.onnx", quantizer._onnx_config)
>>> # Create a dataset or load one from the Hub
>>> ds = Dataset.from_dict({"sentence": ["I love burritos!"]})
>>> # Load the tokenizer that matches the checkpoint
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

>>> def preprocess_fn(ex, tokenizer):
...     return tokenizer(ex["sentence"])

>>> # Tokenize the inputs
>>> tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=tokenizer))
>>> ort_outputs = ort_model.evaluation_loop(tokenized_ds)
>>> # Extract logits!
>>> ort_outputs.predictions
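
To give some intuition for what quantization does to a tensor, here is a minimal pure-Python sketch of 8-bit affine quantization. It is an illustration only and not how ONNX Runtime or 🤗 Optimum implement it; the helper names (affine_params, quantize, dequantize) and the sample weights are made up for this example. With dynamic quantization, the same kind of scale and zero point is computed for the activations at inference time rather than ahead of time.

```python
def affine_params(x_min, x_max, qmin=-128, qmax=127):
    """Return (scale, zero_point) mapping [x_min, x_max] onto [qmin, qmax]."""
    # Widen the range so that 0.0 is always exactly representable
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical weights, chosen only for illustration
weights = [-1.0, -0.25, 0.0, 0.5, 1.55]
scale, zero_point = affine_params(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The round trip loses at most about one quantization step (scale) per value, which is the precision traded for the smaller 8-bit representation.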

Similarly, you can apply static quantization by simply setting is_static to True when instantiating the QuantizationConfig object:

>>> qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)

Static quantization relies on feeding batches of data through the model to estimate the activation quantization parameters ahead of inference time. To support this, 🤗 Optimum allows you to provide a calibration dataset. The calibration dataset can be a simple Dataset object from the 🤗 Datasets library, or any dataset that’s hosted on the Hugging Face Hub. For this example, we’ll pick the sst2 dataset that the model was originally trained on:

>>> from optimum.onnxruntime.configuration import AutoCalibrationConfig

>>> # Create the calibration dataset
>>> calibration_dataset = quantizer.get_calibration_dataset(
...     "glue",
...     dataset_config_name="sst2",
...     preprocess_function=partial(preprocess_fn, tokenizer=quantizer.tokenizer),
...     num_samples=50,
...     dataset_split="train",
... )
>>> # Create the calibration configuration containing the parameters related to calibration.
>>> calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
>>> # Perform the calibration step: computes the activations quantization ranges
>>> ranges = quantizer.fit(
...     dataset=calibration_dataset,
...     calibration_config=calibration_config,
...     onnx_model_path="model.onnx",
...     operators_to_quantize=qconfig.operators_to_quantize,
... )
>>> # Quantize the same way we did for dynamic quantization!
>>> quantizer.export(
...     onnx_model_path="model.onnx",
...     onnx_quantized_model_output_path="model-quantized.onnx",
...     calibration_tensors_range=ranges,
...     quantization_config=qconfig,
... )
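
The calibration step can be pictured with a small pure-Python sketch. It assumes that min-max calibration simply records the smallest and largest activation value seen across the calibration batches and derives the quantization parameters from that range. The MinMaxCalibrator class and the sample batches are hypothetical and are not part of 🤗 Optimum or ONNX Runtime.

```python
class MinMaxCalibrator:
    """Tracks the running min and max of the values observed for one tensor."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        # Update the running range with one batch of activation values
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def quantization_params(self, qmin=0, qmax=255):
        # Widen the range so that 0.0 is exactly representable
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
        zero_point = round(qmin - lo / scale)
        return scale, max(qmin, min(qmax, zero_point))


# Hypothetical activations produced by three calibration batches
calibration_batches = [[0.1, 0.7, -0.2], [1.3, 0.4, -0.5], [0.9, -0.1, 2.05]]

calibrator = MinMaxCalibrator()
for batch in calibration_batches:
    calibrator.observe(batch)

activation_range = (calibrator.lo, calibrator.hi)
scale, zero_point = calibrator.quantization_params()
```

Because the range is fixed ahead of time, nothing needs to be measured at inference time, but values that fall outside the calibrated range are clipped. This is why the calibration samples should resemble the data the model will see in production.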

As a final example, let’s take a look at applying graph optimization techniques such as operator fusion and constant folding. As before, we load a configuration object, but this time by setting the optimization level instead of the quantization approach:

>>> from optimum.onnxruntime.configuration import OptimizationConfig

>>> # optimization_level=99 enables all available graph optimizations
>>> optimization_config = OptimizationConfig(optimization_level=99)

Next, we load an optimizer to apply these optimizations to our model:

>>> from optimum.onnxruntime import ORTOptimizer

>>> optimizer = ORTOptimizer.from_pretrained(
...     model_checkpoint,
...     feature="sequence-classification",
... )
>>> # Export the optimized model
>>> optimizer.export(
...     onnx_model_path="model.onnx",
...     onnx_optimized_model_output_path="model-optimized.onnx",
...     optimization_config=optimization_config,
... )
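
For readers unfamiliar with the two techniques named above, the following toy sketch shows the idea behind each on a made-up graph representation. It is not ONNX Runtime's implementation: the tuple-based expression format, the fold_constants and fuse_matmul_add helpers, and the "FusedMatMulAdd" node name are all assumptions made for illustration.

```python
import operator

OPS = {"Add": operator.add, "Mul": operator.mul}


def fold_constants(node):
    """Replace operator nodes whose inputs are all constants by their value."""
    # Leaves are either numbers (constants) or strings (runtime inputs)
    if not isinstance(node, tuple):
        return node
    op, *args = node
    args = [fold_constants(a) for a in args]
    if all(isinstance(a, (int, float)) for a in args):
        return OPS[op](*args)  # computed once, ahead of inference
    return (op, *args)


def fuse_matmul_add(ops):
    """Merge each adjacent MatMul/Add pair into a single fused node."""
    fused, i = [], 0
    while i < len(ops):
        if ops[i : i + 2] == ["MatMul", "Add"]:
            fused.append("FusedMatMulAdd")  # hypothetical fused operator name
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


# (x * (2 + 3)) + (4 * 0.5), where "x" is only known at inference time
graph = ("Add", ("Mul", "x", ("Add", 2, 3)), ("Mul", 4, 0.5))
folded = fold_constants(graph)

layer_ops = ["MatMul", "Add", "Relu", "MatMul", "Add"]
fused_ops = fuse_matmul_add(layer_ops)
```

Both passes leave the result of the computation unchanged; they only reduce the number of nodes that have to be executed at inference time.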

And that’s it - the model is now optimized and ready for inference! As you can see, the process is similar in each case:

1. Define the optimization / quantization strategies via an OptimizationConfig / QuantizationConfig object
2. Instantiate an ORTQuantizer or ORTOptimizer class
3. Apply the export() method
4. Run inference

Check out the examples directory for more sophisticated usage.

Happy optimising 🤗!