Optimization

🤗 Optimum provides an optimum.onnxruntime package that enables you to apply graph optimization to many models hosted on the 🤗 hub using the ONNX Runtime model optimization tool.

Creating an `ORTOptimizer`

The ORTOptimizer class is used to optimize your ONNX model. The class can be initialized using the from_pretrained() method, which supports different checkpoint formats.

Using an already initialized ORTModelForXXX class.

>>> from optimum.onnxruntime import ORTOptimizer, ORTModelForSequenceClassification

>>> # Load an ONNX model from the Hub
>>> model = ORTModelForSequenceClassification.from_pretrained(
...     "optimum/distilbert-base-uncased-finetuned-sst-2-english"
... )

>>> # Create an optimizer from an ORTModelForXXX
>>> optimizer = ORTOptimizer.from_pretrained(model)

Using a local ONNX model from a directory.

>>> from optimum.onnxruntime import ORTOptimizer

>>> # This assumes a model.onnx exists in path/to/model
>>> optimizer = ORTOptimizer.from_pretrained("path/to/model")

Optimization Configuration

The OptimizationConfig class lets you specify how the ORTOptimizer should perform the optimization.

In the optimization configuration, there are 4 possible optimization levels:

optimization_level=0: to disable all optimizations
optimization_level=1: to enable basic optimizations such as constant folding or redundant node eliminations
optimization_level=2: to enable extended graph optimizations such as node fusions
optimization_level=99: to enable data layout optimizations

Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. More information is available in the ONNX Runtime documentation on graph optimizations.
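
The cumulative behaviour of the levels can be pictured with a small stand-alone sketch. It does not call Optimum or ONNX Runtime: the `LEVELS` table and the `enabled_optimizations` helper are hypothetical names introduced only for illustration, and the descriptions are copied from the list above.

```python
from typing import List

# Hypothetical illustration, not part of Optimum or ONNX Runtime.
# Descriptions are taken from the list of optimization levels above.
LEVELS = {
    0: None,  # level 0 disables all optimizations
    1: "basic optimizations (constant folding, redundant node eliminations)",
    2: "extended graph optimizations (node fusions)",
    99: "data layout optimizations",
}


def enabled_optimizations(optimization_level: int) -> List[str]:
    """Return the optimization groups active at a level, lower levels included."""
    if optimization_level not in LEVELS:
        raise ValueError(f"optimization_level must be one of {sorted(LEVELS)}")
    return [
        description
        for level, description in sorted(LEVELS.items())
        if description is not None and level <= optimization_level
    ]


for level in sorted(LEVELS):
    print(level, enabled_optimizations(level))
```

Under this reading, optimization_level=2 covers both the basic and the extended optimizations, while optimization_level=99 adds the data layout optimizations on top.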

enable_transformers_specific_optimizations=True means that transformers-specific graph fusion and approximation are performed in addition to the ONNX Runtime optimizations described above. Here is a list of the possible optimizations you can enable:

Gelu fusion with disable_gelu_fusion=False,
Layer Normalization fusion with disable_layer_norm_fusion=False,
Attention fusion with disable_attention_fusion=False,
SkipLayerNormalization fusion with disable_skip_layer_norm_fusion=False,
Add Bias and SkipLayerNormalization fusion with disable_bias_skip_layer_norm_fusion=False,
Add Bias and Gelu / FastGelu fusion with disable_bias_gelu_fusion=False,
Gelu approximation with enable_gelu_approximation=True.
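
Because most of these flags are phrased as `disable_*`, a fusion is active when its flag is False. The sketch below makes that explicit by building the keyword arguments from a list of fusions to turn off. The flag names come from the list above; the short labels in `FUSION_FLAGS` and the `fusion_kwargs` helper are hypothetical and exist only for this example.

```python
from typing import Dict, Iterable

# Hypothetical short labels mapped to the flag names listed above.
FUSION_FLAGS = {
    "gelu": "disable_gelu_fusion",
    "layer_norm": "disable_layer_norm_fusion",
    "attention": "disable_attention_fusion",
    "skip_layer_norm": "disable_skip_layer_norm_fusion",
    "bias_skip_layer_norm": "disable_bias_skip_layer_norm_fusion",
    "bias_gelu": "disable_bias_gelu_fusion",
}


def fusion_kwargs(
    turn_off: Iterable[str] = (), gelu_approximation: bool = False
) -> Dict[str, bool]:
    """Build keyword arguments; a fusion stays on when its disable_* flag is False."""
    turn_off = set(turn_off)
    unknown = turn_off - set(FUSION_FLAGS)
    if unknown:
        raise ValueError(f"Unknown fusion name(s): {sorted(unknown)}")
    kwargs = {flag: name in turn_off for name, flag in FUSION_FLAGS.items()}
    kwargs["enable_gelu_approximation"] = gelu_approximation
    return kwargs


# Example choice: keep every fusion except Attention fusion, no Gelu approximation.
kwargs = fusion_kwargs(turn_off=["attention"])
print(kwargs)
```

The resulting dictionary could then be unpacked into the configuration, for example `OptimizationConfig(optimization_level=2, enable_transformers_specific_optimizations=True, **kwargs)`.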

While OptimizationConfig gives you full control over how the optimization is done, it can be hard to know what to enable or disable. Instead, you can use AutoOptimizationConfig, which provides 4 common optimization levels:

O1: basic general optimizations.
O2: basic and extended general optimizations, transformers-specific fusions.
O3: same as O2 with Gelu approximation.
O4: same as O3 with mixed precision.

Example: Loading an O2 OptimizationConfig

>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2()

You can also specify custom arguments that are not defined in the O2 configuration, for instance:

>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2(disable_embed_layer_norm_fusion=False)
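
One way to think about the presets is as a base set of settings on which explicit keyword arguments take precedence. The sketch below is a stand-alone approximation reconstructed from the O1 to O4 descriptions above, not a copy of Optimum's internals: the `PRESETS` table and the `resolve` helper are hypothetical, and using an `fp16` key to represent mixed precision is an assumption.

```python
from typing import Any, Dict

# Hypothetical table reconstructed from the O1-O4 descriptions above.
# It is an assumption, not Optimum's actual implementation.
PRESETS: Dict[str, Dict[str, Any]] = {
    "O1": {"optimization_level": 1, "enable_transformers_specific_optimizations": False},
    "O2": {"optimization_level": 2, "enable_transformers_specific_optimizations": True},
}
# O3 is described as O2 plus Gelu approximation.
PRESETS["O3"] = {**PRESETS["O2"], "enable_gelu_approximation": True}
# O4 is described as O3 plus mixed precision ("fp16" key assumed here).
PRESETS["O4"] = {**PRESETS["O3"], "fp16": True}


def resolve(preset: str, **overrides: Any) -> Dict[str, Any]:
    """Start from a preset and let explicit keyword arguments take precedence."""
    if preset not in PRESETS:
        raise ValueError(f"preset must be one of {sorted(PRESETS)}")
    return {**PRESETS[preset], **overrides}


# Mirrors the AutoOptimizationConfig.O2(...) call above with one extra argument.
settings = resolve("O2", disable_embed_layer_norm_fusion=False)
print(settings)
```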

Optimization examples

Below you will find an easy end-to-end example of how to optimize distilbert-base-uncased-finetuned-sst-2-english.

>>> from optimum.onnxruntime import (
...     AutoOptimizationConfig, ORTOptimizer, ORTModelForSequenceClassification
... )

>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_optimized"

>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)

>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = AutoOptimizationConfig.O2()

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
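
After the optimization step, it can be useful to check what was written to the save directory. The helper below is a minimal sketch that uses only the standard library; `onnx_file_sizes` is a hypothetical name, and no assumption is made about the file names that `optimize()` produces.

```python
from pathlib import Path
from typing import Dict


def onnx_file_sizes(directory: str) -> Dict[str, float]:
    """Map each .onnx file name in `directory` to its size in megabytes."""
    root = Path(directory)
    if not root.is_dir():
        return {}  # nothing to report if the directory does not exist
    return {
        path.name: path.stat().st_size / 1e6
        for path in sorted(root.glob("*.onnx"))
    }


# "distilbert_optimized" is the save_dir used in the example above.
for name, size_mb in onnx_file_sizes("distilbert_optimized").items():
    print(f"{name}: {size_mb:.1f} MB")
```

Running the same helper on a directory that holds the unoptimized export gives a rough size comparison, although file size alone says nothing about inference speed.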

Below you will find an easy end-to-end example of how to optimize a Seq2Seq model, sshleifer/distilbart-cnn-12-6.

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OptimizationConfig, ORTOptimizer, ORTModelForSeq2SeqLM

>>> model_id = "sshleifer/distilbart-cnn-12-6"
>>> save_dir = "distilbart_optimized"

>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)

>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = OptimizationConfig(
...     optimization_level=2,
...     enable_transformers_specific_optimizations=True,
...     optimize_for_gpu=False,
... )

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)

>>> # Load the tokenizer and the optimized model, then generate from a sample input
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> optimized_model = ORTModelForSeq2SeqLM.from_pretrained(save_dir)
>>> tokens = tokenizer("This is a sample input", return_tensors="pt")
>>> outputs = optimized_model.generate(**tokens)
