This quick tour is intended for developers who are ready to dive into the code and see examples of how to integrate 🤗 Optimum into their model training and inference workflows.
To accelerate inference with ONNX Runtime, 🤗 Optimum uses configuration objects to define parameters for graph optimization and quantization. These objects are then used to instantiate dedicated optimizers and quantizers.
Before applying quantization or optimization, we first need to load our model. To load a model and run inference with ONNX Runtime, you can just replace the canonical Transformers AutoModelForXxx class with the corresponding ORTModelForXxx class. If you want to load from a PyTorch checkpoint, set from_transformers=True to export your model to the ONNX format.
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer

    model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
    save_directory = "tmp/onnx/"

    # Load a model from transformers and export it to ONNX
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, from_transformers=True)

    # Save the ONNX model and tokenizer
    ort_model.save_pretrained(save_directory)
    tokenizer.save_pretrained(save_directory)
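After saving, it can be handy to check which ONNX graphs ended up in the save directory, since later steps refer to them by file name. The helper below is a minimal sketch using only the standard library; the name list_onnx_files is hypothetical and not part of 🤗 Optimum, and the exact file names you will see (typically model.onnx after an export) depend on your Optimum version.

```python
from pathlib import Path


def list_onnx_files(directory):
    """Return {file name: size in MiB} for each .onnx file directly inside `directory`.

    Hypothetical helper for illustration; it is not part of Optimum.
    """
    root = Path(directory)
    if not root.is_dir():
        # Nothing has been saved there yet.
        return {}
    sizes = {}
    for path in sorted(root.glob("*.onnx")):
        sizes[path.name] = path.stat().st_size / (1024 * 1024)
    return sizes
```

Calling list_onnx_files("tmp/onnx/") before and after quantization is a quick way to compare the size of the exported and quantized graphs.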
Let’s see now how we can apply dynamic quantization with ONNX Runtime:
    from optimum.onnxruntime.configuration import AutoQuantizationConfig
    from optimum.onnxruntime import ORTQuantizer

    # Define the quantization methodology
    qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
    quantizer = ORTQuantizer.from_pretrained(ort_model)

    # Apply dynamic quantization on the model
    quantizer.quantize(save_dir=save_directory, quantization_config=qconfig)
In this example, we’ve quantized a model from the Hugging Face Hub. In the same manner, we can quantize a model hosted locally by providing the path to the directory containing the model weights. The result of applying the quantize() method is a model_quantized.onnx file that can be used to run inference. Here’s an example of how to load an ONNX Runtime model and generate predictions with it:
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import pipeline, AutoTokenizer

    model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name="model_quantized.onnx")
    tokenizer = AutoTokenizer.from_pretrained(save_directory)
    classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
    results = classifier("I love burritos!")
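For intuition about what 8-bit quantization does to a tensor, the sketch below implements the textbook per-tensor affine mapping between floats and unsigned integers in plain Python. It is an illustration under simplifying assumptions, not the code ONNX Runtime executes: the function names and the num_bits parameter are hypothetical, and the real quantizer makes further choices (signed or unsigned types, per-channel scales, operator coverage) that are not modelled here. With dynamic quantization, the ranges for activations are computed at inference time, while weights are quantized ahead of time.

```python
def quantize_per_tensor(values, num_bits=8):
    """Map floats to unsigned integers with a single scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is represented exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [
        min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.8, 0.0, 0.3, 1.2]
q, scale, zero_point = quantize_per_tensor(weights)
restored = dequantize(q, scale, zero_point)
print(q, scale, zero_point)
print(restored)
```

Each restored value differs from the original by at most about one quantization step (scale), which is the precision traded for a smaller, faster model.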
To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. If you want to load a PyTorch checkpoint, set from_transformers=True to convert your model to the OpenVINO IR (Intermediate Representation).
    - from transformers import AutoModelForSequenceClassification
    + from optimum.intel.openvino import OVModelForSequenceClassification
      from transformers import AutoTokenizer, pipeline

      # Download a tokenizer and model from the Hub and convert to OpenVINO format
      model_id = "distilbert-base-uncased-finetuned-sst-2-english"
      tokenizer = AutoTokenizer.from_pretrained(model_id)
    - model = AutoModelForSequenceClassification.from_pretrained(model_id)
    + model = OVModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)

      # Run inference!
      classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
      results = classifier("He's a dreadful magician.")
To train transformers on Habana’s Gaudi processors, 🤗 Optimum provides a GaudiTrainer that is very similar to the 🤗 Transformers Trainer. Here is a simple example:
    - from transformers import Trainer, TrainingArguments
    + from optimum.habana import GaudiTrainer, GaudiTrainingArguments

      # Download a pretrained model from the Hub
      model = AutoModelForXxx.from_pretrained("bert-base-uncased")

      # Define the training arguments
    - training_args = TrainingArguments(
    + training_args = GaudiTrainingArguments(
          output_dir="path/to/save/folder/",
    +     use_habana=True,
    +     use_lazy_mode=True,
    +     gaudi_config_name="Habana/bert-base-uncased",
          ...
      )

      # Initialize the trainer
    - trainer = Trainer(
    + trainer = GaudiTrainer(
          model=model,
          args=training_args,
          train_dataset=train_dataset,
          ...
      )

      # Use Habana Gaudi processor for training!
      trainer.train()
To train transformers on Graphcore’s IPUs, 🤗 Optimum provides an IPUTrainer that is very similar to the 🤗 Transformers Trainer. Here is a simple example:
    - from transformers import Trainer, TrainingArguments
    + from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

      # Download a pretrained model from the Hub
      model = AutoModelForXxx.from_pretrained("bert-base-uncased")

      # Define the training arguments
    - training_args = TrainingArguments(
    + training_args = IPUTrainingArguments(
          output_dir="path/to/save/folder/",
    +     ipu_config_name="Graphcore/bert-base-ipu", # Any IPUConfig on the Hub or stored locally
          ...
      )

      # Define the configuration to compile and put the model on the IPU
    + ipu_config = IPUConfig.from_pretrained(training_args.ipu_config_name)

      # Initialize the trainer
    - trainer = Trainer(
    + trainer = IPUTrainer(
          model=model,
    +     ipu_config=ipu_config,
          args=training_args,
          train_dataset=train_dataset,
          ...
      )

      # Use Graphcore IPU for training!
      trainer.train()
To train transformers with ONNX Runtime’s acceleration features, 🤗 Optimum provides an ORTTrainer that is very similar to the 🤗 Transformers Trainer. Here is a simple example:
    - from transformers import Trainer, TrainingArguments
    + from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

      # Download a pretrained model from the Hub
      model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

      # Define the training arguments
    - training_args = TrainingArguments(
    + training_args = ORTTrainingArguments(
          output_dir="path/to/save/folder/",
          optim="adamw_ort_fused",
          ...
      )

      # Create an ONNX Runtime Trainer
    - trainer = Trainer(
    + trainer = ORTTrainer(
          model=model,
          args=training_args,
          train_dataset=train_dataset,
    +     feature="sequence-classification", # The model type to export to ONNX
          ...
      )

      # Use ONNX Runtime for training!
      trainer.train()
The Optimum library handles the ONNX export of Transformers and Diffusers models out of the box!
Exporting a model to ONNX is as simple as:

    optimum-cli export onnx --model gpt2 gpt2_onnx/

Check out the help for more options:

    optimum-cli export onnx --help
Check out the documentation for more.
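If you want to trigger the same export from a Python script, one option is to call the CLI through subprocess. The sketch below is an assumption-laden illustration: build_export_command and export_to_onnx are hypothetical helper names, and the optional --task flag should be confirmed against the --help output of your installed version, as should the task names it accepts.

```python
import subprocess


def build_export_command(model, output_dir, task=None):
    """Assemble the argument list for `optimum-cli export onnx`."""
    command = ["optimum-cli", "export", "onnx", "--model", model]
    if task is not None:
        # Assumed optional flag; check `optimum-cli export onnx --help`.
        command += ["--task", task]
    command.append(output_dir)
    return command


def export_to_onnx(model, output_dir, task=None):
    """Run the export and raise CalledProcessError if the CLI fails."""
    subprocess.run(build_export_command(model, output_dir, task), check=True)
```

For example, export_to_onnx("gpt2", "gpt2_onnx/") runs the same command as the one shown above.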
BetterTransformer is a free-lunch PyTorch-native optimization that yields a 1.25x to 4x speedup on the inference of Transformer-based models. It has been marked as stable in PyTorch 1.13. We integrated BetterTransformer with the most-used models from the 🤗 Transformers library, and using the integration is as simple as:
    from optimum.bettertransformer import BetterTransformer
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
    model = BetterTransformer.transform(model)
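The speedup you actually observe depends on the model, batch size, sequence length, and hardware, so it is worth measuring. The following is a minimal, library-agnostic timing sketch; measure_latency and speedup are hypothetical helper names, and the callables you pass in (for instance, a function that runs a forward pass of the original model and one that runs the transformed model) are up to you.

```python
import statistics
import time


def measure_latency(fn, warmup=3, runs=20):
    """Return the median wall-clock time in seconds of calling `fn()`."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_fn, optimized_fn, warmup=3, runs=20):
    """Ratio of baseline latency to optimized latency (above 1 means faster)."""
    baseline = measure_latency(baseline_fn, warmup=warmup, runs=runs)
    optimized = measure_latency(optimized_fn, warmup=warmup, runs=runs)
    return baseline / optimized
```

The median is used rather than the mean so that occasional slow runs do not dominate the result. When timing on a GPU, remember that kernels run asynchronously, so the callable should wait for the device to finish before returning.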
Optimum integrates with torch.fx, providing several graph transformations as one-liners. We aim to support better management of quantization through torch.fx, both for quantization-aware training (QAT) and post-training quantization (PTQ).