Accelerate your models with 🤗 Optimum Intel and OpenVINO

Published November 2, 2022

Last July, we announced that Intel and Hugging Face would collaborate on building state-of-the-art yet simple hardware acceleration tools for Transformer models. Today, we are very happy to announce that we added Intel OpenVINO to Optimum Intel. You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors (see the full list of supported devices) using Transformers models which can be hosted either on the Hugging Face hub or locally. You can also quantize your model with the OpenVINO Neural Network Compression Framework (NNCF), and reduce its size and prediction latency in minutes.

This first release is based on OpenVINO 2022.2 and enables inference for a large number of PyTorch models using our OVModels. Post-training static quantization and quantization-aware training can be applied to many encoder models (BERT, DistilBERT, etc.). More encoder models will be supported in the upcoming OpenVINO release. Quantization of encoder-decoder models is not yet enabled, but this restriction should be lifted with our integration of the next OpenVINO release.

Let us show you how to get started in minutes!

Quantizing a Vision Transformer with Optimum Intel and OpenVINO

In this example, we will run post-training static quantization on a Vision Transformer (ViT) model fine-tuned for image classification on the food101 dataset.

Quantization is a process that lowers memory and compute requirements by reducing the bit width of model parameters. Reducing the number of bits means that the resulting model requires less memory at inference time, and that operations like matrix multiplication can be performed faster thanks to integer arithmetic.
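To make the arithmetic concrete, here is a toy NumPy sketch of the affine quantization scheme commonly used for 8-bit inference. It is not part of the Optimum API, just an illustration of how a float32 tensor is mapped to 8-bit integers with a scale and a zero point:

import numpy as np

# Affine quantization: real_value ≈ scale * (int_value - zero_point)
x = np.random.randn(4, 4).astype(np.float32)

qmin, qmax = 0, 255  # uint8 range
scale = (x.max() - x.min()) / (qmax - qmin)
zero_point = int(round(qmin - x.min() / scale))

x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
x_dq = scale * (x_q.astype(np.float32) - zero_point)  # approximate reconstruction
print("max absolute error:", np.abs(x - x_dq).max())

Each float32 parameter occupies 4 bytes while its 8-bit counterpart occupies 1, which is where most of the memory savings come from.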

First, let's create a virtual environment and install all dependencies.

virtualenv openvino
source openvino/bin/activate
pip install pip --upgrade
pip install optimum[openvino,nncf] torchvision evaluate

Next, in a Python environment, we import the appropriate modules and download the original model as well as its processor.

from transformers import AutoImageProcessor, AutoModelForImageClassification
model_id = "juliensimon/autotrain-food101-1471154050"
model = AutoModelForImageClassification.from_pretrained(model_id)
processor = AutoImageProcessor.from_pretrained(model_id)

Post-training static quantization requires a calibration step where data is fed through the network in order to compute the quantized activation parameters. Here, we take 300 samples from the original dataset to build the calibration dataset.

from optimum.intel.openvino import OVQuantizer

quantizer = OVQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset("food101", num_samples=300, dataset_split="train")

As usual with image datasets, we need to apply the same image transformations that were used at training time. We use the preprocessing defined in the processor and apply it to the calibration dataset. We also define a data collation function to feed the model batches of properly formatted tensors.

import torch
from torchvision.transforms import CenterCrop, Compose, Normalize, Resize, ToTensor

normalize = Normalize(mean=processor.image_mean, std=processor.image_std)
size = processor.size["height"]
_val_transforms = Compose([Resize(size), CenterCrop(size), ToTensor(), normalize])

def val_transforms(example_batch):
    example_batch["pixel_values"] = [_val_transforms(pil_img.convert("RGB")) for pil_img in example_batch["image"]]
    return example_batch

calibration_dataset.set_transform(val_transforms)

def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

For our first attempt, we use the default configuration for quantization. You can also specify the number of samples to use during the calibration step, which defaults to 300.

from optimum.intel.openvino import OVConfig

quantization_config = OVConfig()
quantization_config.compression["initializer"]["range"]["num_init_samples"] = 300

We're now ready to quantize the model. The OVQuantizer.quantize() method quantizes the model and exports it to the OpenVINO format. The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can run on any target Intel® device.

save_dir = "quantized_model"

# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(quantization_config=quantization_config, calibration_dataset=calibration_dataset,
                   data_collator=collate_fn, remove_unused_columns=False, save_directory=save_dir)
processor.save_pretrained(save_dir)
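To double-check the export, you can list the contents of the output directory. The exact file names depend on the optimum-intel version, so treat this as a quick sanity check:

import os

# Expect the network topology (.xml), the weights (.bin) and the model config
print(sorted(os.listdir(save_dir)))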

A minute or two later, the model has been quantized. We can then easily load it with our OVModelForXxx classes, the equivalent of the AutoModelForXxx classes found in the transformers library. Likewise, we can create pipelines and run inference with OpenVINO Runtime.

from transformers import pipeline
from optimum.intel.openvino import OVModelForImageClassification
ov_model = OVModelForImageClassification.from_pretrained(save_dir)
ov_pipe = pipeline("image-classification", model=ov_model, image_processor=processor)
outputs = ov_pipe("sample_image.jpg")  # placeholder path: replace with your own image file or URL
print(outputs)

To verify that quantization did not have a negative impact on accuracy, we applied an evaluation step to compare the accuracy of the original model with its quantized counterpart. We evaluate both models on a subset of the dataset (taking only 20% of the evaluation dataset). We observed little to no loss in accuracy, with both models reaching an accuracy of 87.6%.

from datasets import load_dataset
from evaluate import evaluator

# We run the evaluation step on 20% of the evaluation dataset
eval_dataset = load_dataset("food101", split="validation").select(range(5050))
task_evaluator = evaluator("image-classification")

ov_eval_results = task_evaluator.compute(model_or_pipeline=ov_pipe, data=eval_dataset,
                                         metric="accuracy", label_mapping=ov_pipe.model.config.label2id)

trfs_pipe = pipeline("image-classification", model=model, image_processor=processor)
trfs_eval_results = task_evaluator.compute(model_or_pipeline=trfs_pipe, data=eval_dataset,
                                           metric="accuracy", label_mapping=trfs_pipe.model.config.label2id)
print(trfs_eval_results, ov_eval_results)

Looking at the quantized model, we see that its memory footprint decreased by 3.8x, from 344 MB to 90 MB. Running a quick benchmark on 5,050 image predictions, we also notice a 2.4x speedup in latency, from 98 ms to 41 ms per sample. That's not bad for a few lines of code!

โš ๏ธ An important thing to mention is that the model is compiled just before the first inference, which will inflate the latency of the first inference. So before doing your own benchmark, make sure to first warmup your model by doing at least one prediction.

You can find the resulting model hosted on the Hugging Face hub, and you can load it as follows:

from optimum.intel.openvino import OVModelForImageClassification

ov_model = OVModelForImageClassification.from_pretrained("echarlaix/vit-food101-int8")
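As before, the model can be dropped straight into a pipeline. Here is a short usage sketch, assuming the repository also hosts the image processor configuration (the image path is again a placeholder):

from transformers import AutoImageProcessor, pipeline

# Assumes the repo includes a preprocessor_config.json exported alongside the model
processor = AutoImageProcessor.from_pretrained("echarlaix/vit-food101-int8")
ov_pipe = pipeline("image-classification", model=ov_model, image_processor=processor)
print(ov_pipe("sample_image.jpg"))  # placeholder path: replace with your own image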

Now it's your turn

As you can see, it's pretty easy to accelerate your models with 🤗 Optimum Intel and OpenVINO. If you'd like to get started, please visit the Optimum Intel repository, and don't forget to give it a star ⭐. You'll also find additional examples there. If you'd like to dive deeper into OpenVINO, the Intel documentation has you covered.

Give it a try and let us know what you think. We'd love to hear your feedback on the Hugging Face forum, and please feel free to request features or file issues on GitHub.

Have fun with 🤗 Optimum Intel, and thank you for reading.