Optimum documentation

Optimum Inference with OpenVINO

You are viewing v1.7.3 version. A newer version v1.18.1 is available.

Optimum Intel can be used to load optimized models from the Hugging Face Hub and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs.

Switching from Transformers to Optimum Inference

You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors (see the full list of supported devices). For that, just replace the AutoModelForXxx class with the corresponding OVModelForXxx class. To load a Transformers model and convert it to the OpenVINO format on-the-fly, you can set export=True when loading your model.

Here is an example of how to perform inference with OpenVINO Runtime for a text classification task:

- from transformers import AutoModelForSequenceClassification
+ from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
cls_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
outputs = cls_pipe("He's a dreadful magician.")

[{'label': 'NEGATIVE', 'score': 0.9919503927230835}]

To easily save the resulting model, you can use the save_pretrained() method, which will save both the BIN and XML files describing the graph.

# Save the exported model
save_directory = "a_local_path"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
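
As a quick sanity check after saving, the output directory can be scanned for IR files. The sketch below assumes only that each XML graph file sits next to a BIN weights file with the same stem. The helper name find_ir_pairs and the placeholder file names are hypothetical and not part of Optimum Intel.

```python
import tempfile
from pathlib import Path


def find_ir_pairs(directory):
    """Return the stems of .xml files that have a matching .bin file beside them."""
    root = Path(directory)
    return sorted(
        xml.stem for xml in root.glob("*.xml") if xml.with_suffix(".bin").exists()
    )


# Demonstration on a throwaway directory with empty placeholder files
# (these file names are illustrative, not names guaranteed by the library).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model.xml", "model.bin", "orphan.xml"):
        (Path(tmp) / name).touch()
    pairs = find_ir_pairs(tmp)

print(pairs)  # ['model']
```

Once the files are in place, the saved directory can be passed to from_pretrained() instead of the Hub model id, in which case export=True should no longer be needed.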

By default, OVModelForXxx models support dynamic shapes, so inputs of any shape are accepted. To speed up inference, static shapes can be enabled by specifying the desired input shapes.

# Fix the batch size to 1 and the sequence length to 9
model.reshape(1, 9)
# Compile the model before the first inference
model.compile()

Currently, OpenVINO only supports static shapes when running inference on Intel GPUs. FP16 precision can also be enabled in order to further decrease latency.

# Fix the batch size to 1 and the sequence length to 9
model.reshape(1, 9)
# Enable FP16 precision
model.half()
model.to("gpu")
# Compile the model before the first inference
model.compile()
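
Before moving a model to a GPU, it can be useful to check which devices OpenVINO actually reports. The following is a minimal sketch, assuming the openvino Python package exposes Core().available_devices under openvino.runtime; the helper names choose_device and list_openvino_devices are hypothetical, and the code falls back to "CPU" when no matching device (or no OpenVINO installation) is found.

```python
def choose_device(available, preferred="GPU"):
    """Return the first device whose name starts with `preferred`, else "CPU"."""
    for name in available:
        if name.upper().startswith(preferred.upper()):
            return name
    return "CPU"


def list_openvino_devices():
    """Ask OpenVINO for device names; return [] if the import is unavailable."""
    try:
        from openvino.runtime import Core  # assumed import path
    except ImportError:
        return []
    return list(Core().available_devices)


device = choose_device(list_openvino_devices())
print(device)
```

The resulting name could then be passed to model.to(device) before calling model.compile().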

When fixing the shapes with the reshape() method, inference cannot be performed with an input of a different shape. When instantiating your pipeline, you can specify the maximum total input sequence length after tokenization in order for shorter sequences to be padded and for longer sequences to be truncated.

from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from evaluate import evaluator
from optimum.intel import OVModelForQuestionAnswering

model_id = "distilbert-base-cased-distilled-squad"
model = OVModelForQuestionAnswering.from_pretrained(model_id, export=True)
model.reshape(1, 384)
tokenizer = AutoTokenizer.from_pretrained(model_id)
eval_dataset = load_dataset("squad", split="validation").select(range(50))
task_evaluator = evaluator("question-answering")
qa_pipe = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    max_seq_len=384,
    padding="max_length",
    truncation=True,
)
metric = task_evaluator.compute(model_or_pipeline=qa_pipe, data=eval_dataset, metric="squad")
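
To make the padding and truncation behaviour concrete, here is a simplified, library-free sketch of how a tokenized input can be forced to a fixed length so that it matches a statically reshaped model. The function pad_or_truncate and the pad id of 0 are illustrative assumptions; real tokenizers also take care of special tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly `max_length` long."""
    clipped = list(token_ids[:max_length])       # cut inputs that are too long
    n_pad = max_length - len(clipped)            # how many pad tokens are needed
    ids = clipped + [pad_id] * n_pad             # right-pad short inputs
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2002, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=5)

print(short_ids, short_mask)  # [101, 2002, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [0, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

Either way, every input ends up with the same sequence length, which is what a model reshaped with reshape(1, 384) expects.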

By default, the model is compiled when the OVModel is instantiated. If the model is then reshaped, moved to another device, or switched to FP16 precision, it needs to be recompiled, which by default happens just before the first inference (thus inflating the latency of that first inference). To avoid an unnecessary compilation, you can disable the initial compilation by setting compile=False when loading the model. After applying your changes, compile the model explicitly with model.compile() before the first inference.

from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the model and disable the model compilation
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True, compile=False)
model.half()
# Compile the model before the first inference
model.compile()
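
The effect of a deferred compilation on first-call latency can be observed with a small timing helper. The sketch below does not use OpenVINO at all: LazyModel is a hypothetical stand-in whose first call performs a slow one-off setup step, which mimics a model that is compiled lazily at the first inference.

```python
import time


def time_calls(fn, n_calls=3):
    """Call `fn` repeatedly and return the latency of each call in seconds."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


class LazyModel:
    """Stand-in for a model that does its expensive setup on the first call."""

    def __init__(self):
        self.compiled = False

    def __call__(self):
        if not self.compiled:
            time.sleep(0.05)  # simulated compilation cost
            self.compiled = True


latencies = time_calls(LazyModel())
print([f"{t * 1000:.2f} ms" for t in latencies])
```

With a real model, calling model.compile() before the timed loop should move that one-off cost out of the first measured inference.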

Export and inference of sequence-to-sequence models

Sequence-to-sequence (Seq2Seq) models, which generate a new sequence from an input, can also be used when running inference with OpenVINO. When Seq2Seq models are exported to the OpenVINO IR, they are decomposed into two parts: the encoder and the “decoder” (which actually consists of the decoder with the language modeling head). The two parts are later combined during inference.

To leverage the pre-computed key/values hidden-states to speed up sequential decoding, simply pass use_cache=True to the from_pretrained() method. An additional model component will then be exported: the “decoder” with pre-computed key/values as one of its inputs. This separate export is needed because during the first pass the decoder has no pre-computed key/values hidden-states, while during the rest of the generation the past key/values are reused to speed up sequential decoding.

Here is an example of how you can export a T5 model to the OpenVINO IR on the fly, run inference for a translation task, and save the exported model:

from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSeq2SeqLM

model_id = "t5-small"
model = OVModelForSeq2SeqLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
translation_pipe = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
text = "He never went out without a book under his arm, and he often came back with two."
result = translation_pipe(text)

# Save the exported model
save_directory = "a_local_path"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

[{'translation_text': "Il n'est jamais sorti sans un livre sous son bras, et il est souvent revenu avec deux."}]
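
To see why reusing past key/values helps, consider a toy count of how many token positions the decoder has to process over a whole generation. This is only an illustrative arithmetic sketch under a simplifying assumption (one unit of work per processed token position); it is not a measurement of OpenVINO and it ignores the attention cost, which still grows with the sequence length.

```python
def positions_without_cache(num_steps):
    """Without a cache, step t reprocesses all t tokens generated so far."""
    return sum(range(1, num_steps + 1))


def positions_with_cache(num_steps):
    """With cached key/values, each step only processes the newest token."""
    return num_steps


steps = 20
print(positions_without_cache(steps))  # 210
print(positions_with_cache(steps))     # 20
```

In this simplified model the work grows quadratically with the number of generated tokens without a cache and linearly with one.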

Export and inference of Stable Diffusion models

Stable Diffusion models can also be used when running inference with OpenVINO. When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into four components that are later combined during inference:

  • The text encoder
  • The U-Net
  • The VAE encoder
  • The VAE decoder

Make sure you have 🤗 Diffusers installed.

To install diffusers:

pip install diffusers

Here is an example of how you can load an OpenVINO Stable Diffusion model and run inference using OpenVINO Runtime:

from optimum.intel import OVStableDiffusionPipeline

model_id = "echarlaix/stable-diffusion-v1-5-openvino"
stable_diffusion = OVStableDiffusionPipeline.from_pretrained(model_id)
prompt = "sailing ship in storm by Rembrandt"
images = stable_diffusion(prompt).images

To load your PyTorch model and convert it to OpenVINO on-the-fly, you can set export=True.

model_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
# Don't forget to save the exported model
stable_diffusion.save_pretrained("a_local_path")

To further speed up inference, the model can be statically reshaped:

# Define the shapes related to the inputs and desired outputs
batch_size = 1
num_images_per_prompt = 1
height = 512
width = 512

# Statically reshape the model
stable_diffusion.reshape(batch_size=batch_size, height=height, width=width, num_images_per_prompt=num_images_per_prompt)
# Compile the model before the first inference
stable_diffusion.compile()

# Run inference
images = stable_diffusion(prompt, height=height, width=width, num_images_per_prompt=num_images_per_prompt).images

If you want to change any of these parameters, such as the output height or width, you’ll need to statically reshape your model again.
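
The reason a new reshape is required is that the tensor shapes inside the pipeline depend directly on these parameters. As a rough sketch, assuming the usual Stable Diffusion v1.x convention of 4 latent channels and a VAE that downsamples by a factor of 8, the latent tensor shape can be computed as follows. The helper latent_shape is hypothetical, and the batch doubling used by classifier-free guidance is deliberately left out.

```python
def latent_shape(batch_size, num_images_per_prompt, height, width,
                 vae_scale_factor=8, latent_channels=4):
    """Return the (batch, channels, height, width) shape of the latent tensor."""
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError(
            f"height and width must be multiples of {vae_scale_factor}"
        )
    return (
        batch_size * num_images_per_prompt,
        latent_channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


print(latent_shape(1, 1, 512, 512))  # (1, 4, 64, 64)
print(latent_shape(1, 2, 768, 512))  # (2, 4, 96, 64)
```

Changing the height, the width, or the number of images per prompt changes this shape, so a model compiled for one static shape cannot be reused for another.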

Supported tasks

As shown in the table below, each task is associated with a class that can be used to load your model automatically.

Task                      Auto Class
sequence-classification   OVModelForSequenceClassification
token-classification      OVModelForTokenClassification
question-answering        OVModelForQuestionAnswering
audio-classification      OVModelForAudioClassification
image-classification      OVModelForImageClassification
feature-extraction        OVModelForFeatureExtraction
masked-lm                 OVModelForMaskedLM
causal-lm                 OVModelForCausalLM
seq2seq-lm                OVModelForSeq2SeqLM
text-to-image             OVStableDiffusionPipeline
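
The same mapping can be expressed in code when the task is only known at run time. The sketch below copies the class names from the table above; the helpers class_name_for_task and load_class_for_task are hypothetical conveniences, not functions provided by Optimum Intel, and the actual import of optimum.intel is deferred until a class is requested.

```python
import importlib

# Task names and class names taken from the table above.
TASK_TO_CLASS_NAME = {
    "sequence-classification": "OVModelForSequenceClassification",
    "token-classification": "OVModelForTokenClassification",
    "question-answering": "OVModelForQuestionAnswering",
    "audio-classification": "OVModelForAudioClassification",
    "image-classification": "OVModelForImageClassification",
    "feature-extraction": "OVModelForFeatureExtraction",
    "masked-lm": "OVModelForMaskedLM",
    "causal-lm": "OVModelForCausalLM",
    "seq2seq-lm": "OVModelForSeq2SeqLM",
    "text-to-image": "OVStableDiffusionPipeline",
}


def class_name_for_task(task):
    """Look up the class name for a task, with a readable error for unknown tasks."""
    try:
        return TASK_TO_CLASS_NAME[task]
    except KeyError:
        known = ", ".join(sorted(TASK_TO_CLASS_NAME))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}") from None


def load_class_for_task(task):
    """Import optimum.intel lazily and return the class for `task`."""
    module = importlib.import_module("optimum.intel")
    return getattr(module, class_name_for_task(task))


print(class_name_for_task("question-answering"))  # OVModelForQuestionAnswering
```

With optimum-intel installed, load_class_for_task("question-answering").from_pretrained(model_id, export=True) would then be equivalent to using the class directly.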