Optimum Inference with ONNX Runtime
Optimum is a utility package for building and running inference with accelerated runtimes such as ONNX Runtime. Optimum can be used to load optimized models from the Hugging Face Hub and to create pipelines that run accelerated inference without rewriting your APIs.
Switching from Transformers to Optimum Inference
The Optimum Inference models are API compatible with Hugging Face Transformers models. This means you can simply replace your AutoModelForXxx class with the corresponding ORTModelForXxx class in optimum. For example, this is how you can use a question answering model in optimum:
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering
-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # pytorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # onnx checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones. Simply pass from_transformers=True to the from_pretrained() method, and your model will be loaded and converted to ONNX on the fly:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
# load model from hub and convert
>>> model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# create pipeline
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> result = onnx_classifier("This is a great model")
[{'label': 'POSITIVE', 'score': 0.9998838901519775}]
You can find a complete walkthrough of Optimum Inference for ONNX Runtime in this notebook.
Working with the Hugging Face Model Hub
The Optimum model classes like ORTModelForSequenceClassification are integrated with the Hugging Face Model Hub, which means you can not only load models from the Hub, but also push your models to the Hub with the push_to_hub() method. Below is an example which downloads a vanilla Transformers model from the Hub, converts it to an Optimum ONNX Runtime model, and pushes it back to a new repository.
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
# load model from hub and convert
>>> model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# save converted model
>>> model.save_pretrained("a_local_path_for_convert_onnx_model")
>>> tokenizer.save_pretrained("a_local_path_for_convert_onnx_model")
# push the converted ONNX model to the HF Hub
>>> model.push_to_hub("a_local_path_for_convert_onnx_model",
...                    repository_id="my-onnx-repo",
...                    use_auth_token=True)
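Once the push completes, the converted model can be loaded straight from the Hub like any other checkpoint. The repository name below is the one created above; the username/ prefix is a placeholder for your own namespace:
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
# load the ONNX model back from the Hub (replace "username" with your own namespace)
>>> model = ORTModelForSequenceClassification.from_pretrained("username/my-onnx-repo")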
ORTModel
Base ORTModel class for implementing models using ONNX Runtime. The ORTModel implements generic methods for interacting with the Hugging Face Hub as well as exporting vanilla Transformers models to ONNX using the transformers.onnx toolchain.
The ORTModel additionally implements generic methods for optimizing and quantizing ONNX models.
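To give a rough idea of what quantizing an exported ONNX model involves, the sketch below applies dynamic quantization directly with ONNX Runtime's own quantization tooling. The file paths are placeholders for a locally exported model; Optimum's dedicated quantization classes offer a higher-level interface over this kind of workflow.
from onnxruntime.quantization import QuantType, quantize_dynamic

# dynamically quantize the exported weights to int8
# (paths are placeholders for a locally exported ONNX model)
quantize_dynamic(
    model_input="a_local_path_for_convert_onnx_model/model.onnx",
    model_output="a_local_path_for_convert_onnx_model/model_quantized.onnx",
    weight_type=QuantType.QInt8,
)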
load_model
< source >( path: typing.Union[str, pathlib.Path] provider = None )
Loads an ONNX Runtime inference session with the given provider. The default provider is CUDAExecutionProvider if a GPU is available, otherwise CPUExecutionProvider.
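For instance, a session can be created explicitly for a local ONNX file with the CPU provider; the file path below is an assumption for a previously exported model:
>>> from optimum.onnxruntime.modeling_ort import ORTModel
# load an inference session for a local ONNX file, forcing the CPU provider
>>> session = ORTModel.load_model(
...     "a_local_path_for_convert_onnx_model/model.onnx",
...     provider="CPUExecutionProvider",
... )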
ORTModelForFeatureExtraction
class optimum.onnxruntime.ORTModelForFeatureExtraction
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model for feature-extraction tasks, outputting the raw hidden states without any task-specific head on top.
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Feature Extraction model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForFeatureExtraction forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of feature extraction:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> list(last_hidden_state.shape)
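Since all-MiniLM-L6-v2 is a sentence-embedding model, the token-level hidden states are usually mean-pooled into a single vector per sentence. A minimal sketch, assuming the inputs and outputs from the example above:
>>> # mean-pool the token embeddings into one sentence vector, ignoring padding positions
>>> token_embeddings = outputs.last_hidden_state
>>> mask = inputs["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
>>> sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
>>> list(sentence_embedding.shape)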
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_extractor(text)
ORTModelForQuestionAnswering
class optimum.onnxruntime.ORTModelForQuestionAnswering
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model with a QuestionAnsweringModelOutput for extractive question-answering tasks like SQuAD.
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Question Answering model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForQuestionAnswering forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of question answering:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> outputs = model(**inputs)
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
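To turn the start and end logits into an answer string, a common follow-up is to take the argmax of each and decode the corresponding token span; a minimal sketch, assuming the inputs and outputs from the example above:
>>> # pick the most likely start/end positions and decode the answer span
>>> answer_start = torch.argmax(start_scores, dim=-1).item()
>>> answer_end = torch.argmax(end_scores, dim=-1).item() + 1
>>> answer = tokenizer.decode(inputs["input_ids"][0, answer_start:answer_end])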
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> pred = onnx_qa(question, text)
ORTModelForSequenceClassification
class optimum.onnxruntime.ORTModelForSequenceClassification
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Sequence Classification model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForSequenceClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of single-label classification:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
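The logits can be mapped to a human-readable label through the model config; a small sketch, assuming the config carries the usual id2label mapping:
>>> # map the highest-scoring class index to its label name
>>> predicted_class_id = logits.argmax(dim=-1).item()
>>> model.config.id2label[predicted_class_id]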
Example using transformers.pipelines:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> text = "Hello, my dog is cute"
>>> pred = onnx_classifier(text)
Example using transformers.pipelines for zero-shot classification:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> onnx_z0 = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)
>>> sequence_to_classify = "Who are you voting for in 2020?"
>>> candidate_labels = ["Europe", "public health", "politics", "elections"]
>>> pred = onnx_z0(sequence_to_classify, candidate_labels, multi_label=True)
ORTModelForTokenClassification
class optimum.onnxruntime.ORTModelForTokenClassification
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Token Classification model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForTokenClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of token classification:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
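The per-token logits can be turned into entity tags by taking the argmax over the class dimension and looking up each index in the config; a minimal sketch, assuming the inputs and outputs from the example above:
>>> # map each token's highest-scoring class index to its entity label
>>> predictions = logits.argmax(dim=-1)
>>> labels = [model.config.id2label[token_id.item()] for token_id in predictions[0]]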
Example using transformers.pipelines:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> onnx_ner = pipeline("token-classification", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_ner(text)
ORTModelForCausalLM
class optimum.onnxruntime.ORTModelForCausalLM
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model with a causal language modeling head on top (linear layer with weights tied to the input embeddings).
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Causal LM model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of text generation:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs, do_sample=True, temperature=0.9, min_length=20, max_length=20)
>>> tokenizer.batch_decode(gen_tokens)
Example using transformers.pipelines:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> onnx_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> gen = onnx_gen(text)