Optimum Inference with ONNX Runtime
Optimum is a utility package for building and running inference with accelerated runtimes such as ONNX Runtime. Optimum can be used to load optimized models from the Hugging Face Hub and to create pipelines that run accelerated inference without rewriting your APIs.
Switching from Transformers to Optimum Inference
The Optimum Inference models are API compatible with Hugging Face Transformers models. This means you can simply replace your AutoModelForXxx class with the corresponding ORTModelForXxx class in optimum. For example, this is how you can use a question answering model in optimum:
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering
-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # pytorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # onnx checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones. Simply pass from_transformers=True to the from_pretrained() method, and your model will be loaded and converted to ONNX on the fly:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
# load model from hub and convert
>>> model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# create pipeline
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> result = onnx_classifier("This is a great model")
[{'label': 'POSITIVE', 'score': 0.9998838901519775}]
You can find a complete walkthrough of Optimum Inference for ONNX Runtime in this notebook.
Working with the Hugging Face Model Hub
The Optimum model classes like ORTModelForSequenceClassification are integrated with the Hugging Face Model Hub, which means you can not only load models from the Hub, but also push your models to the Hub with the push_to_hub() method. Below is an example which downloads a vanilla Transformers model from the Hub, converts it to an Optimum ONNX Runtime model, and pushes it back to a new repository.
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
# load model from hub and convert
>>> model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# save converted model
>>> model.save_pretrained("a_local_path_for_convert_onnx_model")
>>> tokenizer.save_pretrained("a_local_path_for_convert_onnx_model")
# push the converted ONNX model to the HF Hub
>>> model.push_to_hub("a_local_path_for_convert_onnx_model",
...                    repository_id="my-onnx-repo",
...                    use_auth_token=True)
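Once the push completes, the converted model can be loaded straight from the Hub like any other checkpoint. The repository name below is the one created above; the username/ prefix is a placeholder for your own namespace:
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
# load the ONNX model back from the Hub (replace "username" with your own namespace)
>>> model = ORTModelForSequenceClassification.from_pretrained("username/my-onnx-repo")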
ORTModel
Base ORTModel class for implementing models using ONNX Runtime. The ORTModel implements generic methods for interacting with the Hugging Face Hub as well as exporting vanilla Transformers models to ONNX using the transformers.onnx toolchain.
The ORTModel additionally implements generic methods for optimizing and quantizing ONNX models.
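To give a rough idea of what quantizing an exported ONNX model involves, the sketch below applies dynamic quantization directly with ONNX Runtime's own quantization tooling. The file paths are placeholders for a locally exported model; Optimum's dedicated quantization classes offer a higher-level interface over this kind of workflow.
from onnxruntime.quantization import QuantType, quantize_dynamic

# dynamically quantize the exported weights to int8
# (paths are placeholders for a locally exported ONNX model)
quantize_dynamic(
    model_input="a_local_path_for_convert_onnx_model/model.onnx",
    model_output="a_local_path_for_convert_onnx_model/model_quantized.onnx",
    weight_type=QuantType.QInt8,
)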
load_model
< source >( path: typing.Union[str, pathlib.Path] provider = None )
Loads an ONNX Runtime inference session with the given provider. The default provider is CUDAExecutionProvider if a GPU is available, otherwise CPUExecutionProvider.
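For instance, a session can be created explicitly for a local ONNX file with the CPU provider; the file path below is an assumption for a previously exported model:
>>> from optimum.onnxruntime.modeling_ort import ORTModel
# load an inference session for a local ONNX file, forcing the CPU provider
>>> session = ORTModel.load_model(
...     "a_local_path_for_convert_onnx_model/model.onnx",
...     provider="CPUExecutionProvider",
... )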
ORTModelForFeatureExtraction
class optimum.onnxruntime.ORTModelForFeatureExtraction
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model for feature-extraction tasks, outputting the raw hidden states without any task-specific head on top.
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Feature Extraction model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForFeatureExtraction forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of feature extraction:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> list(last_hidden_state.shape)
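Since all-MiniLM-L6-v2 is a sentence-embedding model, the token-level hidden states are usually mean-pooled into a single vector per sentence. A minimal sketch, assuming the inputs and outputs from the example above:
>>> # mean-pool the token embeddings into one sentence vector, ignoring padding positions
>>> token_embeddings = outputs.last_hidden_state
>>> mask = inputs["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
>>> sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
>>> list(sentence_embedding.shape)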
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_extractor(text)
ORTModelForQuestionAnswering
class optimum.onnxruntime.ORTModelForQuestionAnswering
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model with a QuestionAnsweringModelOutput for extractive question-answering tasks like SQuAD.
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Question Answering model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForQuestionAnswering forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of question answering:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> outputs = model(**inputs)
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
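To turn the start and end logits into an answer string, a common follow-up is to take the argmax of each and decode the corresponding token span; a minimal sketch, assuming the inputs and outputs from the example above:
>>> # pick the most likely start/end positions and decode the answer span
>>> answer_start = torch.argmax(start_scores, dim=-1).item()
>>> answer_end = torch.argmax(end_scores, dim=-1).item() + 1
>>> answer = tokenizer.decode(inputs["input_ids"][0, answer_start:answer_end])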
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> pred = onnx_qa(question, text)
ORTModelForSequenceClassification
class optimum.onnxruntime.ORTModelForSequenceClassification
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Sequence Classification model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForSequenceClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of single-label classification:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
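The logits can be mapped to a human-readable label through the model config; a small sketch, assuming the config carries the usual id2label mapping:
>>> # map the highest-scoring class index to its label name
>>> predicted_class_id = logits.argmax(dim=-1).item()
>>> model.config.id2label[predicted_class_id]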
Example using transformers.pipelines:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> text = "Hello, my dog is cute"
>>> pred = onnx_classifier(text)
Example using transformers.pipelines for zero-shot classification:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> onnx_z0 = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)
>>> sequence_to_classify = "Who are you voting for in 2020?"
>>> candidate_labels = ["Europe", "public health", "politics", "elections"]
>>> pred = onnx_z0(sequence_to_classify, candidate_labels, multi_label=True)
ORTModelForTokenClassification
class optimum.onnxruntime.ORTModelForTokenClassification
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Token Classification model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForTokenClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of token classification:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
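The per-token logits can be turned into entity tags by taking the argmax over the class dimension and looking up each index in the config; a minimal sketch, assuming the inputs and outputs from the example above:
>>> # map each token's highest-scoring class index to its entity label
>>> predictions = logits.argmax(dim=-1)
>>> labels = [model.config.id2label[token_id.item()] for token_id in predictions[0]]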
Example using transformers.pipelines:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> onnx_ner = pipeline("token-classification", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_ner(text)
ORTModelForCausalLM
class optimum.onnxruntime.ORTModelForCausalLM
< source >( *args **kwargs )
Parameters
- config (transformers.PretrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX Model with a causal language modeling head on top (linear layer with weights tied to the input embeddings).
This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Causal LM model for ONNX.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of text generation:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs, do_sample=True, temperature=0.9, min_length=20, max_length=20)
>>> tokenizer.batch_decode(gen_tokens)
Example using transformers.pipelines:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> onnx_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> gen = onnx_gen(text)