Models
ORTModel
class optimum.onnxruntime.ORTModel
< source >( model: InferenceSession config: PretrainedConfig use_io_binding: typing.Optional[bool] = None model_save_dir: typing.Union[str, pathlib.Path, tempfile.TemporaryDirectory, NoneType] = None preprocessors: typing.Optional[typing.List] = None **kwargs )
Base class for implementing models using ONNX Runtime.
The ORTModel implements generic methods for interacting with the Hugging Face Hub as well as exporting vanilla Transformers models to ONNX using the optimum.exporters.onnx toolchain.
Class attributes:
- model_type (`str`, *optional*, defaults to `"onnx_model"`) — The name of the model type to use when registering the ORTModel classes.
- auto_model_class (`Type`, *optional*, defaults to `AutoModel`) — The "AutoModel" class represented by the current ORTModel class.
Common attributes:
- model (`ort.InferenceSession`) — The ONNX Runtime InferenceSession that is running the model.
- config (`PretrainedConfig`) — The configuration of the model.
- use_io_binding (`bool`, *optional*, defaults to `True`) — Whether to use I/O binding with ONNX Runtime with the CUDAExecutionProvider; this can significantly speed up inference depending on the task.
- model_save_dir (`Path`) — The directory where the model exported to ONNX is saved. By default, if the loaded model is local, the directory of the original model is used. Otherwise, the cache directory is used.
- providers (`List[str]`) — The list of execution providers available to ONNX Runtime.
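For instance, here is a minimal sketch of what these attributes look like when loading on GPU (assuming the onnxruntime-gpu package is installed so that the CUDAExecutionProvider is available; the checkpoint is the one used in the classification examples below):
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> # use_io_binding defaults to True when the provider is CUDAExecutionProvider
>>> model = ORTModelForSequenceClassification.from_pretrained(
...     "optimum/distilbert-base-uncased-finetuned-sst-2-english",
...     provider="CUDAExecutionProvider",
... )
>>> model.providers  # ONNX Runtime keeps CPUExecutionProvider as a fallback
['CUDAExecutionProvider', 'CPUExecutionProvider']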
from_pretrained
< source >(
model_id: typing.Union[str, pathlib.Path]
from_transformers: bool = False
force_download: bool = False
use_auth_token: typing.Optional[str] = None
cache_dir: typing.Optional[str] = None
subfolder: str = ''
config: typing.Optional[ForwardRef('PretrainedConfig')] = None
local_files_only: bool = False
provider: str = 'CPUExecutionProvider'
session_options: typing.Optional[onnxruntime.capi.onnxruntime_pybind11_state.SessionOptions] = None
provider_options: typing.Union[typing.Dict[str, typing.Any], NoneType] = None
**kwargs
)
→
ORTModel
Parameters
- model_id (`Union[str, Path]`) — Can be either:
  - A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
  - A path to a directory containing a model saved using `~OptimizedModel.save_pretrained`, e.g., `./my_model_directory/`.
- from_transformers (`bool`, *optional*, defaults to `False`) — Defines whether the provided `model_id` contains a vanilla Transformers checkpoint.
- force_download (`bool`, *optional*, defaults to `True`) — Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
- use_auth_token (`Optional[str]`, *optional*) — The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `transformers-cli login` (stored in `~/.huggingface`).
- cache_dir (`Optional[str]`, *optional*) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
- subfolder (`str`, *optional*, defaults to `""`) — In case the relevant files are located inside a subfolder of the model repo, either locally or on huggingface.co, you can specify the folder name here.
- config (`Optional[transformers.PretrainedConfig]`, *optional*) — The model configuration.
- local_files_only (`bool`, *optional*, defaults to `False`) — Whether or not to only look at local files (i.e., do not try to download the model).
- provider (`str`, *optional*, defaults to `"CPUExecutionProvider"`) — ONNX Runtime provider to use for loading the model. See https://onnxruntime.ai/docs/execution-providers/ for possible providers.
- session_options (`Optional[onnxruntime.SessionOptions]`, *optional*) — ONNX Runtime session options to use for loading the model.
- provider_options (`Optional[Dict[str, Any]]`, *optional*) — Provider option dictionaries corresponding to the provider used. See available options for each provider: https://onnxruntime.ai/docs/api/c/group___global.html
- kwargs (`Dict[str, Any]`) — Will be passed to the underlying model loading methods.
Returns
ORTModel
The loaded ORTModel.
Instantiate a pretrained model from a pre-trained model configuration.
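As a minimal sketch, loading an already-exported ONNX checkpoint and exporting a vanilla Transformers checkpoint on the fly look like this (the checkpoints are the ones used in the examples below):
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> # Load a repository that already contains an exported ONNX model
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> # Export a vanilla Transformers checkpoint to ONNX while loading
>>> model = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True)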
load_model
< source >( path: typing.Union[str, pathlib.Path] provider: str = 'CPUExecutionProvider' session_options: typing.Optional[onnxruntime.capi.onnxruntime_pybind11_state.SessionOptions] = None provider_options: typing.Union[typing.Dict[str, typing.Any], NoneType] = None )
Parameters
- path (`Union[str, Path]`) — Path of the ONNX model.
- provider (`str`, *optional*, defaults to `"CPUExecutionProvider"`) — ONNX Runtime provider to use for loading the model. See https://onnxruntime.ai/docs/execution-providers/ for possible providers.
- session_options (`Optional[onnxruntime.SessionOptions]`, *optional*) — ONNX Runtime session options to use for loading the model.
- provider_options (`Optional[Dict[str, Any]]`, *optional*) — Provider option dictionary corresponding to the provider used. See available options for each provider: https://onnxruntime.ai/docs/api/c/group___global.html
Loads an ONNX Runtime inference session with a given provider. The default provider is CPUExecutionProvider, to match the default behaviour in PyTorch/TensorFlow/JAX.
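A minimal sketch, assuming a model was previously exported to a hypothetical local file ./my_model_directory/model.onnx:
>>> from optimum.onnxruntime import ORTModel
>>> # Returns an onnxruntime.InferenceSession running on CPU
>>> session = ORTModel.load_model(
...     "./my_model_directory/model.onnx",  # hypothetical path
...     provider="CPUExecutionProvider",
... )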
to
< source >(
device: typing.Union[torch.device, str, int]
)
→
ORTModel
Changes the ONNX Runtime provider according to the device.
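A minimal sketch, assuming the onnxruntime-gpu package is installed so that the CUDAExecutionProvider is available:
>>> # Switches the session to the CUDAExecutionProvider
>>> model = model.to("cuda")
>>> # Switches back to the CPUExecutionProvider
>>> model = model.to("cpu")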
ORTModelForCausalLM
class optimum.onnxruntime.ORTModelForCausalLM
< source >( decoder_session: InferenceSession config: PretrainedConfig decoder_with_past_session: typing.Optional[onnxruntime.capi.onnxruntime_inference_collection.InferenceSession] = None use_io_binding: typing.Optional[bool] = None model_save_dir: typing.Union[str, pathlib.Path, tempfile.TemporaryDirectory, NoneType] = None preprocessors: typing.Optional[typing.List] = None **kwargs )
ONNX model with a causal language modeling head for ONNX Runtime inference.
forward
< source >( input_ids: LongTensor = None attention_mask: typing.Optional[torch.FloatTensor] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None **kwargs )
Parameters
- input_ids (`torch.LongTensor`) — Indices of decoder input sequence tokens in the vocabulary, of shape `(batch_size, sequence_length)`.
- attention_mask (`torch.LongTensor`) — Mask to avoid performing attention on padding token indices, of shape `(batch_size, sequence_length)`. Mask values selected in `[0, 1]`.
- past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*) — Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. The tuple is of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`.
The ORTModelForCausalLM forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of text generation:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs, do_sample=True, temperature=0.9, min_length=20, max_length=20)
>>> tokenizer.batch_decode(gen_tokens)
Example using `transformers.pipelines`:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> onnx_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> text = "My name is Arthur and I live in"
>>> gen = onnx_gen(text)
ORTModelForCustomTasks
class optimum.onnxruntime.ORTModelForCustomTasks
< source >( model = None config = None use_io_binding = None **kwargs )
Parameters
- config (`transformers.PretrainedConfig`) — The model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (`onnxruntime.InferenceSession`) — The main class used to run a model. Check out the load_model() method for more information.
- use_io_binding (`bool`, *optional*) — Whether to use IOBinding during inference to avoid memory copy between the host and device. Defaults to `True` if the device is CUDA, otherwise defaults to `False`.
ONNX model for any custom task. It can be used to leverage the inference acceleration for any single-file ONNX model.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
The ORTModelForCustomTasks forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of a custom task (e.g. a sentence transformers model taking `pooler_output` as output):
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCustomTasks
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/sbert-all-MiniLM-L6-with-pooler")
>>> model = ORTModelForCustomTasks.from_pretrained("optimum/sbert-all-MiniLM-L6-with-pooler")
>>> inputs = tokenizer("I love burritos!", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooler_output = outputs.pooler_output
Example using `transformers.pipelines` (only if the task is supported):
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForCustomTasks
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/sbert-all-MiniLM-L6-with-pooler")
>>> model = ORTModelForCustomTasks.from_pretrained("optimum/sbert-all-MiniLM-L6-with-pooler")
>>> onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
>>> text = "I love burritos!"
>>> pred = onnx_extractor(text)
ORTModelForFeatureExtraction
class optimum.onnxruntime.ORTModelForFeatureExtraction
< source >( model = None config = None use_io_binding = None **kwargs )
Parameters
- config (`transformers.PretrainedConfig`) — The model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (`onnxruntime.InferenceSession`) — The main class used to run a model. Check out the load_model() method for more information.
- use_io_binding (`bool`, *optional*) — Whether to use IOBinding during inference to avoid memory copy between the host and device. Defaults to `True` if the device is CUDA, otherwise defaults to `False`.
ONNX model for feature-extraction tasks, returning the raw hidden states of the model.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using `AutoTokenizer`. See `PreTrainedTokenizer.encode` and `PreTrainedTokenizer.__call__` for details. What are input IDs?
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForFeatureExtraction forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of feature extraction:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> list(last_hidden_state.shape)
Example using `transformers.pipeline`:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_extractor(text)
ORTModelForImageClassification
class optimum.onnxruntime.ORTModelForImageClassification
< source >( model = None config = None use_io_binding = None **kwargs )
Parameters
- config (`transformers.PretrainedConfig`) — The model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (`onnxruntime.InferenceSession`) — The main class used to run a model. Check out the load_model() method for more information.
- use_io_binding (`bool`, *optional*) — Whether to use IOBinding during inference to avoid memory copy between the host and device. Defaults to `True` if the device is CUDA, otherwise defaults to `False`.
ONNX model for image-classification tasks.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
forward
< source >( pixel_values: Tensor **kwargs )
Parameters
- pixel_values (`torch.Tensor` of shape `(batch_size, num_channels, height, width)`) — Pixel values corresponding to the images in the current batch. Pixel values can be obtained from encoded images using `AutoFeatureExtractor`.
The ORTModelForImageClassification forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of image classification:
>>> import requests
>>> from PIL import Image
>>> from optimum.onnxruntime import ORTModelForImageClassification
>>> from transformers import AutoFeatureExtractor
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> preprocessor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
>>> model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")
>>> inputs = preprocessor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
Example using `transformers.pipeline`:
>>> import requests
>>> from PIL import Image
>>> from transformers import AutoFeatureExtractor, pipeline
>>> from optimum.onnxruntime import ORTModelForImageClassification
>>> preprocessor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
>>> model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")
>>> onnx_image_classifier = pipeline("image-classification", model=model, feature_extractor=preprocessor)
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> pred = onnx_image_classifier(url)
ORTModelForQuestionAnswering
class optimum.onnxruntime.ORTModelForQuestionAnswering
< source >( model = None config = None use_io_binding = None **kwargs )
Parameters
- config (`transformers.PretrainedConfig`) — The model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (`onnxruntime.InferenceSession`) — The main class used to run a model. Check out the load_model() method for more information.
- use_io_binding (`bool`, *optional*) — Whether to use IOBinding during inference to avoid memory copy between the host and device. Defaults to `True` if the device is CUDA, otherwise defaults to `False`.
ONNX model with a QuestionAnsweringModelOutput for extractive question-answering tasks like SQuAD.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using `AutoTokenizer`. See `PreTrainedTokenizer.encode` and `PreTrainedTokenizer.__call__` for details. What are input IDs?
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForQuestionAnswering forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of question answering:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> outputs = model(**inputs)
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
Example using `transformers.pipeline`:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> pred = onnx_qa(question, text)
ORTModelForSemanticSegmentation
class optimum.onnxruntime.ORTModelForSemanticSegmentation
< source >( model = None config = None use_io_binding = None **kwargs )
Parameters
- config (`transformers.PretrainedConfig`) — The model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (`onnxruntime.InferenceSession`) — The main class used to run a model. Check out the load_model() method for more information.
- use_io_binding (`bool`, *optional*) — Whether to use IOBinding during inference to avoid memory copy between the host and device. Defaults to `True` if the device is CUDA, otherwise defaults to `False`.
ONNX model with an all-MLP decode head on top, e.g. for ADE20k or CityScapes.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
forward
< source >( **kwargs )
Parameters
- pixel_values (`torch.Tensor` of shape `(batch_size, num_channels, height, width)`) — Pixel values corresponding to the images in the current batch. Pixel values can be obtained from encoded images using `AutoFeatureExtractor`.
The ORTModelForSemanticSegmentation forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of semantic segmentation:
>>> import requests
>>> from PIL import Image
>>> from optimum.onnxruntime import ORTModelForSemanticSegmentation
>>> from transformers import AutoFeatureExtractor
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> preprocessor = AutoFeatureExtractor.from_pretrained("optimum/segformer-b0-finetuned-ade-512-512")
>>> model = ORTModelForSemanticSegmentation.from_pretrained("optimum/segformer-b0-finetuned-ade-512-512")
>>> inputs = preprocessor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
Example using `transformers.pipeline`:
>>> import requests
>>> from PIL import Image
>>> from transformers import AutoFeatureExtractor, pipeline
>>> from optimum.onnxruntime import ORTModelForSemanticSegmentation
>>> preprocessor = AutoFeatureExtractor.from_pretrained("optimum/segformer-b0-finetuned-ade-512-512")
>>> model = ORTModelForSemanticSegmentation.from_pretrained("optimum/segformer-b0-finetuned-ade-512-512")
>>> onnx_image_segmenter = pipeline("image-segmentation", model=model, feature_extractor=preprocessor)
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> pred = onnx_image_segmenter(url)
ORTModelForSeq2SeqLM
class optimum.onnxruntime.ORTModelForSeq2SeqLM
< source >( encoder_session: InferenceSession decoder_session: InferenceSession config: PretrainedConfig decoder_with_past_session: typing.Optional[onnxruntime.capi.onnxruntime_inference_collection.InferenceSession] = None use_io_binding: typing.Optional[bool] = None model_save_dir: typing.Union[str, pathlib.Path, tempfile.TemporaryDirectory, NoneType] = None preprocessors: typing.Optional[typing.List] = None **kwargs )
Sequence-to-sequence model with a language modeling head for ONNX Runtime inference.
forward
< source >( input_ids: LongTensor = None attention_mask: typing.Optional[torch.FloatTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None labels: typing.Optional[torch.LongTensor] = None **kwargs )
Parameters
- input_ids (`torch.LongTensor`) — Indices of input sequence tokens in the vocabulary, of shape `(batch_size, encoder_sequence_length)`.
- attention_mask (`torch.LongTensor`) — Mask to avoid performing attention on padding token indices, of shape `(batch_size, encoder_sequence_length)`. Mask values selected in `[0, 1]`.
- decoder_input_ids (`torch.LongTensor`) — Indices of decoder input sequence tokens in the vocabulary, of shape `(batch_size, decoder_sequence_length)`.
- encoder_outputs (`torch.FloatTensor`) — The encoder `last_hidden_state`, of shape `(batch_size, encoder_sequence_length, hidden_size)`.
- past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*) — Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. The tuple is of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
The ORTModelForSeq2SeqLM forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of text generation:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
>>> model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
>>> inputs = tokenizer("My name is Eustache and I like to", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs)
>>> outputs = tokenizer.batch_decode(gen_tokens)
Example using `transformers.pipeline`:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
>>> model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
>>> onnx_translation = pipeline("translation_en_to_de", model=model, tokenizer=tokenizer)
>>> text = "My name is Eustache."
>>> pred = onnx_translation(text)
ORTModelForSequenceClassification
class optimum.onnxruntime.ORTModelForSequenceClassification
< source >( model = None config = None use_io_binding = None **kwargs )
Parameters
- config (`transformers.PretrainedConfig`) — The model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (`onnxruntime.InferenceSession`) — The main class used to run a model. Check out the load_model() method for more information.
- use_io_binding (`bool`, *optional*) — Whether to use IOBinding during inference to avoid memory copy between the host and device. Defaults to `True` if the device is CUDA, otherwise defaults to `False`.
ONNX model with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using `AutoTokenizer`. See `PreTrainedTokenizer.encode` and `PreTrainedTokenizer.__call__` for details. What are input IDs?
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForSequenceClassification forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of single-label classification:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
Example using `transformers.pipelines`:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> text = "Hello, my dog is cute"
>>> pred = onnx_classifier(text)
Example of zero-shot classification using `transformers.pipelines`:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> onnx_z0 = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)
>>> sequence_to_classify = "Who are you voting for in 2020?"
>>> candidate_labels = ["Europe", "public health", "politics", "elections"]
>>> pred = onnx_z0(sequence_to_classify, candidate_labels, multi_class=True)
ORTModelForSpeechSeq2Seq
class optimum.onnxruntime.ORTModelForSpeechSeq2Seq
< source >( encoder_session: InferenceSession decoder_session: InferenceSession config: PretrainedConfig decoder_with_past_session: typing.Optional[onnxruntime.capi.onnxruntime_inference_collection.InferenceSession] = None use_io_binding: typing.Optional[bool] = None model_save_dir: typing.Union[str, pathlib.Path, tempfile.TemporaryDirectory, NoneType] = None preprocessors: typing.Optional[typing.List] = None **kwargs )
Speech Sequence-to-sequence model with a language modeling head for ONNX Runtime inference.
forward
< source >( input_features: typing.Optional[torch.FloatTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None labels: typing.Optional[torch.LongTensor] = None **kwargs )
Parameters
- input_features (`torch.FloatTensor`) — Mel features extracted from the raw speech waveform, of shape `(batch_size, feature_size, encoder_sequence_length)`.
- decoder_input_ids (`torch.LongTensor`) — Indices of decoder input sequence tokens in the vocabulary, of shape `(batch_size, decoder_sequence_length)`.
- encoder_outputs (`torch.FloatTensor`) — The encoder `last_hidden_state`, of shape `(batch_size, encoder_sequence_length, hidden_size)`.
- past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*) — Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. The tuple is of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
The ORTModelForSpeechSeq2Seq forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of speech transcription:
>>> from transformers import AutoProcessor
>>> from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
>>> from datasets import load_dataset
>>> processor = AutoProcessor.from_pretrained("optimum/whisper-tiny.en")
>>> model = ORTModelForSpeechSeq2Seq.from_pretrained("optimum/whisper-tiny.en")
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> inputs = processor.feature_extractor(ds[0]["audio"]["array"], return_tensors="pt")
>>> gen_tokens = model.generate(inputs=inputs.input_features)
>>> outputs = processor.tokenizer.batch_decode(gen_tokens)
Example using `transformers.pipeline`:
>>> from transformers import AutoProcessor, pipeline
>>> from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
>>> from datasets import load_dataset
>>> processor = AutoProcessor.from_pretrained("optimum/whisper-tiny.en")
>>> model = ORTModelForSpeechSeq2Seq.from_pretrained("optimum/whisper-tiny.en")
>>> speech_recognition = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor)
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> pred = speech_recognition(ds[0]["audio"]["array"])
ORTModelForTokenClassification
class optimum.onnxruntime.ORTModelForTokenClassification
< source >( model = None config = None use_io_binding = None **kwargs )
Parameters
- config (`transformers.PretrainedConfig`) — The model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (`onnxruntime.InferenceSession`) — The main class used to run a model. Check out the load_model() method for more information.
- use_io_binding (`bool`, *optional*) — Whether to use IOBinding during inference to avoid memory copy between the host and device. Defaults to `True` if the device is CUDA, otherwise defaults to `False`.
ONNX model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )
Parameters
- input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using `AutoTokenizer`. See `PreTrainedTokenizer.encode` and `PreTrainedTokenizer.__call__` for details. What are input IDs?
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked. What are attention masks?
- token_type_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B. What are token type IDs?
The ORTModelForTokenClassification forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of token classification:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
Example using `transformers.pipelines`:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> onnx_ner = pipeline("token-classification", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_ner(text)