Optimum documentation

Optimum Inference with ONNX Runtime

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Optimum Inference with ONNX Runtime

Optimum is a utility package for building and running inference with accelerated runtime like ONNX Runtime. Optimum can be used to load optimized models from the Hugging Face Hub and create pipelines to run accelerated inference without rewriting your APIs.

Switching from Transformers to Optimum Inference

The Optimum Inference models are API compatible with Hugging Face Transformers models. This means you can just replace your AutoModelForXxx class with the corresponding ORTModelForXxx class in optimum. For example, this is how you can use a question answering model in optimum:

from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # pytorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # onnx checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering",model=model,tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)

Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones. Simply pass from_transformers=True to the from_pretrained() method, and your model will be loaded and converted to ONNX on-the-fly:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification

# Load the model from the hub and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Create a pipeline
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

>>> result = onnx_classifier("This is a great model")
[{'label': 'POSITIVE', 'score': 0.9998838901519775}]

Working with the Hugging Face Model Hub

The Optimum model classes like ORTModelForSequenceClassification are integrated with the Hugging Face Model Hub, which means you can not only load model from the Hub, but also push your models to the Hub with push_to_hub() method. Below is an example which downloads a vanilla Transformers model from the Hub and converts it to an optimum onnxruntime model and pushes it back into a new repository.

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification

# Load the model from the hub and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Save the converted model
>>> model.save_pretrained("a_local_path_for_convert_onnx_model")
>>> tokenizer.save_pretrained("a_local_path_for_convert_onnx_model")

# Push the onnx model to HF Hub
>>> model.push_to_hub("a_local_path_for_convert_onnx_model", repository_id="my-onnx-repo", use_auth_token=True)

Export and inference of sequence-to-sequence models

Sequence-to-sequence (Seq2Seq) models, that generate a new sequence from an input, can also be used when running inference with ONNX Runtime. When Seq2Seq models are exported to the ONNX format, they are decomposed into three parts that are later combined during inference. Those three parts consist of the encoder, the “decoder” (which actually consists of the decoder with the language modeling head), and the “decoder” with pre-computed key/values as additional inputs. This specific export comes from the fact that during the first pass, the decoder has no pre-computed key/values hidden-states, while during the rest of the generation past key/values will be used to speed up sequential decoding. Here is an example on how you can export a T5 model to the ONNX format and run inference for a translation task:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load the model from the hub and export it to the ONNX format
>>> model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Create a pipeline
>>> onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)

>>> result = onnx_translation("My name is Eustache")
[{'translation_text': 'Mon nom est Eustache'}]

ORTModel

class optimum.onnxruntime.ORTModel

< >

( model = None config = None **kwargs )

Base ORTModel class for implementing models using ONNX Runtime. The ORTModel implements generic methods for interacting with the Hugging Face Hub as well as exporting vanilla transformers models to ONNX using transformers.onnx toolchain. The ORTModel implements additionally generic methods for optimizing and quantizing Onnx models.

load_model

< >

( path: typing.Union[str, pathlib.Path] provider = None )

Parameters

  • path (str or Path) — Directory from which to load the model.
  • provider(str, optional) — ONNX Runtime provider to use for loading the model. Defaults to CPUExecutionProvider.

Loads an ONNX Inference session with a given provider. Default provider is CPUExecutionProvider to match the default behaviour in PyTorch/TensorFlow/JAX.

to

< >

( device )

Changes the ONNX Runtime provider according to the device.

ORTModelForFeatureExtraction

class optimum.onnxruntime.ORTModelForFeatureExtraction

< >

( model = None config = None **kwargs )

Parameters

  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.

Onnx Model with a MaskedLMOutput for feature-extraction tasks.

This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving)

Feature Extraction model for ONNX.

forward

< >

( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The ORTModelForFeatureExtraction forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of feature extraction:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")

>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)

Example using transformers.pipeline:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_extractor(text)

ORTModelForQuestionAnswering

class optimum.onnxruntime.ORTModelForQuestionAnswering

< >

( model = None config = None **kwargs )

Parameters

  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.

Onnx Model with a QuestionAnsweringModelOutput for extractive question-answering tasks like SQuAD.

This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving)

Question Answering model for ONNX.

forward

< >

( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The ORTModelForQuestionAnswering forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of question answering:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])

>>> outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits

Example using transformers.pipeline:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> pred = onnx_qa(question, text)

ORTModelForSequenceClassification

class optimum.onnxruntime.ORTModelForSequenceClassification

< >

( model = None config = None **kwargs )

Parameters

  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.

Onnx Model with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.

This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving)

Sequence Classification model for ONNX.

forward

< >

( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The ORTModelForSequenceClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of single-label classification:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)

Example using transformers.pipelines:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

>>> text = "Hello, my dog is cute"
>>> pred = onnx_classifier(text)

Example using zero-shot-classification transformers.pipelines:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> onnx_z0 = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

>>> sequence_to_classify = "Who are you voting for in 2020?"
>>> candidate_labels = ["Europe", "public health", "politics", "elections"]
>>> pred = onnx_z0(sequence_to_classify, candidate_labels, multi_class=True)

ORTModelForTokenClassification

class optimum.onnxruntime.ORTModelForTokenClassification

< >

( model = None config = None **kwargs )

Parameters

  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.

Onnx Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving)

Token Classification model for ONNX.

forward

< >

( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None **kwargs )

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The ORTModelForTokenClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of token classification:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")

>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)

Example using transformers.pipelines:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> onnx_ner = pipeline("token-classification", model=model, tokenizer=tokenizer)

>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_ner(text)

ORTModelForCausalLM

class optimum.onnxruntime.ORTModelForCausalLM

< >

( model = None config = None **kwargs )

Parameters

  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.

Onnx Model with a causal language modeling head on top (linear layer with weights tied to the input embeddings).

This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving)

Causal LM model for ONNX.

forward

< >

( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None **kwargs )

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The ORTModelForCausalLM forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of text generation:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")

>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")

>>> gen_tokens = model.generate(**inputs,do_sample=True,temperature=0.9, min_length=20,max_length=20)
>>> tokenizer.batch_decode(gen_tokens)

Example using transformers.pipelines:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> onnx_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

>>> text = "My name is Philipp and I live in Germany."
>>> gen = onnx_gen(text)

ORTModelForSeq2SeqLM

class optimum.onnxruntime.ORTModelForSeq2SeqLM

< >

( *args **kwargs )

Sequence-to-sequence model with a language modeling head for ONNX Runtime inference.

forward

< >

( input_ids: LongTensor = None attention_mask: typing.Optional[torch.FloatTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None **kwargs )

Parameters

  • input_ids (torch.LongTensor) — Indices of input sequence tokens in the vocabulary of shape (batch_size, encoder_sequence_length).
  • attention_mask (torch.LongTensor) — Mask to avoid performing attention on padding token indices, of shape (batch_size, encoder_sequence_length). Mask values selected in [0, 1].
  • decoder_input_ids (torch.LongTensor) — Indices of decoder input sequence tokens in the vocabulary of shape (batch_size, decoder_sequence_length).
  • encoder_outputs (torch.FloatTensor) — The encoder last_hidden_state of shape (batch_size, encoder_sequence_length, hidden_size).
  • past_key_values (tuple(tuple(torch.FloatTensor), *optional*) — Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. The tuple is of length config.n_layers with each tuple having 2 tensors of shape (batch_size, num_heads, decoder_sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

The ORTModelForSeq2SeqLM forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of text generation:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
>>> model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")

>>> inputs = tokenizer("My name is Eustache and I like to", return_tensors="pt")

>>> gen_tokens = model.generate(**inputs)
>>> outputs = tokenizer.batch_decode(gen_tokens)

Example using transformers.pipeline:

>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
>>> model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
>>> onnx_translation = pipeline("translation_en_to_de", model=model, tokenizer=tokenizer)

>>> text = "My name is Eustache."
>>> pred = onnx_translation(text)

ORTModelForImageClassification

class optimum.onnxruntime.ORTModelForImageClassification

< >

( model = None config = None **kwargs )

Parameters

  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • model (onnxruntime.InferenceSession) — onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.

Onnx Model for image-classification tasks.

This model inherits from [~onnxruntime.modeling_ort.ORTModel]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving)

Image Classification model for ONNX.

forward

< >

( pixel_values: Tensor **kwargs )

Parameters

  • pixel_values (torch.Tensor of shape (batch_size, num_channels, height, width)) — Pixel values corresponding to the images in the current batch. Pixel values can be obtained from encoded images using AutoFeatureExtractor.

The ORTModelForImageClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of image classification:

>>> import requests
>>> from PIL import Image
>>> from optimum.onnxruntime import ORTModelForImageClassification
>>> from transformers import AutoFeatureExtractor

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> preprocessor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
>>> model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")

>>> inputs = preprocessor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits

Example using transformers.pipeline:

>>> import requests
>>> from PIL import Image
>>> from transformers import AutoFeatureExtractor, pipeline
>>> from optimum.onnxruntime import ORTModelForImageClassification

>>> preprocessor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
>>> model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")
>>> onnx_image_classifier = pipeline("image-classification", model=model, feature_extractor=preprocessor)

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> pred = onnx_image_classifier(url)