Optimum Inference with ONNX Runtime

Optimum is a utility package for building and running inference with accelerated runtimes such as ONNX Runtime. You can use it to load optimized models from the Hugging Face Hub and create pipelines that run accelerated inference without rewriting your APIs.

Loading

Transformers models

Once your model has been exported to the ONNX format, you can load it by replacing AutoModelForXxx with the corresponding ORTModelForXxx class.

  from transformers import AutoTokenizer, pipeline
- from transformers import AutoModelForCausalLM
+ from optimum.onnxruntime import ORTModelForCausalLM

- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B") # PyTorch checkpoint
+ model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx") # ONNX checkpoint
  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

  pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  result = pipe("He never went out without a book under his arm")

For more information on all the supported ORTModelForXxx classes, see our documentation.
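
The renaming convention can be sketched as a small helper. This is illustrative only: `ort_class_name` is a hypothetical function, not part of Optimum, and it only rewrites the name. Not every AutoModelForXxx class necessarily has an ONNX Runtime counterpart, so check the documentation before relying on a generated name.

```python
def ort_class_name(auto_class_name: str) -> str:
    """Map a Transformers Auto class name to the matching ORTModel class name.

    Hypothetical helper for illustration; it does not check that the
    resulting class actually exists in optimum.onnxruntime.
    """
    prefix = "AutoModel"
    if not auto_class_name.startswith(prefix):
        raise ValueError(
            f"Expected a name starting with {prefix!r}, got {auto_class_name!r}"
        )
    if auto_class_name == prefix:
        # Plain AutoModel corresponds to the feature-extraction class.
        return "ORTModelForFeatureExtraction"
    # Keep the task suffix ("ForCausalLM", ...) and swap the prefix.
    return "ORTModel" + auto_class_name[len(prefix):]


print(ort_class_name("AutoModelForCausalLM"))  # ORTModelForCausalLM
```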

Transformers pipelines

You can also load your ONNX model with ONNX Runtime pipelines by replacing transformers.pipeline with optimum.onnxruntime.pipeline.

- from transformers import pipeline
+ from optimum.onnxruntime import pipeline

  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
  nlp_pipeline = pipeline("sentiment-analysis", model=model_id)
  result = nlp_pipeline("I've been waiting for a HuggingFace course my whole life.")

For more information on all the supported ORTXxxPipeline classes, see our documentation.
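
For a classification task such as the one above, the pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below shows roughly how such an entry can be derived from raw logits. It assumes a single-label model with softmax post-processing; the function name, the `id2label` mapping, and the logits are made up for illustration and are not real model output.

```python
import math


def logits_to_prediction(logits: list[float], id2label: dict[int, str]) -> dict:
    """Turn raw classification logits into a {"label", "score"} dictionary."""
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Made-up logits for a two-class sentiment model (assumed label order).
prediction = logits_to_prediction([-3.2, 3.9], {0: "NEGATIVE", 1: "POSITIVE"})
print(prediction["label"], round(prediction["score"], 4))
```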

Diffusers models

Once your model has been exported to the ONNX format, you can load it by replacing DiffusionPipeline with the corresponding ORTDiffusionPipeline class.

- from diffusers import DiffusionPipeline
+ from optimum.onnxruntime import ORTDiffusionPipeline

  model_id = "runwayml/stable-diffusion-v1-5"
- pipeline = DiffusionPipeline.from_pretrained(model_id)
+ pipeline = ORTDiffusionPipeline.from_pretrained(model_id, export=True)
  prompt = "sailing ship in storm by Leonardo da Vinci"
  image = pipeline(prompt).images[0]

For more information on all the supported ORTXxxPipeline classes, see our documentation.

Sentence Transformers models

Once your model has been exported to the ONNX format, you can load it by replacing AutoModel with the corresponding ORTModelForFeatureExtraction class.

  from transformers import AutoTokenizer
- from transformers import AutoModel
+ from optimum.onnxruntime import ORTModelForFeatureExtraction

  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
- model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
+ model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
  inputs = tokenizer("This is an example sentence", return_tensors="pt")
  outputs = model(**inputs)
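
The model call above yields one embedding per token (expected in `outputs.last_hidden_state`), whereas a sentence embedding is a single vector. A common way to obtain one, and the approach described on the all-MiniLM-L6-v2 model card, is mask-aware mean pooling followed by L2 normalization. The sketch below uses plain Python lists and toy values so that it runs without any dependencies; in practice the same steps would be done with tensor operations. The function names are illustrative, not part of any library.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [s / count for s in summed]


def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


# Toy 3-token sequence with 2-dimensional embeddings; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
print(sentence_embedding)
```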

You can also load your ONNX model directly with the sentence_transformers.SentenceTransformer class; just make sure that sentence-transformers>=3.2 is installed. If the model has not already been converted to ONNX, it is converted automatically on the fly.

  from sentence_transformers import SentenceTransformer

  model_id = "sentence-transformers/all-MiniLM-L6-v2"
- model = SentenceTransformer(model_id)
+ model = SentenceTransformer(model_id, backend="onnx")

  sentences = ["This is an example sentence", "Each sentence is converted"]
  embeddings = model.encode(sentences)

Timm models

Once your model has been exported to the ONNX format, you can load it by replacing create_model with the corresponding ORTModelForImageClassification class.

  import requests
  from PIL import Image
- from timm import create_model
  from timm.data import resolve_data_config, create_transform
+ from optimum.onnxruntime import ORTModelForImageClassification

- model = create_model("timm/mobilenetv3_large_100.ra_in1k", pretrained=True)
+ model = ORTModelForImageClassification.from_pretrained("optimum/mobilenetv3_large_100.ra_in1k")
  transform = create_transform(**resolve_data_config(model.config.pretrained_cfg, model=model))
  url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"
  image = Image.open(requests.get(url, stream=True).raw)
  inputs = transform(image).unsqueeze(0)
  outputs = model(inputs)

Converting your model to ONNX on-the-fly

If your model has not been converted to ONNX yet, ORTModel can convert it on the fly. Pass export=True to the from_pretrained() method, and your model is loaded and converted to ONNX in one step:

>>> from optimum.onnxruntime import ORTModelForSequenceClassification

>>> # Load the model from the hub and export it to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

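When loading from a local directory, you may want to export only if no ONNX file is present yet. The sketch below is one way to check, under stated assumptions: `needs_export` is a hypothetical helper, not part of Optimum, and it only inspects local folders. Checking a Hub repository would require listing its files instead.

```python
import tempfile
from pathlib import Path


def needs_export(model_dir: str) -> bool:
    """Return True when no .onnx file exists anywhere under model_dir."""
    return not any(Path(model_dir).rglob("*.onnx"))


# Demonstrate on a throwaway directory (no real model involved).
with tempfile.TemporaryDirectory() as tmp:
    empty_dir_result = needs_export(tmp)  # nothing exported yet
    (Path(tmp) / "model.onnx").touch()  # stand-in for an exported model
    exported_dir_result = needs_export(tmp)

print(empty_dir_result, exported_dir_result)
```

The result could then be passed along as `export=needs_export(path)` when calling from_pretrained() on that path.
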
Pushing your model to the Hub

You can also call push_to_hub directly on your model to upload it to the Hub.

>>> from optimum.onnxruntime import ORTModelForSequenceClassification

>>> # Load the model from the hub and export it to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

>>> # Save the converted model locally
>>> output_dir = "a_local_path_for_convert_onnx_model"
>>> model.save_pretrained(output_dir)

>>> # Push the ONNX model to the Hub
>>> model.push_to_hub(output_dir, repository_id="my-onnx-repo")
