Export a model to ExecuTorch with optimum.exporters.executorch
If you need to deploy 🤗 Transformers models for on-device use cases, we recommend exporting them to a serialized format that can be distributed and executed on specialized runtimes and hardware. In this guide, we’ll show you how to export these models to ExecuTorch.
Why ExecuTorch?
ExecuTorch is the ideal solution for deploying PyTorch models on edge devices, offering a streamlined process from export to deployment without leaving the PyTorch ecosystem.
Supporting on-device AI presents unique challenges: diverse hardware, strict power budgets, low or no internet connectivity, and real-time processing needs. These constraints have historically prevented or slowed down the creation of scalable and performant on-device AI solutions. We designed ExecuTorch, backed by our industry partners like Meta, Arm, Apple, Qualcomm, and MediaTek, to be highly portable and to provide superior developer productivity without sacrificing performance.
Summary
Exporting a PyTorch model to ExecuTorch is as simple as:
optimum-cli export executorch \
--model HuggingFaceTB/SmolLM2-135M \
--task text-generation \
--recipe xnnpack \
--output_dir hf_smollm2 \
--use_custom_sdpa
Check out the help for more options:
optimum-cli export executorch --help
Exporting a model to ExecuTorch using the CLI
The Optimum ExecuTorch export can be used through the Optimum command-line interface:
optimum-cli export executorch --help
usage: optimum-cli export executorch [-h] -m MODEL [-o OUTPUT_DIR] [--task TASK] [--recipe RECIPE] [--use_custom_sdpa]
options:
-h, --help show this help message and exit
Required arguments:
-m MODEL, --model MODEL
Model ID on huggingface.co or path on disk to load model from.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Path indicating the directory where to store the generated ExecuTorch model.
--task TASK The task to export the model for. Available tasks depend on the model, but are among: ['audio-classification', 'feature-extraction', 'image-to-text', 'sentence-similarity', 'depth-estimation', 'image-segmentation', 'audio-frame-classification', 'masked-im', 'semantic-segmentation', 'text-classification', 'audio-xvector', 'mask-generation', 'question-answering', 'text-to-audio', 'automatic-speech-recognition', 'image-to-image', 'multiple-choice', 'image-classification', 'text2text-generation', 'token-classification', 'object-detection', 'zero-shot-object-detection', 'zero-shot-image-classification', 'text-generation', 'fill-mask'].
--recipe RECIPE Pre-defined recipes for export to ExecuTorch. Defaults to "xnnpack".
--use_custom_sdpa For decoder-only models to use custom sdpa with static kv cache to boost performance. Defaults to False.
You should see the model.pte file stored under "./hf_smollm2/":

hf_smollm2/
└── model.pte
This fetches the model from the Hub and exports it with the specified recipe. The resulting model.pte file can then be run on the XNNPACK backend, or on many other ExecuTorch-supported backends if exported with a different recipe, e.g. Apple's Core ML or MPS, Qualcomm's SoCs, Arm's Ethos-U, Xtensa HiFi4 DSP, Vulkan GPU, MediaTek, etc.
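For instance, here is a sketch of re-exporting the same model with a different recipe. Note that coreml below is an assumed recipe name used for illustration; check optimum-cli export executorch --help for the recipes actually available in your installed version:

# NOTE: "coreml" is an assumed recipe name; run
# `optimum-cli export executorch --help` to list available recipes.
optimum-cli export executorch \
    --model HuggingFaceTB/SmolLM2-135M \
    --task text-generation \
    --recipe coreml \
    --output_dir hf_smollm2_coreml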
For example, we can load and run the XNNPACK-exported model with the ExecuTorch runtime using the optimum.executorch package as follows:
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

# Load the tokenizer from the Hub and the exported model from the local directory
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = ExecuTorchModelForCausalLM.from_pretrained("hf_smollm2/")

# Run text generation on the exported model
prompt = "Simply put, the theory of relativity states that"
print(f"\nGenerated texts:\n\t{model.text_generation(tokenizer=tokenizer, prompt=prompt, max_seq_len=45)}")
As you can see, converting a model to ExecuTorch does not mean leaving the Hugging Face ecosystem. You end up with an API similar to that of regular 🤗 Transformers models!
If your model hasn't already been exported to ExecuTorch, it can also be converted on the fly when loading:
from optimum.executorch import ExecuTorchModelForCausalLM
model = ExecuTorchModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M", recipe="xnnpack", attn_implementation="custom_sdpa")
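Generation then works exactly as with a pre-exported model; here is a minimal follow-up that reuses the tokenizer and the text_generation call from the earlier snippet:

from transformers import AutoTokenizer

# Same generation API as with a pre-exported model
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
prompt = "Simply put, the theory of relativity states that"
print(model.text_generation(tokenizer=tokenizer, prompt=prompt, max_seq_len=45))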