Optimized MMS-TTS-ENG with ONNX Runtime

This repository contains an optimized version of the facebook/mms-tts-eng Text-to-Speech model for fast CPU inference using ONNX Runtime and dynamic quantization. It demonstrates how to convert the model to ONNX, quantize it, and run inference efficiently. It also includes an example of uploading the converted model and tokenizer to the Hugging Face Hub.

Features

  • ONNX Conversion: Converts the facebook/mms-tts-eng PyTorch model to ONNX format for optimized inference.
  • Dynamic Quantization: Applies dynamic quantization (float32 to int8) to reduce model size and improve CPU inference speed.
  • Fast CPU Inference: Leverages ONNX Runtime for efficient CPU-based speech generation.
  • Google Colab Compatible: Provides complete, runnable code examples for Google Colab.
  • Hugging Face Hub Integration: Includes code to upload the converted model and tokenizer to the Hugging Face Hub for easy sharing and deployment.
  • Seeded Generation: Includes an example of seeded generation; fixing the seed makes the output reproducible, while different seeds still produce different waveforms (see the sketch under Usage below).
  • Speed Comparison: Demonstrates how to compare the inference speed of the ONNX Runtime optimized model against the original PyTorch model compiled with torch.compile (see the benchmark sketch under Usage below).

Requirements

  • Python 3.7+
  • transformers
  • accelerate
  • scipy
  • onnxruntime
  • optimum
  • onnx
  • huggingface_hub

You can install the required packages using pip:

pip install --upgrade transformers accelerate scipy onnxruntime optimum onnx huggingface_hub
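
Usage

The snippets below sketch each step. File names such as mms_tts_eng.onnx and the opset version are illustrative choices, not fixed by this repo.

First, conversion to ONNX. A minimal sketch using torch.onnx.export: VITS returns a model-output object and contains stochastic sampling ops, so the model is wrapped to expose just the waveform tensor, and the export details may need adjustment for your transformers/torch versions.

```python
import torch
from transformers import AutoTokenizer, VitsModel

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
model.eval()


class VitsOnnxWrapper(torch.nn.Module):
    """Wraps VitsModel so the exported graph has a single waveform output."""

    def __init__(self, vits):
        super().__init__()
        self.vits = vits

    def forward(self, input_ids, attention_mask):
        return self.vits(input_ids=input_ids, attention_mask=attention_mask).waveform


# Dummy input used to trace the graph during export.
inputs = tokenizer("Hello, this is a test.", return_tensors="pt")

torch.onnx.export(
    VitsOnnxWrapper(model),
    (inputs["input_ids"], inputs["attention_mask"]),
    "mms_tts_eng.onnx",  # illustrative file name
    input_names=["input_ids", "attention_mask"],
    output_names=["waveform"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "waveform": {0: "batch", 1: "samples"},
    },
    opset_version=17,  # illustrative; pick an opset your onnxruntime supports
)
```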
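
Next, dynamic quantization. ONNX Runtime's quantize_dynamic converts float32 weights to int8 in place on the saved graph; the input and output paths below are the illustrative names from the previous step.

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize weights from float32 to int8 to shrink the model
# and speed up CPU inference.
quantize_dynamic(
    model_input="mms_tts_eng.onnx",
    model_output="mms_tts_eng_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```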
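
Then, CPU inference with ONNX Runtime. A minimal sketch assuming the input/output names chosen at export time; MMS-TTS generates audio at 16 kHz, which is used when writing the WAV file.

```python
import numpy as np
import onnxruntime as ort
import scipy.io.wavfile
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
session = ort.InferenceSession(
    "mms_tts_eng_quantized.onnx", providers=["CPUExecutionProvider"]
)

inputs = tokenizer("Text to speech with ONNX Runtime.", return_tensors="np")
waveform = session.run(
    None,
    {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    },
)[0]

# MMS-TTS outputs 16 kHz audio.
scipy.io.wavfile.write("output.wav", 16000, waveform.squeeze().astype(np.float32))
```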
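
For seeded generation, the PyTorch side is shown: VITS samples noise internally, so fixing the global seed makes a run repeatable, while a different seed yields a different (but equally valid) waveform.

```python
import torch
from transformers import AutoTokenizer, VitsModel

model = VitsModel.from_pretrained("facebook/mms-tts-eng").eval()
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("Reproducible speech.", return_tensors="pt")

# Fixing the seed makes VITS's internal sampling repeatable:
# the same seed always yields the same waveform.
torch.manual_seed(555)
with torch.no_grad():
    waveform = model(**inputs).waveform
```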
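
The speed comparison can be sketched as a simple wall-clock benchmark: warm up both backends (torch.compile in particular needs a first call to trigger compilation), then average over repeated runs.

```python
import time

import onnxruntime as ort
import torch
from transformers import AutoTokenizer, VitsModel

tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
text = "Benchmarking ONNX Runtime against PyTorch."

# PyTorch baseline with torch.compile.
pt_model = torch.compile(VitsModel.from_pretrained("facebook/mms-tts-eng").eval())
pt_inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    pt_model(**pt_inputs)  # warm-up: triggers compilation
    start = time.perf_counter()
    for _ in range(10):
        pt_model(**pt_inputs)
    pt_time = (time.perf_counter() - start) / 10

# ONNX Runtime on the quantized model.
session = ort.InferenceSession(
    "mms_tts_eng_quantized.onnx", providers=["CPUExecutionProvider"]
)
ort_inputs = tokenizer(text, return_tensors="np")
feed = {
    "input_ids": ort_inputs["input_ids"],
    "attention_mask": ort_inputs["attention_mask"],
}

session.run(None, feed)  # warm-up
start = time.perf_counter()
for _ in range(10):
    session.run(None, feed)
ort_time = (time.perf_counter() - start) / 10

print(f"PyTorch (compiled): {pt_time:.3f}s | ONNX Runtime (int8): {ort_time:.3f}s")
```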
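
Finally, uploading to the Hugging Face Hub. A minimal sketch with huggingface_hub; the repository name below is a hypothetical placeholder.

```python
from huggingface_hub import HfApi
from transformers import AutoTokenizer

repo_id = "your-username/mms-tts-eng-onnx"  # hypothetical repo name

api = HfApi()
api.create_repo(repo_id, exist_ok=True)

# Upload the quantized ONNX model.
api.upload_file(
    path_or_fileobj="mms_tts_eng_quantized.onnx",
    path_in_repo="model_quantized.onnx",
    repo_id=repo_id,
)

# Push the tokenizer so the repo is usable on its own.
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
tokenizer.push_to_hub(repo_id)
```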