Roxi-TTS v2: ONNX / CPU build

ONNX export of IOTEverythin/roxi-tts-v2 (MOSS-TTS-Nano LoRA, Indian-English) for CPU / on-device inference via onnxruntime; no PyTorch is required at inference. 48 kHz output.

Use this ONNX build for fast CPU serving. For maximum voice fidelity, use the PyTorch build IOTEverythin/roxi-tts-v2 (see quality note).

Contents

Browser/CPU ONNX bundle exported with MOSS-TTS-Nano's official exporter: moss_tts_prefill.onnx, moss_tts_decode_step.onnx, moss_tts_local_* graphs, shared external data (*_shared.data), tokenizer.model, browser_poc_manifest.json, tts_browser_onnx_meta.json. The audio codec is the official Apache-2.0 OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX (not re-hosted here).
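A quick pre-flight check can catch an incomplete download before the runtime fails on a missing graph. The sketch below only looks for the file names listed above. The helper name `missing_bundle_files` is hypothetical, and `moss_tts_local_*` and `*_shared.data` are treated as glob patterns because their exact file names are not spelled out here.

```python
from pathlib import Path

# File names taken from the bundle listing above.
REQUIRED_FILES = [
    "moss_tts_prefill.onnx",
    "moss_tts_decode_step.onnx",
    "tokenizer.model",
    "browser_poc_manifest.json",
    "tts_browser_onnx_meta.json",
]
# Patterns stand in for files whose exact names are not listed (assumption).
REQUIRED_GLOBS = ["moss_tts_local_*", "*_shared.data"]


def missing_bundle_files(bundle_dir):
    """Return the required names/patterns that have no match in bundle_dir."""
    root = Path(bundle_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [pat for pat in REQUIRED_GLOBS if not any(root.glob(pat))]
    return missing
```

An empty list from `missing_bundle_files("assets/MOSS-TTS-Nano-100M-ONNX")` suggests the download is complete. It does not validate file contents or sizes.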

Usage (onnxruntime, CPU)

Inference uses the runtime from the MOSS-TTS-Nano repo (infer_onnx.py). Place the assets in two sibling directories, <dir>/MOSS-TTS-Nano-100M-ONNX and <dir>/MOSS-Audio-Tokenizer-Nano-ONNX, and pass <dir> as --model-dir:

pip install "onnxruntime>=1.20" soundfile numpy
git clone https://github.com/OpenMOSS/MOSS-TTS-Nano && cd MOSS-TTS-Nano
huggingface-cli download IOTEverythin/roxi-tts-v2-onnx --local-dir assets/MOSS-TTS-Nano-100M-ONNX
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX --local-dir assets/MOSS-Audio-Tokenizer-Nano-ONNX

python infer_onnx.py --model-dir assets \
  --text "Welcome. Your appointment is confirmed for Monday at ten thirty in the morning." \
  --output-audio-path out.wav --execution-provider cpu --cpu-threads 8 \
  --sample-mode full --audio-temperature 0.8 --audio-top-p 0.9 --audio-repetition-penalty 1.1 \
  --disable-wetext-processing

Recommended sampling (tuned for voice fidelity): --sample-mode full --audio-temperature 0.8 --audio-top-p 0.9 --audio-repetition-penalty 1.1. (Greedy collapses the voice; the default fixed mode is lower fidelity.)
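For batch jobs it can be convenient to drive the same command from Python. The following is a minimal sketch that rebuilds the command line shown above with the recommended sampling flags. The helper names `build_infer_command` and `synthesize` are hypothetical, and the script is assumed to run from the root of the MOSS-TTS-Nano checkout.

```python
import subprocess
import sys

# Sampling flags copied from the recommendation above.
RECOMMENDED_SAMPLING = [
    "--sample-mode", "full",
    "--audio-temperature", "0.8",
    "--audio-top-p", "0.9",
    "--audio-repetition-penalty", "1.1",
]


def build_infer_command(text, output_path, model_dir="assets", cpu_threads=8):
    """Assemble the infer_onnx.py argument list from the usage example."""
    return [
        sys.executable, "infer_onnx.py",
        "--model-dir", model_dir,
        "--text", text,
        "--output-audio-path", output_path,
        "--execution-provider", "cpu",
        "--cpu-threads", str(cpu_threads),
        *RECOMMENDED_SAMPLING,
        "--disable-wetext-processing",
    ]


def synthesize(text, output_path, **kwargs):
    """Run infer_onnx.py in the current directory; raises if it exits non-zero."""
    subprocess.run(build_infer_command(text, output_path, **kwargs), check=True)
```

Each call starts a new process and reloads the models, so this suits offline batches rather than low-latency serving.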

Quality (measured, CPU, vs the PyTorch build)

| Build | Speaker similarity to target ↑ | Intelligibility WER ↓ |
|---|---|---|
| roxi-tts-v2 (PyTorch) | 0.96 | 0.26 |
| roxi-tts-v2-onnx (this, tuned) | 0.73 | 0.25 |
| base MOSS (reference) | 0.52 | 0.26 |
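For readers unfamiliar with the WER column: word error rate is conventionally the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. The sketch below implements that standard definition. This card does not state which speech recognizer or text normalization produced the numbers in the table, so the lowercasing and whitespace tokenization here are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()  # normalization is an assumption
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; rows are updated one reference word at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

One substituted word in a four-word reference gives a WER of 0.25.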
  • Intelligibility is on par with PyTorch (WER ~0.25).
  • Voice identity is somewhat weaker than PyTorch (0.73 vs 0.96): the ONNX streaming/fixed-frame decode and fp32 arithmetic only approximate the PyTorch generation path. It is still clearly the fine-tuned Indian-English voice (well above the base model's 0.52). Use the PyTorch build if you need full fidelity.
  • Speed: ONNX is the faster CPU path once loaded in a persistent server (independent reports give RTF ~0.23–0.28 at 8 threads, roughly 2× faster than the PyTorch CPU path).
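The real-time factor (RTF) quoted above is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below is one way to measure it on your own hardware. The helper names are hypothetical, and the 48 kHz default comes from the output rate stated for this build.

```python
import time

SAMPLE_RATE_HZ = 48_000  # output rate stated for this build


def real_time_factor(elapsed_seconds, num_samples, sample_rate=SAMPLE_RATE_HZ):
    """RTF = synthesis time / duration of generated audio (lower is faster)."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return elapsed_seconds / (num_samples / sample_rate)


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, 2.5 seconds spent generating 480,000 samples (10 seconds of audio) is an RTF of 0.25. Time only the synthesis call, after the models are loaded, to compare against figures reported for a persistent server.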

Attribution & license

  • Apache-2.0 (matching base). Built on MOSS-TTS-Nano (Apache-2.0) + audio tokenizer (Apache-2.0).
  • Training data: IIT-Madras Indic TTS (English) via SPRINGLab/IndicTTS-English. Required notice: "COPYRIGHT 2016 TTS Consortium, TDIL, Meity - Hema A. Murthy & S. Umesh - IIT Madras. ALL RIGHTS RESERVED."
  • Responsible use: derived from a real dataset speaker; do not use to impersonate real people, for fraud, or deception; disclose AI-generated audio where required.