Roxi-TTS v2 - ONNX / CPU build
ONNX export of IOTEverythin/roxi-tts-v2
(MOSS-TTS-Nano LoRA, Indian-English) for CPU / on-device inference via onnxruntime;
no PyTorch is required at inference. Output is 48 kHz audio.
Use this ONNX build for fast CPU serving. For maximum voice fidelity, use the PyTorch build
IOTEverythin/roxi-tts-v2 (see the quality note).
Contents
Browser/CPU ONNX bundle exported with MOSS-TTS-Nano's official exporter:
moss_tts_prefill.onnx, moss_tts_decode_step.onnx, moss_tts_local_* graphs,
shared external data (*_shared.data), tokenizer.model, browser_poc_manifest.json,
tts_browser_onnx_meta.json. The audio codec is the official Apache-2.0
OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX
(not re-hosted here).
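After downloading, a quick check that the bundle is complete can save a confusing runtime error. The sketch below only uses the file names and patterns listed above. That every pattern must match at least one file is an assumption, and `missing_assets` is a hypothetical helper, not part of the MOSS-TTS-Nano tooling.

```python
from pathlib import Path

# File names and glob patterns copied from the Contents list.
# Assumption: each pattern should match at least one file in the bundle.
EXPECTED_FILES = [
    "moss_tts_prefill.onnx",
    "moss_tts_decode_step.onnx",
    "tokenizer.model",
    "browser_poc_manifest.json",
    "tts_browser_onnx_meta.json",
]
EXPECTED_PATTERNS = ["moss_tts_local_*", "*_shared.data"]


def missing_assets(bundle_dir):
    """Return the expected names and patterns with no match in bundle_dir."""
    root = Path(bundle_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [pat for pat in EXPECTED_PATTERNS if not any(root.glob(pat))]
    return missing
```

An empty list means every listed asset was found; otherwise the returned entries show what to re-download.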
Usage (onnxruntime, CPU)
Inference uses the infer_onnx.py runtime from the MOSS-TTS-Nano repo.
Lay the assets out as <dir>/MOSS-TTS-Nano-100M-ONNX and <dir>/MOSS-Audio-Tokenizer-Nano-ONNX:
pip install "onnxruntime>=1.20" soundfile numpy
git clone https://github.com/OpenMOSS/MOSS-TTS-Nano && cd MOSS-TTS-Nano
huggingface-cli download IOTEverythin/roxi-tts-v2-onnx --local-dir assets/MOSS-TTS-Nano-100M-ONNX
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX --local-dir assets/MOSS-Audio-Tokenizer-Nano-ONNX
python infer_onnx.py --model-dir assets \
--text "Welcome. Your appointment is confirmed for Monday at ten thirty in the morning." \
--output-audio-path out.wav --execution-provider cpu --cpu-threads 8 \
--sample-mode full --audio-temperature 0.8 --audio-top-p 0.9 --audio-repetition-penalty 1.1 \
--disable-wetext-processing
Recommended sampling (tuned for voice fidelity): --sample-mode full --audio-temperature 0.8 --audio-top-p 0.9 --audio-repetition-penalty 1.1. Greedy decoding collapses the voice, and the default
fixed mode gives lower fidelity.
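To call the same command from Python with the recommended sampling settings baked in, a thin wrapper around `subprocess` is one option. This is a minimal sketch: the flags are exactly those shown in the command above, while `build_infer_command` and `synthesize` are hypothetical helper names, and the script is assumed to run from the root of the cloned MOSS-TTS-Nano repo.

```python
import subprocess
import sys


def build_infer_command(text, output_path, model_dir="assets", cpu_threads=8):
    """Build the argv list for infer_onnx.py with the recommended sampling flags."""
    return [
        sys.executable, "infer_onnx.py",
        "--model-dir", model_dir,
        "--text", text,
        "--output-audio-path", output_path,
        "--execution-provider", "cpu",
        "--cpu-threads", str(cpu_threads),
        "--sample-mode", "full",
        "--audio-temperature", "0.8",
        "--audio-top-p", "0.9",
        "--audio-repetition-penalty", "1.1",
        "--disable-wetext-processing",
    ]


def synthesize(text, output_path, repo_dir="."):
    """Run infer_onnx.py inside repo_dir; raises CalledProcessError on failure."""
    subprocess.run(build_infer_command(text, output_path), cwd=repo_dir, check=True)
```

Each call starts a new process and reloads the model, so this suits batch jobs rather than low-latency serving.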
Quality (measured, CPU, vs the PyTorch build)
| Build | Speaker similarity to target ↑ | Intelligibility WER ↓ |
|---|---|---|
| roxi-tts-v2 (PyTorch) | 0.96 | 0.26 |
| roxi-tts-v2-onnx (this, tuned) | 0.73 | 0.25 |
| base MOSS (reference) | 0.52 | 0.26 |
- Intelligibility is on par with PyTorch (WER ~0.25).
- Voice identity is somewhat weaker than PyTorch (0.73 vs 0.96), because the ONNX streaming/fixed-frame decode and fp32 execution only approximate the PyTorch generation path. It is still clearly the fine-tuned Indian-English voice (well above the base model's 0.52). Use the PyTorch build if you need full fidelity.
- Speed: ONNX is the faster CPU path once loaded in a persistent server (independent reports put RTF at roughly 0.23 to 0.28 with 8 threads, about 2× the speed of the PyTorch CPU path).
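For readers who want to reproduce the speed figure on their own hardware, the real-time factor (RTF) is the synthesis time divided by the duration of the audio produced. The sketch below assumes you time the synthesis yourself and count the output samples; `real_time_factor` is a hypothetical helper, and the 48 kHz default comes from the output rate stated for this build.

```python
SAMPLE_RATE_HZ = 48_000  # output rate stated for this build


def real_time_factor(synthesis_seconds, num_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate_hz
    if audio_seconds <= 0:
        raise ValueError("audio must contain at least one sample")
    return synthesis_seconds / audio_seconds
```

An RTF below 1 means synthesis runs faster than playback. As an illustration only (not a measurement), 5 seconds of audio generated in 1.25 seconds gives an RTF of 0.25.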
Attribution & license
- Apache-2.0 (matching base). Built on MOSS-TTS-Nano (Apache-2.0) + audio tokenizer (Apache-2.0).
- Training data: IIT-Madras Indic TTS (English) via SPRINGLab/IndicTTS-English. Required notice: "COPYRIGHT 2016 TTS Consortium, TDIL, Meity - Hema A. Murthy & S. Umesh - IIT Madras. ALL RIGHTS RESERVED."
- Responsible use: the voice is derived from a real dataset speaker. Do not use it to impersonate real people or for fraud or deception, and disclose AI-generated audio where required.
Model tree for IOTEverythin/roxi-tts-v2-onnx
Base model: OpenMOSS-Team/MOSS-TTS-Nano-100M