Roxi-TTS v2 — ONNX / CPU build

ONNX export of IOTEverythin/roxi-tts-v2 (MOSS-TTS-Nano LoRA, Indian-English) for CPU / on-device inference via onnxruntime — no PyTorch required at inference. 48 kHz output.

Use this ONNX build for fast CPU serving. For maximum voice fidelity, use the PyTorch build IOTEverythin/roxi-tts-v2 (see quality note).

Browser/CPU ONNX bundle exported with MOSS-TTS-Nano's official exporter: moss_tts_prefill.onnx, moss_tts_decode_step.onnx, moss_tts_local_* graphs, shared external data (*_shared.data), tokenizer.model, browser_poc_manifest.json, tts_browser_onnx_meta.json. The audio codec is the official Apache-2.0 OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX (not re-hosted here).

Usage (onnxruntime, CPU)

Inference uses the runtime script from the MOSS-TTS-Nano repo (infer_onnx.py). Place both asset directories under one parent directory, as <dir>/MOSS-TTS-Nano-100M-ONNX and <dir>/MOSS-Audio-Tokenizer-Nano-ONNX, and pass that parent as --model-dir:

pip install "onnxruntime>=1.20" soundfile numpy
git clone https://github.com/OpenMOSS/MOSS-TTS-Nano && cd MOSS-TTS-Nano
huggingface-cli download IOTEverythin/roxi-tts-v2-onnx --local-dir assets/MOSS-TTS-Nano-100M-ONNX
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX --local-dir assets/MOSS-Audio-Tokenizer-Nano-ONNX

python infer_onnx.py --model-dir assets \
  --text "Welcome. Your appointment is confirmed for Monday at ten thirty in the morning." \
  --output-audio-path out.wav --execution-provider cpu --cpu-threads 8 \
  --sample-mode full --audio-temperature 0.8 --audio-top-p 0.9 --audio-repetition-penalty 1.1 \
  --disable-wetext-processing
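Before running the command above, it can help to confirm that both downloads landed in the expected layout. The following is a minimal sketch, not part of the MOSS-TTS-Nano runtime: the helper `missing_assets` is hypothetical, the file names are taken from the bundle description in this card, and it assumes those files sit at the top level of the model directory and that the codec directory holds at least one top-level `.onnx` file.

```python
from pathlib import Path

# Directory names as given in this card.
MODEL_SUBDIR = "MOSS-TTS-Nano-100M-ONNX"
CODEC_SUBDIR = "MOSS-Audio-Tokenizer-Nano-ONNX"

# File names listed in the bundle description; their top-level location is an assumption.
EXPECTED_MODEL_FILES = [
    "moss_tts_prefill.onnx",
    "moss_tts_decode_step.onnx",
    "tokenizer.model",
    "browser_poc_manifest.json",
    "tts_browser_onnx_meta.json",
]


def missing_assets(model_dir):
    """Return a list of expected assets not found under model_dir (hypothetical helper)."""
    root = Path(model_dir)
    missing = []

    # Check each named file of the TTS bundle.
    for name in EXPECTED_MODEL_FILES:
        if not (root / MODEL_SUBDIR / name).is_file():
            missing.append(f"{MODEL_SUBDIR}/{name}")

    # The codec file names are not listed in this card, so only look for any .onnx graph.
    codec_root = root / CODEC_SUBDIR
    if not codec_root.is_dir() or not any(codec_root.glob("*.onnx")):
        missing.append(f"{CODEC_SUBDIR}/*.onnx")

    return missing
```

Calling `missing_assets("assets")` from the repository root should return an empty list when the layout matches; any returned entries name the pieces to re-download.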

Recommended sampling (tuned for voice fidelity): --sample-mode full --audio-temperature 0.8 --audio-top-p 0.9 --audio-repetition-penalty 1.1. (Greedy collapses the voice; the default fixed mode is lower fidelity.)
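To illustrate what the three sampling flags control, here is a generic sketch of temperature scaling, a repetition penalty, and top-p (nucleus) filtering applied to one step of logits. It is an assumption-laden illustration of the common formulations, not the code used by infer_onnx.py; the function name `sample_token` and its signature are hypothetical.

```python
import math
import random


def sample_token(logits, history, temperature=0.8, top_p=0.9,
                 repetition_penalty=1.1, rng=random):
    """Sample one token id from logits (generic sketch, not the MOSS runtime)."""
    # Repetition penalty (common formulation): make already-emitted tokens less likely.
    adjusted = list(logits)
    for tok in set(history):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in adjusted]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Greedy decoding corresponds to always taking the single most likely token, which removes the randomness that this kind of sampling keeps.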

Quality (measured, CPU, vs the PyTorch build)

Build	Speaker similarity to target ↑	Intelligibility WER ↓
roxi-tts-v2 (PyTorch)	0.96	0.26
roxi-tts-v2-onnx (this, tuned)	0.73	0.25
base MOSS (reference)	0.52	0.26
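For readers unfamiliar with the intelligibility column, word error rate is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below implements that standard definition; this card does not state which speech recognizer or text normalization produced the figures in the table, so the lowercasing and whitespace splitting here are assumptions, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length (standard WER definition)."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

One substituted word in a four-word reference gives a WER of 0.25 under this definition.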

Intelligibility is on par with PyTorch (WER ~0.25).
Voice identity is somewhat weaker than PyTorch (0.73 vs 0.96), because the ONNX streaming/fixed-frame decode and fp32 arithmetic only approximate the PyTorch generation path. It is still clearly the fine-tuned Indian-English voice (well above the base model's 0.52). Use the PyTorch build if you need full fidelity.
Speed: ONNX is the faster CPU path once loaded in a persistent server. Independent reports put the real-time factor (RTF) at roughly 0.23–0.28 with 8 threads, about 2× the speed of the PyTorch CPU path.
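The real-time factor quoted above is conventionally synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. A minimal sketch for checking it on your own hardware follows; `real_time_factor` is a hypothetical helper, and the 48 kHz default comes from the output rate stated in this card.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate=48_000):
    """Synthesis time divided by audio duration; lower is faster."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds
```

For example, generating 5 seconds of 48 kHz audio (240,000 samples) in 1.25 seconds gives an RTF of 0.25. Time only the synthesis call in an already-loaded process, since model loading would otherwise dominate short utterances.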

Attribution & license

Apache-2.0 (matching base). Built on MOSS-TTS-Nano (Apache-2.0) + audio tokenizer (Apache-2.0).
Training data: IIT-Madras Indic TTS (English) via SPRINGLab/IndicTTS-English. Required notice: "COPYRIGHT 2016 TTS Consortium, TDIL, Meity — Hema A. Murthy & S. Umesh — IIT Madras. ALL RIGHTS RESERVED."
Responsible use: derived from a real dataset speaker — do not use to impersonate real people, for fraud, or deception; disclose AI-generated audio where required.


