Roxi-TTS v2: Indian-English voice (MOSS-TTS-Nano LoRA fine-tune)
A LoRA fine-tune of MOSS-TTS-Nano (0.1B, autoregressive audio-token + LLM, 48 kHz) that speaks Indian English as its default voice, with no reference clip required. Built for conversational and customer-support use.
Successor to IOTEverythin/voxi-tts (Kokoro-82M, EMNS). This v2 moves to the MOSS-TTS-Nano family and adapts the voice with LoRA: full fine-tuning catastrophically forgets on a 0.1B model, whereas LoRA adapts the voice while preserving the base model's intelligibility.
What it is
- Base: OpenMOSS-Team/MOSS-TTS-Nano (Apache-2.0) · audio tokenizer OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano (Apache-2.0)
- Method: LoRA (PEFT) with r=16, α=32, targets c_attn, c_proj, fc_in, fc_out (2.13% of parameters), BF16, merged into a full checkpoint.
- Output: 48 kHz mono.
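To make the r=16, α=32 setting more concrete, here is a small sketch of the standard LoRA arithmetic: the low-rank update is scaled by α/r, and each adapted linear layer gains r·(d_in + d_out) trainable parameters. The 768×768 layer shape is a hypothetical placeholder, not the real MOSS-TTS-Nano dimension, so the printed ratio is illustrative and will not reproduce the 2.13% figure above.

```python
# Back-of-the-envelope LoRA arithmetic.
# The layer shape below is HYPOTHETICAL and only illustrates the formula;
# it is not taken from the MOSS-TTS-Nano config.

def lora_scaling(alpha: float, r: int) -> float:
    """Scale applied to the low-rank update: delta_W = (alpha / r) * B @ A."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


scale = lora_scaling(alpha=32, r=16)
added = lora_param_count(d_in=768, d_out=768, r=16)  # hypothetical projection
frozen = 768 * 768                                   # weights left untouched

print("scaling factor:", scale)
print("added / frozen parameters for this layer:", added, "/", frozen)
```

With α set to twice the rank, the adapter update is applied at a scale of 2.0 before being merged into the base weights.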
Results (measured)
| Metric | Base MOSS | Roxi-TTS v2 (no reference) |
|---|---|---|
| Speaker similarity to target (WavLM-SV cosine) ↑ | 0.52 | 0.96 |
| Intelligibility WER (Whisper, on generated audio) ↓ | 0.26 | 0.26 (preserved) |
The voice became the target Indian-English speaker without a reference clip, with intelligibility unchanged.
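For readers unfamiliar with the two metrics in the table, the sketch below shows how each is typically defined. It is not the evaluation script used for this model card: the speaker embeddings (here plain lists) would come from a WavLM speaker-verification model and the hypothesis transcript from Whisper, and both are assumed to be computed elsewhere.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Toy inputs, not real model outputs.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(word_error_rate("your appointment is confirmed",
                      "your appointment was confirmed"))
```

A higher cosine means the generated voice is closer to the target speaker; a lower WER means the transcript of the generated audio is closer to the input text.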
Requirements
This repo's custom modeling code includes a cross-version compatibility fix, so it loads on
both transformers==4.57.1 and modern Transformers (tested 5.12.1); the older
TypeError: unsupported operand type(s) for |: 'list' and 'set' is resolved. Install:
pip install transformers torch torchaudio soundfile sentencepiece numpy huggingface_hub
# GPU (Blackwell/most NVIDIA), if needed:
# pip install torch==2.7.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
torchaudio is required (the modeling code imports it). The MISSING ..._lm_head.weight line in
the load log is cosmetic: those heads are tied weights, rebound to the embeddings on load.
For exact parity with the training environment you may still pin transformers==4.57.1.
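Before loading the model, it can help to confirm which versions are actually installed. The following is a minimal sketch using only the standard library; the package list mirrors the pip command above, and the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the pip install line in this section.
REQUIRED = [
    "transformers", "torch", "torchaudio", "soundfile",
    "sentencepiece", "numpy", "huggingface_hub",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIRED)
for name, ver in report.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If torchaudio shows as missing, install it before loading the model, since the modeling code imports it.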
Usage
import torch
from transformers import AutoModelForCausalLM
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "IOTEverythin/roxi-tts-v2", trust_remote_code=True, torch_dtype=torch.float32,
).to(device).eval()
res = model.inference(
    text="Welcome. Your appointment is confirmed for Monday at ten thirty in the morning.",
    output_audio_path="out.wav", mode="continuation",
    audio_tokenizer_type="moss-audio-tokenizer-nano",
    audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
    device=device, audio_repetition_penalty=1.1, use_kv_cache=True,
)
# res["sample_rate"] == 48000; audio written to out.wav
Tips: spell brand names phonetically (e.g. "Voz Vox") and avoid raw abbreviations ("in the
morning", not "A M"); write numbers as words. Trim trailing silence and re-run if a generation
comes out short (autoregressive models occasionally under-generate). Verified working on
transformers==4.57.1, torch==2.7.0.
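The trimming and re-run tips above could be sketched roughly as follows. Both helpers and their thresholds (`threshold=1e-3`, `min_seconds_per_word=0.15`) are hypothetical choices for illustration, not part of the model's API. Samples are assumed to be a plain sequence of floats in [-1, 1], for example as read back from out.wav with soundfile.

```python
def trim_trailing_silence(samples, threshold=1e-3):
    """Drop samples at the end whose absolute amplitude is below `threshold`."""
    end = len(samples)
    while end > 0 and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[:end]


def looks_truncated(samples, sample_rate, text, min_seconds_per_word=0.15):
    """Heuristic (assumed, not from the model card): flag audio that is
    implausibly short for the number of words in `text`."""
    duration = len(samples) / sample_rate
    return duration < min_seconds_per_word * len(text.split())


# Toy signal: half a second of "speech" followed by silence at 48 kHz.
audio = [0.2] * 24000 + [0.0] * 12000
trimmed = trim_trailing_silence(audio)
print(len(audio), "->", len(trimmed), "samples")
print("re-run suggested:", looks_truncated(trimmed, 48000, "ten thirty in the morning"))
```

A caller could loop on `model.inference(...)` a small, fixed number of times while `looks_truncated` returns True, then keep the longest result.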
Training data & attribution
- Dataset: IIT-Madras Indic TTS, English (Indian-English) subset, via the SPRINGLab/IndicTTS-English Hugging Face mirror (studio 48 kHz read speech).
- The fine-tune was trained on a single-speaker subset of that corpus.
Required notice (IIT-M Indic TTS End User License Agreement):
COPYRIGHT 2016 TTS Consortium, TDIL, Meity, represented by Hema A. Murthy & S. Umesh, Department of Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.
The Indic TTS EULA grants a royalty-free, worldwide license to create and freely distribute derivative works (such as this model). See https://www.iitm.ac.in/donlab/indictts/ for the dataset and full license.
Limitations & responsible use
- Trained on a single read-speech speaker; neutral style. Style/emotion control is not reliable yet (instruction-conditioning is wired but needs style-labeled training).
- Telephony (8 kHz) quality not separately tuned; evaluate before production.
- Voice likeness: this voice is derived from a real dataset speaker. Do not use it to impersonate any real person, for fraud, deception, or any unlawful/harmful purpose. Disclose AI-generated audio where required. The authors provide the weights "as is", without warranty.
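To evaluate the telephony case mentioned above, the 48 kHz output first has to be brought down to 8 kHz, a factor of 6. The sketch below does this in pure Python by averaging each block of six samples, which is a deliberately crude low-pass-and-decimate step. A proper resampler such as `torchaudio.functional.resample(waveform, orig_freq, new_freq)` is likely a better choice for any real evaluation.

```python
def downsample_by_averaging(samples, factor=6):
    """Crude decimation: replace each block of `factor` samples with its mean.

    Averaging acts as a simple low-pass filter before the rate reduction.
    Any leftover samples that do not fill a whole block are dropped.
    """
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


one_second_48k = [0.0] * 48000            # placeholder for one second of real audio
one_second_8k = downsample_by_averaging(one_second_48k, factor=48000 // 8000)
print(len(one_second_48k), "->", len(one_second_8k), "samples per second")
```

Listening tests or a WER check on the 8 kHz audio would then indicate whether quality holds up on a narrowband channel.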
License
- This model's LoRA/code: Apache-2.0 (matching the base model).
- Derived from MOSS-TTS-Nano (Apache-2.0) and IIT-M Indic TTS data (notice above retained).