Roxi-TTS v2 — Indian-English voice (MOSS-TTS-Nano LoRA fine-tune)

A LoRA fine-tune of MOSS-TTS-Nano (0.1B, autoregressive audio-token + LLM, 48 kHz) that speaks Indian English as its default voice — no reference clip required. Built for conversational / customer-support use.

Successor to IOTEverythin/voxi-tts (Kokoro-82M, EMNS). This v2 moves to the MOSS-TTS-Nano family and adapts the voice with LoRA instead of full fine-tuning: on a 0.1B model, full fine-tuning leads to catastrophic forgetting, whereas LoRA adapts the voice while preserving the base model's intelligibility.

What it is

Base: OpenMOSS-Team/MOSS-TTS-Nano (Apache-2.0) · audio tokenizer OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano (Apache-2.0)
Method: LoRA (PEFT) with r=16, α=32, target modules c_attn, c_proj, fc_in, fc_out (2.13% of parameters), BF16, with the adapter merged into a full checkpoint.
Output: 48 kHz mono.
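
As an illustration of what merging a LoRA adapter into a full checkpoint means, the sketch below applies the standard LoRA update W' = W + (α/r)·B·A to a toy weight matrix and checks that the merged weight gives the same output as keeping the adapter separate. Only r=16 and α=32 come from this card; the matrix sizes and random values are made up for the example, and this is not the training or merge script used for this model.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_merge(w, a, b, r, alpha):
    """Return the merged weight W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def rand_matrix(rows, cols):
    """Random matrix with entries in [-1, 1] (toy data only)."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_in, d_out = 4, 3        # toy sizes, not the real layer shapes
r, alpha = 16, 32         # rank and alpha as listed in this card

w = rand_matrix(d_out, d_in)   # frozen base weight
a = rand_matrix(r, d_in)       # LoRA "A" (down-projection)
b = rand_matrix(d_out, r)      # LoRA "B" (up-projection)
x = rand_matrix(d_in, 1)       # input as a column vector

# Adapter kept separate: y = W x + (alpha / r) * B (A x)
wx = matmul(w, x)
bax = matmul(b, matmul(a, x))
y_adapter = [[wx[i][0] + (alpha / r) * bax[i][0]] for i in range(d_out)]

# Adapter merged into the weight: y = W' x
y_merged = matmul(lora_merge(w, a, b, r, alpha), x)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(y_adapter, y_merged))
print(f"max difference between adapter and merged outputs: {max_diff:.2e}")
```

Because the merged checkpoint already contains the update, it loads like an ordinary model and needs no adapter library at inference time.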

Results (measured)

Metric	Base MOSS	Roxi-TTS v2 (no reference)
Speaker similarity to target (WavLM-SV cosine) ↑	0.52	0.96
Intelligibility WER (Whisper, on generated audio) ↓	0.26	0.26 (preserved)

The voice became the target Indian-English speaker without a reference clip, with intelligibility unchanged.
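
For readers who want to run a similar check, here is a minimal sketch of the two metric formulas: word error rate as word-level edit distance over reference length, and cosine similarity between two embedding vectors. It assumes the Whisper transcript and the WavLM-SV embeddings have already been computed elsewhere; the text normalization (lowercasing and whitespace splitting) is an assumption of this sketch, since the card does not say how the reported numbers were normalized.

```python
import math

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm

if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(wer("your appointment is confirmed", "your appointment was confirmed"))
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```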

Requirements

This repo's custom modeling code includes a cross-version compatibility fix, so it loads on both transformers==4.57.1 and modern Transformers (tested with 5.12.1). The earlier TypeError: unsupported operand type(s) for |: 'list' and 'set' is resolved. Install:

pip install transformers torch torchaudio soundfile sentencepiece numpy huggingface_hub
# GPU (Blackwell/most NVIDIA), if needed:
#   pip install torch==2.7.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

torchaudio is required (the modeling code imports it). The MISSING ..._lm_head.weight line in the load log is cosmetic — those heads are tied weights, rebound to the embeddings on load. For exact parity with the training environment you may still pin transformers==4.57.1.
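
Before loading the model, it can help to confirm which versions are actually installed. The following is a small sketch that uses only the standard library; the package list mirrors the pip command above, and the helper name installed_versions is made up for this example.

```python
from importlib import metadata

# Package names taken from the pip install line in this card.
REQUIRED = ["transformers", "torch", "torchaudio", "soundfile",
            "sentencepiece", "numpy", "huggingface_hub"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:16s} {version or 'MISSING'}")
```

A missing torchaudio entry here would explain an import failure when the custom modeling code loads.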

Usage

import torch
from transformers import AutoModelForCausalLM

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
# trust_remote_code=True is required because this repo ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "IOTEverythin/roxi-tts-v2", trust_remote_code=True, torch_dtype=torch.float32,
).to(device).eval()

# Synthesize the text and write the audio to out.wav; no reference clip is passed.
res = model.inference(
    text="Welcome. Your appointment is confirmed for Monday at ten thirty in the morning.",
    output_audio_path="out.wav", mode="continuation",
    audio_tokenizer_type="moss-audio-tokenizer-nano",
    audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
    device=device, audio_repetition_penalty=1.1, use_kv_cache=True,
)
# res["sample_rate"] == 48000; audio written to out.wav

Tips: spell brand names phonetically (e.g. "Voz Vox"), avoid raw abbreviations (write "in the morning", not "A M"), and write numbers as words. If a generation comes out short, re-run it (autoregressive models occasionally under-generate), and trim trailing silence from the output. Verified working on transformers==4.57.1, torch==2.7.0.
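
The tips above can be sketched as a few small helpers. Everything here is illustrative and not part of this repo: the function names, the 0 to 999 number range, the silence threshold, and the minimum-duration check are assumptions, and generate stands for any callable of yours that returns a sequence of audio samples (for example, a wrapper that runs inference and reads the output file back).

```python
import re

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] if ones == 0 else f"{_TENS[tens]} {_ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{_ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} and {number_to_words(rest)}"

def prepare_text(text: str, replacements=None) -> str:
    """Apply whole-word replacements, then spell out standalone 1-3 digit integers.

    Keys in `replacements` should be plain words (e.g. "AM"). Decimals, times
    such as 10:30, and longer numbers are not handled here.
    """
    for raw, spoken in (replacements or {}).items():
        text = re.sub(rf"\b{re.escape(raw)}\b", spoken, text)
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)

def trim_trailing_silence(samples, threshold=1e-3):
    """Drop trailing samples whose absolute value is below `threshold`."""
    end = len(samples)
    while end > 0 and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[:end]

def generate_with_retry(generate, min_seconds, sample_rate=48000, max_attempts=3):
    """Call `generate()` until the trimmed audio is at least `min_seconds` long.

    Returns the last attempt if none is long enough.
    """
    samples = []
    for _ in range(max_attempts):
        samples = trim_trailing_silence(generate())
        if len(samples) >= min_seconds * sample_rate:
            break
    return samples

if __name__ == "__main__":
    print(prepare_text("Token 42 is valid until 9 AM.", {"AM": "in the morning"}))
```

A reasonable min_seconds depends on the text length and speaking rate, so it is best estimated from a few known-good generations.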

Training data & attribution

Dataset: IIT-Madras Indic TTS — English (Indian-English) subset, via the SPRINGLab/IndicTTS-English Hugging Face mirror (studio 48 kHz read speech).
The fine-tune was trained on a single-speaker subset of that corpus.

Required notice (IIT-M Indic TTS End User License Agreement):

COPYRIGHT 2016 TTS Consortium, TDIL, Meity — represented by Hema A. Murthy & S. Umesh, Department of Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.

The Indic TTS EULA grants a royalty-free, worldwide license to create and freely distribute derivative works (such as this model). See https://www.iitm.ac.in/donlab/indictts/ for the dataset and full license.

Limitations & responsible use

Trained on a single read-speech speaker; neutral style. Style/emotion control is not reliable yet (instruction-conditioning is wired but needs style-labeled training).
Telephony (8 kHz) quality has not been separately tuned; evaluate before production use.
Voice likeness: this voice is derived from a real dataset speaker. Do not use it to impersonate any real person, for fraud, deception, or any unlawful/harmful purpose. Disclose AI-generated audio where required. The authors provide the weights "as is", without warranty.
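
For a first listening test at telephony bandwidth, the 48 kHz output can be reduced to 8 kHz, an exact 6:1 ratio. The sketch below is deliberately crude and is an assumption of this example, not something the repo provides: it averages each block of six samples, which acts as a rough low-pass filter before decimation. A real evaluation should use a proper resampler and the codec of the target telephony stack.

```python
def downsample_48k_to_8k(samples):
    """Crude 6:1 decimation: average each block of 6 samples (boxcar low-pass).

    Any incomplete block at the end is dropped.
    """
    factor = 48000 // 8000  # 6
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]

if __name__ == "__main__":
    one_second = [0.0] * 48000
    print(len(downsample_48k_to_8k(one_second)))  # number of samples at 8 kHz
```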

License

This model's LoRA/code: Apache-2.0 (matching the base model).
Derived from MOSS-TTS-Nano (Apache-2.0) and IIT-M Indic TTS data (notice above retained).
