Roxi-TTS v2: Indian-English voice (MOSS-TTS-Nano LoRA fine-tune)

A LoRA fine-tune of MOSS-TTS-Nano (0.1B, autoregressive audio-token + LLM, 48 kHz) that speaks Indian English as its default voice, with no reference clip required. Built for conversational / customer-support use.

Successor to IOTEverythin/voxi-tts (Kokoro-82M, EMNS). This v2 moves to the MOSS-TTS-Nano family and adapts the voice with LoRA: full fine-tuning causes catastrophic forgetting on a 0.1B model, whereas LoRA adapts the voice while preserving the base model's intelligibility.

What it is

  • Base: OpenMOSS-Team/MOSS-TTS-Nano (Apache-2.0) · audio tokenizer OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano (Apache-2.0)
  • Method: LoRA (PEFT) with r=16, α=32, targets c_attn, c_proj, fc_in, fc_out (2.13% of params), BF16, merged into a full checkpoint.
  • Output: 48 kHz mono.
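
To make the LoRA settings above concrete, here is a minimal pure-Python sketch of what r and α control and what "merged into a full checkpoint" means arithmetically. It assumes the standard LoRA formulation, in which the update is scaled by α/r and merged as W + (α/r)·B·A. The function names and the 768-wide layer size are hypothetical and chosen for illustration only; this is not the training code, and it does not reproduce the 2.13% figure.

```python
def lora_scaling(alpha: float, r: int) -> float:
    """Scale applied to the low-rank update (standard LoRA: alpha / r)."""
    return alpha / r


def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A) for small nested-list matrices."""
    r = len(A)  # rank is the number of rows in A
    scale = lora_scaling(alpha, r)
    merged = [row[:] for row in W]  # copy so the base weights stay untouched
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged


# With the settings on this card, the update is scaled by 32 / 16 = 2.
print(lora_scaling(32, 16))
# Hypothetical 768 x 768 projection: extra trainable parameters at r=16.
print(lora_extra_params(768, 768, 16))
```

After merging, the adapter matrices are no longer needed at inference time, which is why the model loads as a single checkpoint without PEFT.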

Results (measured)

Metric                                               Base MOSS   Roxi-TTS v2 (no reference)
Speaker similarity to target (WavLM-SV cosine) ↑     0.52        0.96
Intelligibility WER (Whisper, on generated audio) ↓  0.26        0.26 (preserved)

The voice became the target Indian-English speaker without a reference clip, with intelligibility unchanged.
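
For readers who want to run a similar check, the two metrics can be sketched in plain Python: cosine similarity between two speaker-embedding vectors, and word error rate as word-level edit distance divided by the reference length. This is an illustrative sketch, not the evaluation script used for the table. The text normalization (lowercasing, stripping punctuation) is an assumption, and extracting WavLM-SV embeddings and Whisper transcripts is left out.

```python
import math
import re


def normalize(text):
    """Lowercase and strip punctuation before scoring (an assumed convention)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice the reference would be the input text, the hypothesis the Whisper transcript of the generated audio, and the vectors the speaker embeddings of the generated and target audio.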

Requirements

This repo's custom modeling code includes a cross-version compatibility fix, so it loads on both transformers==4.57.1 and recent Transformers releases (tested with 5.12.1). The earlier error, TypeError: unsupported operand type(s) for |: 'list' and 'set', is resolved. Install:

pip install transformers torch torchaudio soundfile sentencepiece numpy huggingface_hub
# GPU (Blackwell/most NVIDIA), if needed:
#   pip install torch==2.7.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

torchaudio is required (the modeling code imports it). The MISSING ..._lm_head.weight line in the load log is cosmetic: those heads are tied weights, rebound to the embeddings on load. For exact parity with the training environment you may still pin transformers==4.57.1.
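
Because a missing torchaudio only surfaces when the remote modeling code is imported, it can help to check the environment first. The sketch below uses only the standard library; the helper names are hypothetical, and the package list simply mirrors the pip command above.

```python
from importlib import metadata
from importlib.util import find_spec

# Same packages as the pip install line; import names match the pip names here.
REQUIRED = [
    "transformers", "torch", "torchaudio", "soundfile",
    "sentencepiece", "numpy", "huggingface_hub",
]


def missing_packages(names):
    """Return the names that cannot be found, without importing them."""
    return [name for name in names if find_spec(name) is None]


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in missing_packages(REQUIRED):
    print(f"missing: {name}")
print("transformers version:", installed_version("transformers"))
```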

Usage

import torch
from transformers import AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "IOTEverythin/roxi-tts-v2", trust_remote_code=True, torch_dtype=torch.float32,
).to(device).eval()

res = model.inference(
    text="Welcome. Your appointment is confirmed for Monday at ten thirty in the morning.",
    output_audio_path="out.wav", mode="continuation",
    audio_tokenizer_type="moss-audio-tokenizer-nano",
    audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
    device=device, audio_repetition_penalty=1.1, use_kv_cache=True,
)
# res["sample_rate"] == 48000; audio written to out.wav

Tips: spell brand names phonetically (e.g. "Voz Vox") and avoid raw abbreviations ("in the morning", not "A M"); write numbers as words. Trim trailing silence and re-run if a generation comes out short (autoregressive models occasionally under-generate). Verified working on transformers==4.57.1, torch==2.7.0.
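
The trim-and-retry tip can be sketched as two small helpers. Both are illustrative assumptions rather than part of the model's API: the amplitude threshold, the padding length, and the seconds-per-word heuristic for spotting a short generation are placeholder values to tune on your own audio.

```python
def trim_trailing_silence(samples, sample_rate, threshold=1e-3, keep_ms=100):
    """Drop near-silent samples from the end, keeping a short tail of padding.

    `samples` is any 1-D sequence of floats in [-1, 1], such as a list or a
    1-D NumPy array. The threshold and padding are assumed defaults.
    """
    last = len(samples) - 1
    while last >= 0 and abs(samples[last]) <= threshold:
        last -= 1
    if last < 0:
        return samples[:0]  # the whole clip is silent
    keep = int(sample_rate * keep_ms / 1000)
    return samples[: min(len(samples), last + 1 + keep)]


def looks_too_short(num_samples, sample_rate, text, min_seconds_per_word=0.15):
    """Heuristic: flag audio that is implausibly short for the amount of text.

    The 0.15 s-per-word floor is a rough assumption, not a measured value.
    """
    duration = num_samples / sample_rate
    return duration < len(text.split()) * min_seconds_per_word
```

One possible workflow is to read out.wav with soundfile.read, trim it, and call model.inference again if looks_too_short returns True for the trimmed audio.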

Training data & attribution

  • Dataset: IIT-Madras Indic TTS, English (Indian-English) subset, via the SPRINGLab/IndicTTS-English Hugging Face mirror (studio 48 kHz read speech).
  • The fine-tune was trained on a single-speaker subset of that corpus.

Required notice (IIT-M Indic TTS End User License Agreement):

COPYRIGHT 2016 TTS Consortium, TDIL, Meity - represented by Hema A. Murthy & S. Umesh, Department of Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.

The Indic TTS EULA grants a royalty-free, worldwide license to create and freely distribute derivative works (such as this model). See https://www.iitm.ac.in/donlab/indictts/ for the dataset and full license.

Limitations & responsible use

  • Trained on a single read-speech speaker; neutral style. Style/emotion control is not reliable yet (instruction-conditioning is wired but needs style-labeled training).
  • Telephony (8 kHz) quality not separately tuned; evaluate before production.
  • Voice likeness: this voice is derived from a real dataset speaker. Do not use it to impersonate any real person, for fraud, deception, or any unlawful/harmful purpose. Disclose AI-generated audio where required. The authors provide the weights "as is", without warranty.

License

  • This model's LoRA/code: Apache-2.0 (matching the base model).
  • Derived from MOSS-TTS-Nano (Apache-2.0) and IIT-M Indic TTS data (notice above retained).