Hinglish TTS — sub-100M (89.96M)

An 89.96M-parameter fixed-voice Hindi+English (Hinglish) code-switch TTS, compressed from a 443M XTTS-v2 fine-tune to under 100M while holding quality. On a held-out test set sized for statistical power (n=225), it is statistically at parity with its own 265M teacher on code-switch accent, passes the teacher-relative checks on naturalness (UTMOS) and voice fidelity (SECS), and has a lower runaway-generation rate (4.4% vs 6.7%).

It speaks 4 fixed voices (aadya, arjun, kaustubh, maya). It is not a zero-shot cloning model — dropping general speaker capacity is what makes <100M reachable for code-switch speech.

Lineage

model	params	repo
443M original Hinglish fine-tune	443M	`harrrshall/xtts-v2-hinglish-synthetic`
265M distilled + RL	265M	`harrrshall/xtts-hinglish-265m`
90M staged-prune + RFT (this model)	89.96M	this repo

Model comparison (same held-out Hinglish set, n=225, same decode + scorer)

Model	Params	Accent ↑	SECS ↑	Tail ↓
XTTS-Hinglish-443M	443M	0.861	0.855	4.9%
XTTS-Hinglish-265M	265M	0.831	0.860	6.7%
XTTS-Hinglish-90M (this)	89.96M	0.820	0.851	4.4%
Kokoro-82M	82M	0.886*	n/a**	n/a

* Kokoro's accent score (English-word recall) is favoured by its English-primary design.
** Kokoro uses its own single voice (no target-voice cloning) and is not code-switch tuned, so SECS does not apply.
The point of this row: a generic 82M TTS does not deliver fixed-voice Hindi-English code-switch; this 90M model does, at the 265M teacher's quality.

Certification (held-out n=225, paired vs the 265M teacher, bootstrap 95% CI + TOST)

axis	this 90M	265M teacher	delta	95% CI
code-switch accent	0.820	0.831	-0.011	[-0.038, +0.016]
voice fidelity (SECS)	0.851	0.860	-0.009	[-0.014, -0.003]
runaway-tail rate	4.4%	6.7%	-2.3 pp	not reported

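Accent is statistically at parity: the delta of -0.011 has a 95% CI that spans zero. SECS is slightly lower (its CI excludes zero) but passes non-inferiority, and the runaway-tail rate is lower than the teacher's (4.4% vs 6.7%). The result is a roughly 3x smaller model (89.96M vs 265M) at the same code-switch quality.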
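For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a paired percentile bootstrap and a CI-inclusion equivalence check. It is not the evaluation code behind the table: the function names are hypothetical, and the margin of 0.05 is an assumed value chosen for illustration, since the margin actually used is not stated here.

```python
import random
import statistics


def paired_bootstrap_ci(student, teacher, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(student - teacher) over paired items."""
    if len(student) != len(teacher):
        raise ValueError("paired scores must have equal length")
    diffs = [s - t for s, t in zip(student, teacher)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-item differences with replacement, n_boot times.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo = means[round((alpha / 2) * n_boot)]
    hi = means[round((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi


def is_equivalent(lo, hi, margin):
    """Equivalence: the whole interval lies inside (-margin, +margin)."""
    return -margin < lo and hi < margin


def is_non_inferior(lo, margin):
    """Non-inferiority: the lower bound stays above -margin."""
    return lo > -margin


# Applied to the intervals reported in the table above.
MARGIN = 0.05  # hypothetical margin, for illustration only
print(is_equivalent(-0.038, 0.016, MARGIN))  # accent
print(is_non_inferior(-0.014, MARGIN))       # SECS
```

TOST at level alpha corresponds to checking a (1 - 2*alpha) interval against the margins, so running the inclusion check on a 95% interval is the stricter of the two readings.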

How it was built

Structured width-prune, staged: d=1024 -> d=768 -> d=640, with a distillation-recovery pass between each cut. A one-shot d=1024 -> 640 cut failed (the model stopped following text); the staged route, with each student initialized from the recovered intermediate, reached parity. Heads 16 -> 10 (head_dim 64 kept), FFN 4096 -> 2560, 16 layers.
Fixed-voice specialization: the speaker encoder + perceiver are dropped; 4 voices are baked (32x640 conditioning latents + a 512-d vocoder d-vector each). A learned 640 -> 1024 adapter feeds the frozen base XTTS HiFi-GAN.
Multi-signal distillation from the 265M teacher (code CE + logit-KL + latent MSE/cos), then RFT on the model's own best rollouts to suppress the runaway tail and lock code-switch faithfulness.
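As a rough sanity check on the width schedule above, the sketch below counts only the attention and feed-forward weight matrices of a standard GPT-2-style block. That block layout is an assumption, as is the 16-layer depth at the d=1024 starting point. Biases, layer norms, embeddings, output heads, and the 640 -> 1024 adapter are ignored, and the helper names are hypothetical.

```python
HEAD_DIM = 64   # kept fixed across the prune, per the text
N_LAYERS = 16   # assumed unchanged by the width prune


def block_weights(d_model, d_ffn):
    """Weight-matrix parameters in one transformer block (no biases or norms)."""
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 2 * d_model * d_ffn      # up and down projections
    return attn + ffn


def stack_weights(d_model, d_ffn, n_layers=N_LAYERS):
    """Weight-matrix parameters across the whole block stack."""
    return n_layers * block_weights(d_model, d_ffn)


before = stack_weights(1024, 4096)
after = stack_weights(640, 2560)

print(1024 // HEAD_DIM, 640 // HEAD_DIM)  # head counts: 16 10
print(before, after)                      # 201326592 78643200
print(after / before)                     # 0.390625, i.e. (640 / 1024) ** 2
```

Under these assumptions the 16 blocks account for about 78.6M of the 89.96M parameters, with the remainder left to the components ignored here.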

Usage

pip install coqui-tts soundfile
python inference.py --voice maya --text "आज office में एक important meeting है तो मैं busy रहूँगा" --out out.wav

The frozen HiFi-GAN vocoder, tokenizer, and DVAE come from the public base XTTS-v2 (auto-downloaded by coqui-tts on first run). Only the 90M GPT + adapter + baked voices are in student640b_rft.pt.
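To render the same sentence in all four voices, one option is to build the command line shown above once per voice. This is a sketch that uses only the `--voice`, `--text`, and `--out` flags from the example; the helper name and the output file naming are hypothetical.

```python
import shlex

VOICES = ["aadya", "arjun", "kaustubh", "maya"]


def build_commands(text, out_prefix="out"):
    """Return one inference.py argument list per baked voice."""
    return [
        ["python", "inference.py",
         "--voice", voice,
         "--text", text,
         "--out", f"{out_prefix}_{voice}.wav"]
        for voice in VOICES
    ]


if __name__ == "__main__":
    sentence = "आज office में एक important meeting है तो मैं busy रहूँगा"
    for cmd in build_commands(sentence):
        # Print the shell-quoted command; pass cmd to subprocess.run(cmd, check=True)
        # instead to actually run it.
        print(shlex.join(cmd))
```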

Notes

Write Hindi in Devanagari and English in Latin script, and use the language tag "hi". Spell numbers as words.
Chunk text over ~150 characters.
Greedy decoding with repetition_penalty≈1.3 is the most faithful.
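The chunking advice above can be sketched as a small pre-processing helper. This is one possible approach, not part of inference.py: the function name is hypothetical, and it splits at sentence-final punctuation (including the Devanagari danda) before falling back to word boundaries.

```python
import re


def chunk_text(text, max_chars=150):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split after sentence-final punctuation, including the Devanagari danda.
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Fall back to word-level pieces when a single sentence is too long.
        pieces = [sentence] if len(sentence) <= max_chars else sentence.split()
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    long_text = "आज office में एक important meeting है। " * 10
    for i, chunk in enumerate(chunk_text(long_text)):
        print(i, len(chunk), chunk)
```

Each chunk can then be synthesized separately and the audio concatenated. A single word longer than `max_chars` is kept whole, so it is the one case where a chunk can exceed the limit.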

Files

student640b_rft.pt — the 90M GPT + 640->1024 adapter + 4 baked voices
student640.py — the model definition (structured slicing, Student640, build_student_gpt)
inference.py — self-contained inference

UTMOS is English-MOS-trained and used here only as a relative not-degraded-vs-teacher signal, not an absolute Hinglish naturalness score.
