Hinglish TTS — sub-100M (89.96M)
An 89.96M-parameter fixed-voice Hindi+English (Hinglish) code-switch TTS, compressed from a 443M XTTS-v2 fine-tune to under 100M while holding quality. On a held-out powered set (n=225) it is statistically at parity with its own 265M teacher on code-switch accent, passes naturalness (UTMOS) and voice fidelity (SECS), and has a lower runaway-generation rate.
It speaks 4 fixed voices (aadya, arjun, kaustubh, maya). It is not a zero-shot cloning model — dropping general speaker capacity is what makes <100M reachable for code-switch speech.
Lineage
| model | params | repo |
|---|---|---|
| 443M original Hinglish fine-tune | 443M | harrrshall/xtts-v2-hinglish-synthetic |
| 265M distilled + RL | 265M | harrrshall/xtts-hinglish-265m |
| 90M staged-prune + RFT (this model) | 89.96M | this repo |
Model comparison (same held-out Hinglish set, n=225, same decode + scorer)
| Model | Params | Accent ↑ | SECS ↑ | Tail ↓ |
|---|---|---|---|---|
| XTTS-Hinglish-443M | 443M | 0.861 | 0.855 | 4.9% |
| XTTS-Hinglish-265M | 265M | 0.831 | 0.860 | 6.7% |
| XTTS-Hinglish-90M (this) | 89.96M | 0.820 | 0.851 | 4.4% |
| Kokoro-82M | 82M | 0.886* | n/a** | n/a |
* Kokoro's accent (English-word recall) is favoured by its English-primary design. ** Kokoro uses its own single voice (no target-voice cloning) and is not code-switch tuned, so SECS does not apply. The point of this row: a generic 82M TTS does not deliver fixed-voice Hindi-English code-switch; this 90M model does, at the 265M teacher's quality.
Certification (held-out n=225, paired vs the 265M teacher, bootstrap 95% CI + TOST)
| axis | this 90M | 265M teacher | delta | 95% CI |
|---|---|---|---|---|
| code-switch accent | 0.820 | 0.831 | -0.011 | [-0.038, +0.016] |
| voice fidelity (SECS) | 0.851 | 0.860 | -0.009 | [-0.014, -0.003] |
| runaway-tail rate | 4.4% | 6.7% | -2.3 pp | |
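Accent is statistically at parity: the delta of -0.011 has a 95% CI that spans zero. SECS is slightly lower (its CI excludes zero) but passes non-inferiority, and the runaway-tail rate is lower than the teacher's. The result is a model roughly 3x smaller (89.96M vs 265M) at the same code-switch quality.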
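As a rough illustration of the certification procedure, the sketch below computes a paired percentile-bootstrap CI on per-utterance score differences and then applies an interval-inclusion check in the spirit of TOST. It is not the evaluation script used for this model: the per-utterance scores are synthetic, and the function names and the 0.05 margin are assumptions, since the card does not state the equivalence margin. Checking a 95% CI against the margin is a slightly conservative form of TOST at the 5% level.

```python
import random
import statistics


def paired_bootstrap_ci(student, teacher, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean paired difference (student - teacher)."""
    rng = random.Random(seed)
    diffs = [s - t for s, t in zip(student, teacher)]
    n = len(diffs)
    # Resample utterances with replacement and record the mean difference each time.
    boot_means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo = boot_means[int(n_boot * alpha / 2)]
    hi = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.fmean(diffs), lo, hi


def within_margin(lo, hi, margin):
    """Equivalence: the whole interval lies inside (-margin, +margin)."""
    return -margin < lo and hi < margin


def non_inferior(lo, margin):
    """Non-inferiority: the lower bound stays above -margin."""
    return lo > -margin


# Synthetic per-utterance scores, only to exercise the functions (n=225 as in the card).
demo_rng = random.Random(1)
teacher_scores = [demo_rng.gauss(0.83, 0.10) for _ in range(225)]
student_scores = [t + demo_rng.gauss(-0.01, 0.05) for t in teacher_scores]

mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(student_scores, teacher_scores)
print(f"delta={mean_diff:+.3f}  95% CI=[{ci_lo:+.3f}, {ci_hi:+.3f}]")

MARGIN = 0.05  # hypothetical margin, not taken from the card
print("accent equivalent:", within_margin(-0.038, 0.016, MARGIN))  # CI from the table
print("SECS non-inferior:", non_inferior(-0.014, MARGIN))          # CI from the table
```

The verdict depends entirely on the margin chosen: the accent interval reported above would fail the same check under a tighter margin such as 0.02.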
How it was built
- Structured width-prune, staged: d=1024 -> d=768 -> d=640, with a distillation-recovery pass between each cut. A one-shot d=1024 -> 640 cut failed (the model stopped following text); the staged route, with each student initialized from the recovered intermediate, reached parity. Heads 16 -> 10 (head_dim 64 kept), FFN 4096 -> 2560, 16 layers.
- Fixed-voice specialization: the speaker encoder + perceiver are dropped; 4 voices are baked (32x640 conditioning latents + a 512-d vocoder d-vector each). A learned 640 -> 1024 adapter feeds the frozen base XTTS HiFi-GAN.
- Multi-signal distillation from the 265M teacher (code CE + logit-KL + latent MSE/cos), then RFT on the model's own best rollouts to suppress the runaway tail and lock code-switch faithfulness.
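The dimensions above can be sanity-checked with some back-of-envelope arithmetic. The sketch below estimates transformer-block weights at each pruning stage, plus the size of the baked voices and the adapter. Several things here are assumptions rather than facts from this card: the estimate ignores biases, layer norms, embeddings, and output heads; the FFN width at the d=768 stage is taken to follow the same 4x ratio as the endpoints; and the adapter is treated as a single weight matrix.

```python
HEAD_DIM = 64   # kept constant across stages, per the card
N_LAYERS = 16   # per the card


def block_weights(d_model, d_ffn):
    """Approximate weights in one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ffn   # up and down projections
    return attention + feed_forward


# (d_model, d_ffn) per stage; the 768 -> 3072 pair is an assumed 4x ratio.
stages = [(1024, 4096), (768, 3072), (640, 2560)]

for d_model, d_ffn in stages:
    heads = d_model // HEAD_DIM
    total = N_LAYERS * block_weights(d_model, d_ffn)
    print(f"d={d_model:4d}  heads={heads:2d}  ffn={d_ffn}  blocks~{total / 1e6:.1f}M")

# Baked voices: 4 voices, each 32x640 conditioning latents plus a 512-d d-vector.
voice_values = 4 * (32 * 640 + 512)

# Adapter 640 -> 1024, assumed here to be one weight matrix (bias ignored).
adapter_weights = 640 * 1024

print(f"baked voices: {voice_values:,} values")
print(f"adapter:      {adapter_weights:,} weights")
```

Under these assumptions the 16 blocks at d=640 account for about 78.6M of the 89.96M total; the remainder sits in components this estimate leaves out.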
Usage
pip install coqui-tts soundfile
python inference.py --voice maya --text "आज office में एक important meeting है तो मैं busy रहूँगा" --out out.wav
The frozen HiFi-GAN vocoder, tokenizer, and DVAE come from the public base XTTS-v2 (auto-downloaded by coqui-tts on first run). Only the 90M GPT + adapter + baked voices are in student640b_rft.pt.
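To render the same sentence in all four voices, the command above can be wrapped in a short script. This is a minimal sketch that only uses the --voice, --text, and --out flags shown above; the helper names and the output file naming are illustrative, and it assumes inference.py is in the current working directory.

```python
import subprocess
import sys

VOICES = ("aadya", "arjun", "kaustubh", "maya")


def build_command(voice, text, out_path):
    """Build the inference.py command line for one fixed voice."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    return [sys.executable, "inference.py",
            "--voice", voice, "--text", text, "--out", out_path]


def render_all_voices(text, run=subprocess.run):
    """Run inference once per voice, writing <voice>.wav for each."""
    outputs = []
    for voice in VOICES:
        out_path = f"{voice}.wav"
        run(build_command(voice, text, out_path), check=True)
        outputs.append(out_path)
    return outputs
```

Calling render_all_voices("आज office में एक important meeting है") would then produce one WAV file per voice.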
Notes
- Write Hindi in Devanagari, English in Latin, language tag "hi". Spell numbers as words.
- Chunk text over ~150 characters.
- Greedy decoding with repetition_penalty≈1.3 is the most faithful.
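One way to follow the chunking advice is to split on sentence boundaries first and fall back to whitespace for over-long sentences. The helper below is a sketch under that assumption, not part of inference.py. It counts Python code points, so Devanagari vowel signs count as separate characters, and a single word longer than the limit is kept whole.

```python
import re


def chunk_text(text, max_chars=150):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Split after sentence-ending punctuation, including the Devanagari danda.
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # A sentence that is too long on its own is broken at whitespace instead.
        pieces = [sentence] if len(sentence) <= max_chars else sentence.split()
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "आज office में एक important meeting है। " * 8
for chunk in chunk_text(sample):
    print(len(chunk), chunk)
```

Each chunk can then be synthesized separately and the resulting audio concatenated.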
Files
- student640b_rft.pt: the 90M GPT + 640->1024 adapter + 4 baked voices
- student640.py: the model definition (structured slicing, Student640, build_student_gpt)
- inference.py: self-contained inference
UTMOS is English-MOS-trained and used here only as a relative not-degraded-vs-teacher signal, not an absolute Hinglish naturalness score.
