XTTS-v2 Hinglish (trained on 100% synthetic data)

A Hinglish (Hindi-English code-switched) text-to-speech model. It is XTTS-v2 fine-tuned on a corpus that is entirely synthetic (generated by a teacher TTS), with no real human training audio. The point of the project was to test whether a synthetic-only corpus can produce a Hinglish model that does not lose the teacher's quality. It can: on 89 held-out, sentence- and voice-matched pairs, the student's intelligibility, accent, and speaker-similarity deltas stay within about 3% of the teacher, and its UTMOS is marginally higher.

The headline contribution is the synthetic-data recipe, documented in full below.

Results (student vs teacher, n=89 paired held-out, bootstrap 95% CI)

Axis	Metric	Delta (student - teacher)	95% CI
Intelligibility	Qwen3-ASR content recall	-1.6%	[-3.3%, 0.0%]
Accent (English-as-English)	English-ASR English-word recall	-3.0%	[-6.2%, -0.1%]
Naturalness	UTMOS (utmos22_strong)	+0.10	[+0.02, +0.18]
Voice copy-fidelity	resemblyzer speaker cosine	+0.9%	[+0.3%, +1.5%]

The teacher was confirmed natural by a human listener before distillation. The student matches it within small margins, so it inherits a verified-good quality bar. UTMOS is an English-trained proxy, so treat it as a relative signal that quality has not degraded, not as an absolute naturalness score.

Voices

Four fixed voices: kaustubh and arjun (male), maya and aadya (female). Each ships with a reference clip under refs/. The model is a zero-shot cloner, so you condition on a reference clip at inference; use the bundled refs to get these exact voices.

How to use

Dependency set that is known to work (XTTS on recent torch needs torchcodec, and coqui-tts 0.27.5 needs transformers in the 4.57.x range):

pip install "coqui-tts==0.27.5" "transformers==4.57.6" torch torchaudio torchcodec

from huggingface_hub import snapshot_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import soundfile as sf

# download the model files (checkpoint, config, vocab, reference clips)
repo = snapshot_download("harrrshall/xtts-v2-hinglish-synthetic")

# build the model from its config and load the fine-tuned weights
config = XttsConfig()
config.load_json(f"{repo}/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path=f"{repo}/model.pth",
                      vocab_path=f"{repo}/vocab.json", use_deepspeed=False)
model.cuda().eval()  # requires a CUDA GPU

# clone one of the four voices from its reference clip
gpt_cond, speaker_emb = model.get_conditioning_latents(audio_path=[f"{repo}/refs/maya.wav"])

# Hindi in Devanagari, English in Latin; "hi" is the language token for Hinglish
text = "Please इस bug का fix deploy कर दो और मुझे pull request का link भेज देना।"
out = model.inference(text, "hi", gpt_cond, speaker_emb, temperature=0.7)
sf.write("out.wav", out["wav"], 24000)  # output sample rate is 24 kHz

Input conventions (important):

Write Hindi words in Devanagari and English words in Latin (e.g. मुझे वो startup का idea पसंद आया).
Spell numbers as words, not digits. The 'hi' number expander is not implemented and digits raise an error.
Keep text under ~150 characters per call (the 'hi' path warns and may truncate longer text); chunk longer input on clause boundaries.
Language token is "hi" for Hinglish (the Hindi tokenizer carries the Latin English spans).
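
The chunking and digit conventions above can be enforced before calling the model. The following is a minimal sketch, not part of the released code: the helper names has_digits and chunk_text are hypothetical, the 150-character limit is the approximate figure quoted above, and "clause boundary" is assumed to mean whitespace after a comma, semicolon, full stop, question mark, exclamation mark, or danda.

```python
import re

MAX_CHARS = 150  # approximate per-call limit quoted above

def has_digits(text: str) -> bool:
    """True if the text contains any digit (ASCII or Devanagari)."""
    return any(ch.isdigit() for ch in text)

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack clauses into chunks of at most max_chars characters."""
    # Split on whitespace that follows clause-ending punctuation,
    # keeping the punctuation attached to the clause it closes.
    clauses = [c.strip() for c in re.split(r"(?<=[,;.!?।])\s+", text) if c.strip()]
    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = clause
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to model.inference separately and the waveforms concatenated. A single clause longer than max_chars is returned as one oversize chunk, so very long clauses still need manual splitting.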

Smaller download — fp16 build (optional, no quality loss)

model_fp16.pth (~1.04 GB, half of model.pth's 2.08 GB) is an optional smaller build. The weights are stored in fp16 but upcast to fp32 when the model loads, so inference math is unchanged — this is not runtime half-precision (model.half()), which XTTS's vocoder can't do safely. To use it, just swap the checkpoint path:

model.load_checkpoint(config, checkpoint_path=f"{repo}/model_fp16.pth",
                      vocab_path=f"{repo}/vocab.json", use_deepspeed=False)

The absence of quality loss was verified in three independent ways (paired, with bootstrap 95% CIs; SECS is the speaker-encoder cosine similarity):

Test	Result
Numerical equivalence (deterministic)	latent/speaker cosine ≈ 0.999999; greedy decode bit-identical tokens on all voices; vocoder SNR 35-42 dB
Greedy-paired (isolates fp16 rounding)	UTMOS −0.010 [−0.034, +0.003]; SECS −0.0005 [−0.0016, +0.0001]
Sampled-paired (temp 0.7, real usage)	UTMOS +0.007 [−0.005, +0.020]; SECS −0.0009 [−0.0021, −0.0000]

Every delta is within metric noise and an order of magnitude below the student-vs-teacher gaps above. Method and scripts: scripts/hinglish/fp16/ and docs/FP16_VERIFICATION.md in the repo.
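
For readers unfamiliar with the two quantities in the numerical-equivalence row, here is a minimal sketch of how a cosine similarity and a signal-to-noise ratio in dB are commonly computed. It is an illustration of the standard definitions, not the verification scripts themselves: the function names are hypothetical, and it assumes the fp32 output is the reference and the fp16-stored build's output is the test signal.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def snr_db(reference, test):
    """Signal-to-noise ratio in dB, treating (reference - test) as noise."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, test))
    if noise == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(signal / noise)
```

Under these definitions, a cosine near 1 means the two latent vectors point in almost the same direction, and a higher SNR means a smaller waveform difference relative to the signal energy.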

The synthetic-data recipe (the core of this project)

No real Hinglish speech was used for training. The corpus was manufactured, filtered, and balanced in five steps.

1. Teacher

Speech was generated by the teacher TTS, which does native intra-sentence Hindi-English code-switching. A human verified a sample of teacher clips as natural before distillation, so the teacher defines the quality bar. Four Indian voices were used as the fixed voice set.

2. Text corpus (real seed + generated, two quality gates)

Seed: ~1,497 real spontaneous Hinglish transcripts (in-the-wild YouTube speech) provided the natural code-switch distribution and a hard out-of-domain eval set.
Generation: dense code-switched Hinglish was LLM-generated across 8 conversational domains (assistant replies, customer support, friends, vlog, tech/work, reactions, errands, opinions), because the seed lacked a high-code-switch tail.
Code-mixing target: measured by the Code-Mixing Index (minority-language fraction over Hindi/English content tokens): none=0, low<=0.20, med<=0.40, high>0.40. The generation targeted the high bin.
Gate 1 (lexicon, deterministic): a Hindi word written in Latin would be miscounted as English and fake a high-code-switch score. Each Latin token is classified English only if it is in an English dictionary and not a romanized-Hindi-only form (the Dakshina romanized-Hindi lexicon minus the English dictionary). The true cmi_bin is recomputed; script violations, low-bin items, and near-duplicates are dropped.
Gate 2 (LLM naturalness): a conservative pass dropped lines that pass the metric but that no real speaker would say (forced code-switches, garbled numbers, romanized-Hindi homographs a TTS would mispronounce).
Outcome: 1,067 verified high-code-switch transcripts; the full corpus (with seed) is 1,470 rows, 36% high-code-switch, token-entropy 1.05x the real baseline (more diverse than real).
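
The Code-Mixing Index and its bins, as described in this step, can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: it counts every token rather than only content tokens, it uses a small caller-supplied english_words set as a stand-in for the English-dictionary and Dakshina lexicon check of Gate 1, and the function names are hypothetical.

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def cmi(tokens, english_words):
    """Minority-language fraction over Hindi/English tokens."""
    n_en = n_hi = 0
    for tok in tokens:
        word = tok.strip(".,!?।;:\"'()").lower()
        if not word:
            continue
        if DEVANAGARI.search(word):
            n_hi += 1
        elif word in english_words:
            n_en += 1
        else:
            # Latin-script token that is not English:
            # assumed here to be romanized Hindi.
            n_hi += 1
    total = n_en + n_hi
    return min(n_en, n_hi) / total if total else 0.0

def cmi_bin(score):
    """Map a CMI score to the bins quoted above."""
    if score == 0:
        return "none"
    if score <= 0.20:
        return "low"
    if score <= 0.40:
        return "med"
    return "high"
```

For the example sentence from the input conventions, मुझे वो startup का idea पसंद आया, two of the seven tokens are English, so the score is 2/7 (about 0.29) and the sentence falls in the med bin.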

3. Synthesis

Each transcript was voiced by the teacher across all four voices at 24 kHz, giving 5,880 clips, 8.56 hours.
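
As a quick consistency check on these figures, the clip count and the implied mean clip length can be derived from the numbers quoted in this section. The mean is only an average implied by the totals; the actual distribution of clip lengths is not stated.

```python
transcripts = 1470   # full corpus rows
voices = 4           # fixed voice set
total_hours = 8.56   # total synthesized audio

clips = transcripts * voices
mean_clip_seconds = total_hours * 3600 / clips

print(clips, round(mean_clip_seconds, 2))  # 5880 5.24
```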

4. Filtering (convention-robust, bin-aware)

Recognizer: Qwen3-ASR transcribes each synthetic clip; quality is gated on whether the intended words are recovered.
Convention-robust matching: the ASR transcribes English loanwords phonetically in Devanagari, so a naive WER is meaningless across scripts. Acceptance instead uses a content-word recall that romanizes both the reference and the ASR output into a common script, then matches words using edit ratio, a phonetic fold, and a consonant skeleton, so a correct English word counts as recovered even when the ASR writes it in Devanagari.
Bin-aware thresholds: the ASR's recall ceiling drops with code-switch density even on known-good audio, so a flat threshold gutted the high-code-switch clips (35% kept). Thresholds are set per bin, a small margin below the recognizer's verified-good ceiling (recall floors: none 0.85, low 0.83, med 0.76, high 0.70). This recovers good dense clips without admitting degraded ones.
Outcome: 2,945 accepted clips (2,855 train / 90 eval), four voices balanced.
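
The bin-aware acceptance rule can be sketched as a lookup of the recall floor for each clip's code-switch bin. This is a minimal illustration that takes the content-word recall as an already-computed input. The floors are the ones quoted above; the function names and the inclusive comparison (recall at the floor is accepted) are assumptions.

```python
RECALL_FLOORS = {"none": 0.85, "low": 0.83, "med": 0.76, "high": 0.70}

def accept_clip(recall, cmi_bin, floors=RECALL_FLOORS):
    """Accept a clip if its content-word recall meets its bin's floor."""
    return recall >= floors[cmi_bin]

def keep_rate_by_bin(clips, floors=RECALL_FLOORS):
    """clips: iterable of (recall, cmi_bin). Returns fraction kept per bin."""
    kept, seen = {}, {}
    for recall, cmi_bin in clips:
        seen[cmi_bin] = seen.get(cmi_bin, 0) + 1
        kept[cmi_bin] = kept.get(cmi_bin, 0) + int(accept_clip(recall, cmi_bin, floors))
    return {b: kept[b] / seen[b] for b in seen}

# Toy data (made up for illustration): a flat 0.85 floor rejects dense
# clips that the per-bin floors accept.
toy = [(0.90, "none"), (0.72, "high"), (0.80, "high"), (0.65, "high")]
flat = {b: 0.85 for b in RECALL_FLOORS}
print(keep_rate_by_bin(toy, flat)["high"], keep_rate_by_bin(toy)["high"])
```

With the toy recalls, the flat floor keeps none of the three high-bin clips, while the per-bin floors keep two of them and still reject the 0.65 clip.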

5. Collapse guardrails

Following the Stability-Expressivity work on "Synthetic Erosion", token entropy and 4-gram repetition are tracked at corpus, manifest, and student-output stages, so pure-synthetic scale does not quietly flatten prosody.
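
A minimal sketch of the two guardrail statistics is given below. The exact definitions and the token type used in the project (text tokens or audio codes) are not stated here, so this assumes the common forms: Shannon entropy of the unigram token distribution in bits, and the fraction of 4-grams that repeat an earlier 4-gram. The function names are hypothetical.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the unigram distribution over tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def ngram_repetition(tokens, n=4):
    """Fraction of n-grams that duplicate another n-gram in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

Tracking both numbers at each stage gives an early warning: falling entropy or rising repetition relative to the real-speech baseline would suggest the synthetic data or the student output is becoming less diverse.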

The training process

Base model: coqui/XTTS-v2. Only the autoregressive GPT component is fine-tuned (the DVAE and the decoder/vocoder are inherited from the base model).
Multi-speaker: all four voices are trained together; each sample conditions on a reference clip of its own speaker. Language token "hi".
Key model args (GPTArgs): max_text_length=200, max_wav_length=255995 (~11.6 s), max_conditioning_length=132300, gpt_use_perceiver_resampler=True, gpt_use_masking_gt_prompt_approach=True; audio config sample_rate=22050, output_sample_rate=24000.
Optimizer / schedule: AdamW, betas [0.9, 0.96], eps 1e-8, weight decay 1e-2, learning rate 5e-6, MultiStepLR.
Run: batch size 8, gradient accumulation 4 (effective batch 32), 8 epochs, single NVIDIA H200. A smoke run (a few steps on a tiny subset) validated the loop before the full run. loss_mel_ce fell from ~3.4 to ~2.7; clean exit; best checkpoint saved.
Working dependency set: torch 2.12, torchaudio, torchcodec 0.14, transformers 4.57.6, coqui-tts 0.27.5. (transformers 5.x removes a symbol coqui-tts imports; torch >=2.9 requires torchcodec for audio IO.)
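
The length limits and batch settings above translate into durations and step counts as follows. This is back-of-envelope arithmetic from the quoted values, assuming both length limits are sample counts at the 22,050 Hz training rate and ignoring trainer details such as dropped last batches, so the step counts are approximate.

```python
import math

SAMPLE_RATE = 22050               # audio config sample_rate
MAX_WAV_LENGTH = 255995           # samples
MAX_CONDITIONING_LENGTH = 132300  # samples
BATCH_SIZE = 8
GRAD_ACCUM = 4
EPOCHS = 8
TRAIN_CLIPS = 2855                # accepted training clips

max_clip_seconds = MAX_WAV_LENGTH / SAMPLE_RATE          # about 11.6 s
max_ref_seconds = MAX_CONDITIONING_LENGTH / SAMPLE_RATE  # 6.0 s
effective_batch = BATCH_SIZE * GRAD_ACCUM                # 32
steps_per_epoch = math.ceil(TRAIN_CLIPS / effective_batch)  # approximate
total_steps = steps_per_epoch * EPOCHS

print(round(max_clip_seconds, 1), max_ref_seconds, effective_batch,
      steps_per_epoch, total_steps)
```

Under these assumptions the run is on the order of 90 optimizer steps per epoch and about 720 in total.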

Evaluation methodology

Four programmatic metrics on 89 held-out sentences, synthesized by both the student and the teacher on identical sentences and voices, with paired bootstrap 95% CIs: Qwen3-ASR content recall (intelligibility), an English-ASR English-word recall (a code-switch accent proxy that checks the embedded English is recovered as English), UTMOS (naturalness), and resemblyzer speaker cosine (voice copy-fidelity, same-voice 0.87 vs cross-voice 0.61).
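
A paired bootstrap confidence interval of the kind reported here can be sketched as below. This is a generic percentile bootstrap over per-pair deltas, given as an illustration only: the number of resamples, the seed, and the percentile method are assumptions, and the function name is hypothetical.

```python
import random

def paired_bootstrap_ci(student, teacher, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean paired delta (student - teacher) with a percentile bootstrap CI."""
    deltas = [s - t for s, t in zip(student, teacher)]
    n = len(deltas)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample pairs with replacement, keeping each pair intact.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi
```

Resampling whole pairs, rather than student and teacher scores independently, is what makes the interval paired: sentence and voice difficulty cancel within each delta, which tightens the interval on matched data.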

Limitations

License: non-commercial. This is a derivative of XTTS-v2 under the Coqui Public Model License (CPML); it inherits CPML and is for research / non-commercial use. The training audio was generated by the teacher TTS; review its terms before any redistribution or commercial use.
Four fixed voices from one teacher engine; the voices are a fairly homogeneous family.
Accent is ~3% below the teacher, and sentence-initial English markers (e.g. "Wait") are occasionally mispronounced.
Operational: spell numbers as words (digits error out), chunk text over ~150 chars.
Absolute naturalness is not human-MOS-certified; the evidence is relative parity to a human-verified teacher, plus an English-biased UTMOS proxy.
Single teacher: style and artifact diversity are bounded by the teacher TTS.

Provenance and intended use

Research artifact demonstrating a synthetic-data pipeline for low-resource code-switched TTS. Not a production voice product. Do not use it to impersonate real individuals; the voices are synthetic teacher voices, and any cloning should respect consent and the licenses above.
