XTTS v2 — Indian English Fine-Tune

A fine-tuned version of Coqui XTTS v2 adapted for Indian-accented English speech synthesis.

XTTS v2 is a 518M-parameter GPT-based multilingual TTS model with zero-shot voice cloning. This fine-tune improves naturalness, prosody, and pronunciation for Indian-English speakers and vocabulary: Indian names, Indian city names, the lakh/crore number system, and tech acronyms.

Checkpoint Info

Detail	Value
Best step	11,074
Total steps trained	11,250
Best eval loss	2.697 (down from 3.766 at start)
Eval mel loss	2.670 (−1.065 from start)
Eval text loss	0.027 (−0.003 from start)

The checkpoint (best_model.pth) is a full training checkpoint: it contains both the model weights and the optimizer state, so it can be used for inference and for resuming training.

Model weights only: ~2 GB
With optimizer state (Adam m + v buffers): 5.3 GB total
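
If only inference is needed, the optimizer state can be dropped to bring the file closer to the weights-only size. The following is a minimal sketch, not part of the published repo. It assumes the checkpoint is a plain dict with the weights under a `model` key and training state under keys such as `optimizer` and `scaler`; verify this by printing `ckpt.keys()` first. The helper names and the output filename are hypothetical.

```python
# Keys assumed to hold training-only state; check them against ckpt.keys().
TRAINING_ONLY_KEYS = ("optimizer", "scaler")


def strip_training_state(ckpt: dict) -> dict:
    """Return a shallow copy of the checkpoint without training-only entries."""
    return {k: v for k, v in ckpt.items() if k not in TRAINING_ONLY_KEYS}


def save_weights_only(src: str = "best_model.pth",
                      dst: str = "model_weights_only.pth") -> None:
    """Load the full checkpoint on CPU, drop training state, save the rest."""
    import torch  # imported lazily so the pure helper above needs no torch

    # weights_only=False unpickles arbitrary objects: use on trusted files only.
    ckpt = torch.load(src, map_location="cpu", weights_only=False)
    torch.save(strip_training_state(ckpt), dst)
```

A file stripped this way can still be loaded for inference, but it can no longer resume training with the saved optimizer state.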

Files

File	Size	Description
`best_model.pth`	5.3 GB	Full training checkpoint (weights + optimizer state)
`config.json`	6 KB	Training config (required for inference and resuming training)

Additional files required for inference (not stored here — auto-downloaded by the TTS library on first use):

File	Source
`vocab.json`	Downloaded from Coqui's servers automatically
`dvae.pth`	Downloaded from Coqui's servers automatically
`mel_stats.pth`	Downloaded from Coqui's servers automatically

Inference

Install

pip install "TTS>=0.22.0" "torch>=2.1" "torchaudio>=2.1"

Basic inference (voice cloning)

from TTS.api import TTS

tts = TTS(model_path="best_model.pth", config_path="config.json").to("cuda")

tts.tts_to_file(
    text="The meeting is on 23rd April at 4:30 PM IST. Dr. Narayanan from Chennai will present.",
    speaker_wav="reference.wav",   # 5-10 sec clean WAV of the target speaker at 24 kHz
    language="en",
    file_path="output.wav"
)

Inference with the XTTS model directly

import soundfile as sf
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config and model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="best_model.pth", eval=True)
# Use the GPU when one is available; CPU inference works but is much slower
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Get speaker conditioning from reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# Synthesize
out = model.inference(
    text="Dr. Narayanan Subramanian from Chennai will meet Aravind Sridhar tomorrow.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
    repetition_penalty=2.5,
    top_k=50,
    top_p=0.85,
)

# Write the waveform at the model's 24 kHz output rate
sf.write("output.wav", out["wav"], 24000)

Reference audio requirements

Format: WAV, mono
Sample rate: 24 kHz (22050 Hz also works; the TTS library resamples)
Duration: 5–10 seconds (longer clips do not help)
Quality: Clean, no background noise, no music
Same speaker as your target voice
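
The measurable requirements above (channels, sample rate, duration) can be checked before synthesis. The sketch below uses only Python's standard `wave` module, which reads PCM WAV files. The function name and the returned messages are illustrative, and noise or music content cannot be detected this way.

```python
import wave


def check_reference_wav(path: str, min_sec: float = 5.0,
                        max_sec: float = 10.0) -> list[str]:
    """Return a list of warnings for a reference WAV (empty list = looks fine)."""
    warnings = []
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    if channels != 1:
        warnings.append(f"expected mono, found {channels} channels")
    if rate not in (24000, 22050):
        warnings.append(f"sample rate is {rate} Hz, not 24000 or 22050 Hz")
    if not min_sec <= duration <= max_sec:
        warnings.append(f"duration {duration:.1f} s is outside {min_sec}-{max_sec} s")
    return warnings
```

An empty result only means the file matches the format guidelines; audio quality still has to be judged by ear.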

What it handles well

Indian names: Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar
Indian cities: Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
Indian number system: lakh and crore amounts, e.g. Rs.7,25,000 → "seven lakh twenty five thousand rupees"
Tech acronyms: DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
Natural Indian-English prosody: the accent and rhythm of Indian English speech
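
For readers unfamiliar with the lakh/crore system mentioned in the list above, the sketch below shows how the digit grouping works: the last three digits form one group and the rest are grouped in twos. It is an illustration only and not the text normalizer used by the model; both function names are hypothetical.

```python
def indian_grouping(n: int) -> str:
    """Format a non-negative integer with Indian digit grouping (3, then 2s)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    # Peel two digits at a time off the right of the leading part.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


def lakh_crore_parts(n: int) -> dict[str, int]:
    """Split a non-negative integer into crore, lakh, thousand, and remainder."""
    crore, rest = divmod(n, 10_000_000)   # 1 crore = 1,00,00,000
    lakh, rest = divmod(rest, 100_000)    # 1 lakh  = 1,00,000
    thousand, rest = divmod(rest, 1_000)
    return {"crore": crore, "lakh": lakh, "thousand": thousand, "rest": rest}
```

For example, `indian_grouping(725000)` returns `"7,25,000"`, and `lakh_crore_parts(725000)` gives 7 lakh and 25 thousand, matching the spoken form quoted above.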

Resuming Training / Fine-Tuning Further

The checkpoint includes full optimizer state, so training can be resumed exactly where it left off.

Dataset format

Pipe-delimited CSV, no header, 2 columns:

clip_0001|The total cost is Rs.7,25,000 for the Bengaluru office setup.
clip_0002|Dr. Narayanan will present the quarterly results tomorrow at 4 PM IST.
clip_0003|Please forward the report to admin at example dot com by Friday.

Audio files: wavs/clip_0001.wav, wavs/clip_0002.wav, etc.

Format: mono WAV, 22050 Hz, 1–24 seconds per clip
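
Before training, it can help to confirm that every metadata row has a matching clip within the stated duration range. The sketch below assumes the metadata file and a `wavs/` directory laid out as described above; the function name is hypothetical, and only PCM WAV files readable by the standard `wave` module are supported.

```python
import os
import wave


def load_metadata(meta_path: str, wav_dir: str,
                  min_sec: float = 1.0, max_sec: float = 24.0):
    """Parse id|text lines and return (valid_rows, problems)."""
    rows, problems = [], []
    with open(meta_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            parts = line.split("|")
            if len(parts) != 2:
                problems.append(f"line {lineno}: expected 2 columns, found {len(parts)}")
                continue
            clip_id, text = parts
            wav_path = os.path.join(wav_dir, clip_id + ".wav")
            if not os.path.isfile(wav_path):
                problems.append(f"line {lineno}: missing {wav_path}")
                continue
            with wave.open(wav_path, "rb") as wf:
                duration = wf.getnframes() / wf.getframerate()
            if min_sec <= duration <= max_sec:
                rows.append((clip_id, text))
            else:
                problems.append(f"line {lineno}: {clip_id} lasts {duration:.1f} s")
    return rows, problems
```

Calling `load_metadata("metadata.csv", "wavs")` (the metadata filename here is an assumption) returns the usable rows plus a list of human-readable problems to fix.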

Resume training from this checkpoint

# In your train script, point the base model at this checkpoint:
XTTS_CHECKPOINT = "best_model.pth"     # this file
XTTS_CONFIG     = "config.json"        # this file

# Then run the standard XTTS v2 training recipe:
# TTS/recipes/ljspeech/xtts_v2/train_gpt_xtts.py
# with formatter="thorsten" for 2-column id|text metadata

Key config values used during training (from config.json):

Parameter	Value
`lr`	`5e-6`
`batch_size`	`2` (per GPU, on 2 GPUs)
`eval_split_size`	`0.1` (10% held out)
`eval_split_max_size`	`80` clips
`save_step`	`500`
`lr_scheduler`	`MultiStepLR`
`lr_scheduler milestones`	`[3500, 7000, 10000]`
`lr_scheduler gamma`	`0.5` (halved at each milestone)
`gpt_cond_len`	`12` seconds
`gpt_cond_chunk_len`	`4` seconds
`gpt_max_audio_tokens`	`605`
`gpt_max_text_tokens`	`402`
`distributed_backend`	`nccl` (multi-GPU DDP)
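
To see what the scheduler rows in the table imply, the learning rate at any step can be computed in closed form. This sketch assumes the scheduler is stepped once per training step, so the milestones are step counts; it mirrors the closed form of PyTorch's `MultiStepLR` without importing torch. The function name is hypothetical.

```python
from bisect import bisect_right

BASE_LR = 5e-6
MILESTONES = [3500, 7000, 10000]
GAMMA = 0.5


def lr_at_step(step: int, base_lr: float = BASE_LR) -> float:
    """Learning rate in effect at `step` under a MultiStepLR-style decay."""
    # Each milestone already reached multiplies the rate by GAMMA once.
    return base_lr * GAMMA ** bisect_right(MILESTONES, step)
```

Under that assumption the rate is 5e-6 until step 3,500 and has been halved three times, to 6.25e-7, by the best step (11,074).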

Tips for further fine-tuning

Lower the LR when resuming: the model has already converged, so use lr: 1e-6 or 2e-6
More data is better: 1000+ clips is the sweet spot, with diminishing returns above 5000
Watch eval mel loss: it should stay below 2.7; if it climbs, the model is overfitting
Reference audio matters more than the model: use a very clean 8-second clip for best voice cloning results
OOM during training: reduce max_wav_len to 220500 (10 sec at 22050 Hz) or lower batch_size to 1
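
The `max_wav_len` value in the OOM tip is a sample count, so other limits follow from the clip sample rate. This is a small sketch of that arithmetic, assuming the 22050 Hz training clips described in the dataset format; the helper name is hypothetical.

```python
SAMPLE_RATE = 22050  # Hz, the sample rate of the training clips


def max_wav_len_for(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a maximum clip duration in seconds to a length in samples."""
    return int(round(seconds * sample_rate))
```

For example, `max_wav_len_for(10)` returns 220500, the value suggested in the tip, and `max_wav_len_for(8)` gives a tighter limit of 176400.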

Training History

Run	Steps	Best eval loss	Notes
Run 1 (Mar 17)	~1,650	2.735	Initial run, still converging
Run 2 (Mar 18)	~5,085	2.794	Different config, higher loss
Run 3 (Mar 18)	11,074	2.697	This checkpoint — best overall
Run 4 (Mar 21)	~6,885	2.829	Interrupted, worse result

Run 3 achieved the lowest eval loss across all experiments and is the checkpoint published here.

Credits

Coqui TTS / XTTS v2: base model and training framework (the Coqui TTS code is released under MPL 2.0; the XTTS v2 weights under CPML)
Thorsten Müller: dataset format convention (id|text, two-column, pipe-delimited)

License

The fine-tuned weights follow the Coqui Public Model License (CPML) of the base XTTS v2 model.
