XTTS v2: Indian English Fine-Tune
Recipe: xttsv2-recipe
A fine-tuned version of Coqui XTTS v2 adapted for Indian-accented English speech synthesis.
XTTS v2 is a 518M-parameter GPT-based multilingual TTS model with zero-shot voice cloning. This fine-tune improves naturalness, prosody, and pronunciation for Indian-English speakers and vocabulary: Indian names, Indian city names, the lakh/crore number system, and tech acronyms.
Checkpoint Info
| Detail | Value |
|---|---|
| Best step | 11,074 |
| Total steps trained | 11,250 |
| Best eval loss | 2.697 (down from 3.766 at start) |
| Eval mel loss | 2.670 (down 1.065) |
| Eval text loss | 0.027 (down 0.003) |
The checkpoint (best_model.pth) is a full training checkpoint: it contains both the model weights and the optimizer state, so you can run inference from it or resume training from it.
- Model weights only: ~2 GB
- With optimizer state (Adam m + v buffers): 5.3 GB total
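The size figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes 32-bit floats and two Adam buffers (m and v) per trained parameter. The function name and the choice of trained-parameter count are illustrative, not read from the checkpoint.

```python
def checkpoint_size_gb(n_params, n_trained, bytes_per_value=4):
    """Rough size estimate in GB (1 GB = 1e9 bytes).

    n_params:  total parameters stored as weights
    n_trained: parameters that carry Adam state (assumption: m and v, one value each)
    """
    weights = n_params * bytes_per_value / 1e9
    adam = 2 * n_trained * bytes_per_value / 1e9
    return weights, adam, weights + adam


# Assumption for illustration: every one of the 518M parameters is trained.
weights, adam, total = checkpoint_size_gb(518_000_000, 518_000_000)
print(f"weights ~{weights:.2f} GB, Adam buffers ~{adam:.2f} GB, total ~{total:.2f} GB")
```

Under these assumptions the weights come to about 2.07 GB, matching the figure above, while the total comes to about 6.2 GB. That is somewhat more than the 5.3 GB file, which would be consistent with only part of the model carrying optimizer state; this is an inference from the arithmetic, not something confirmed by the checkpoint itself.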
Files
| File | Size | Description |
|---|---|---|
| `best_model.pth` | 5.3 GB | Full training checkpoint (weights + optimizer state) |
| `config.json` | 6 KB | Training config (required for inference and resuming training) |
Additional files required for inference (not stored here; the TTS library downloads them automatically on first use):
| File | Source |
|---|---|
| `vocab.json` | Downloaded from Coqui's servers automatically |
| `dvae.pth` | Downloaded from Coqui's servers automatically |
| `mel_stats.pth` | Downloaded from Coqui's servers automatically |
Inference
Install
pip install "TTS>=0.22.0" "torch>=2.1" "torchaudio>=2.1"
Basic inference (voice cloning)
from TTS.api import TTS

tts = TTS(model_path="best_model.pth", config_path="config.json").to("cuda")
tts.tts_to_file(
    text="The meeting is on 23rd April at 4:30 PM IST. Dr. Narayanan from Chennai will present.",
    speaker_wav="reference.wav",  # 5-10 sec clean WAV of the target speaker at 24 kHz
    language="en",
    file_path="output.wav",
)
Inference with the XTTS model directly
import soundfile as sf  # not covered by the install line above: pip install soundfile
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config and model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="best_model.pth", eval=True)
model.cuda()

# Get speaker conditioning from reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# Synthesize
out = model.inference(
    text="Dr. Narayanan Subramanian from Chennai will meet Aravind Sridhar tomorrow.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
    repetition_penalty=2.5,
    top_k=50,
    top_p=0.85,
)

# Write the waveform at the model's 24 kHz output rate
sf.write("output.wav", out["wav"], 24000)
Reference audio requirements
- Format: WAV, mono
- Sample rate: 24 kHz (22050 Hz also works; the TTS library resamples)
- Duration: 5-10 seconds (longer does not help)
- Quality: Clean, no background noise, no music
- Same speaker as your target voice
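The measurable requirements above (mono, sample rate, duration) can be checked before running inference. The following is a minimal sketch using only the standard-library `wave` module; `check_reference_wav` is a hypothetical helper, not part of the TTS library, and it only reads uncompressed PCM WAV files. It cannot judge noise or speaker identity.

```python
import wave


def check_reference_wav(path, allowed_rates=(24000, 22050), min_sec=5.0, max_sec=10.0):
    """Return a list of problems found; an empty list means the clip passes these checks."""
    problems = []
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if rate not in allowed_rates:
        problems.append(f"unexpected sample rate {rate} Hz")
    if not (min_sec <= duration <= max_sec):
        problems.append(f"duration {duration:.1f} s is outside {min_sec}-{max_sec} s")
    return problems
```

Calling `check_reference_wav("reference.wav")` returns an empty list when the clip passes, or human-readable messages describing what to fix.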
What it handles well
- Indian names: Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar
- Indian cities: Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
- Indian number system: `Rs.7,25,000` → "seven lakh twenty five thousand rupees", crore
- Tech acronyms: DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
- Natural Indian-English prosody: the accent and rhythm of Indian English speech
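For readers unfamiliar with the lakh/crore convention mentioned in the list, the sketch below shows how Indian digit grouping works: the last three digits form one group, and everything before them is grouped in pairs. It only illustrates the written convention and is not how the model processes text; both function names are made up for this example.

```python
def indian_group(n):
    """Format a non-negative integer with Indian digit grouping (3 digits, then pairs)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


def lakh_crore_parts(n):
    """Split n into (crore, lakh, thousand, remainder) counts."""
    crore, rest = divmod(n, 10_000_000)   # 1 crore = 1,00,00,000
    lakh, rest = divmod(rest, 100_000)    # 1 lakh  = 1,00,000
    thousand, rest = divmod(rest, 1_000)
    return crore, lakh, thousand, rest


print(indian_group(725000))      # 7,25,000
print(lakh_crore_parts(725000))  # (0, 7, 25, 0) -> seven lakh twenty five thousand
```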
Resuming Training / Fine-Tuning Further
The checkpoint includes full optimizer state, so training can be resumed exactly where it left off.
Dataset format
Pipe-delimited CSV, no header, 2 columns:
clip_0001|The total cost is Rs.7,25,000 for the Bengaluru office setup.
clip_0002|Dr. Narayanan will present the quarterly results tomorrow at 4 PM IST.
clip_0003|Please forward the report to admin at example dot com by Friday.
Audio files: wavs/clip_0001.wav, wavs/clip_0002.wav, etc.
- Format: mono WAV, 22050 Hz, 1-24 seconds per clip
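Before training, it can help to confirm that the metadata file really has two pipe-separated columns and to see which WAV path each row maps to. A minimal standard-library sketch follows; `load_metadata` and the file name `metadata.csv` are assumptions for illustration, and the `wavs/` layout is taken from the description above.

```python
import csv
from pathlib import Path


def load_metadata(meta_path, wav_dir="wavs"):
    """Parse a 2-column `id|text` file and pair each row with its expected WAV path."""
    meta_path = Path(meta_path)
    rows = []
    with meta_path.open(encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps any quote characters in the transcript as literal text.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_no, cols in enumerate(reader, start=1):
            if not cols:
                continue  # skip blank lines
            if len(cols) != 2:
                raise ValueError(f"line {line_no}: expected 2 columns, got {len(cols)}")
            clip_id, text = cols[0].strip(), cols[1].strip()
            rows.append((meta_path.parent / wav_dir / f"{clip_id}.wav", text))
    return rows
```

For example, `load_metadata("dataset/metadata.csv")` would map the row `clip_0001|...` to `dataset/wavs/clip_0001.wav`; a row containing an extra `|` raises a `ValueError` naming the offending line.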
Resume training from this checkpoint
# In your train script, point the base model at this checkpoint:
XTTS_CHECKPOINT = "best_model.pth" # this file
XTTS_CONFIG = "config.json" # this file
# Then run the standard XTTS v2 training recipe:
# TTS/recipes/ljspeech/xtts_v2/train_gpt_xtts.py
# with formatter="thorsten" for 2-column id|text metadata
Key config values used during training (from config.json):
| Parameter | Value |
|---|---|
| `lr` | 5e-6 |
| `batch_size` | 2 (per GPU, on 2 GPUs) |
| `eval_split_size` | 0.1 (10% held out) |
| `eval_split_max_size` | 80 clips |
| `save_step` | 500 |
| `lr_scheduler` | MultiStepLR |
| `lr_scheduler` milestones | [3500, 7000, 10000] |
| `lr_scheduler` gamma | 0.5 (halved at each milestone) |
| `gpt_cond_len` | 12 seconds |
| `gpt_cond_chunk_len` | 4 seconds |
| `gpt_max_audio_tokens` | 605 |
| `gpt_max_text_tokens` | 402 |
| `distributed_backend` | nccl (multi-GPU DDP) |
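To see what the scheduler settings in the table imply, the sketch below reproduces the MultiStepLR rule (base rate multiplied by gamma once for every milestone reached) in plain Python. It assumes the milestones are counted in optimizer steps rather than epochs, which depends on the trainer settings; `lr_at` is a hypothetical helper, not a TTS or PyTorch function.

```python
from bisect import bisect_right


def lr_at(step, base_lr=5e-6, milestones=(3500, 7000, 10000), gamma=0.5):
    """Learning rate under a MultiStepLR-style schedule at the given step."""
    # bisect_right counts how many milestones are <= step.
    return base_lr * gamma ** bisect_right(milestones, step)


for step in (0, 3500, 7000, 10000, 11074):
    print(step, lr_at(step))
```

Under that assumption, the rate has been halved three times by the best step (11,074), giving 5e-6 × 0.125 = 6.25e-7.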
Tips for further fine-tuning
- Lower the LR when resuming: the model is already converged, so use `lr: 1e-6` or `2e-6`
- More data is better: 1000+ clips is the sweet spot, with diminishing returns above 5000
- Watch eval mel loss: it should stay below 2.7; if it climbs, you're overfitting
- Reference audio matters more than the model: use a very clean 8-second clip for best voice cloning results
- OOM during training: reduce `max_wav_len` to `220500` (10 sec at 22050 Hz) or lower `batch_size` to 1
Training History
| Run | Steps | Best eval loss | Notes |
|---|---|---|---|
| Run 1 (Mar 17) | ~1,650 | 2.735 | Initial run, still converging |
| Run 2 (Mar 18) | ~5,085 | 2.794 | Different config, higher loss |
| Run 3 (Mar 18) | 11,074 | 2.697 | This checkpoint, best overall |
| Run 4 (Mar 21) | ~6,885 | 2.829 | Interrupted, worse result |
Run 3 achieved the lowest eval loss across all experiments and is the checkpoint published here.
Credits
- Coqui TTS / XTTS v2: base model and training framework (the Coqui TTS library code is released under MPL 2.0, not MIT; the XTTS v2 weights are under CPML, see below)
- Thorsten Müller: dataset format convention (`id|text`, two-column pipe-delimited)
License
The fine-tuned weights follow the Coqui Public Model License (CPML) of the base XTTS v2 model.