XTTS v2: Indian English Fine-Tune

Recipe: xttsv2-recipe

A fine-tuned version of Coqui XTTS v2 adapted for Indian-accented English speech synthesis.

XTTS v2 is a 518M-parameter GPT-based multilingual TTS model with zero-shot voice cloning. This fine-tune improves naturalness, prosody, and pronunciation for Indian-English speakers and vocabulary: Indian names, Indian city names, the lakh/crore number system, and tech acronyms.


Checkpoint Info

| Detail | Value |
| --- | --- |
| Best step | 11,074 |
| Total steps trained | 11,250 |
| Best eval loss | 2.697 (down from 3.766 at start) |
| Eval mel loss | 2.670 (−1.065) |
| Eval text loss | 0.027 (−0.003) |

The checkpoint (best_model.pth) is a full training checkpoint: it contains both the model weights and the optimizer state, so you can run inference from it or resume training.

  • Model weights only: ~2 GB
  • With optimizer state (Adam m + v buffers): 5.3 GB total
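
The sizes above can be sanity-checked with simple arithmetic on parameter counts. The sketch below assumes 32-bit floats throughout and uses the 518M parameter figure quoted earlier; the helper name `checkpoint_size_gb` and the split between total and optimized parameters are illustrative assumptions, not values read from the checkpoint.

```python
BYTES_PER_FP32 = 4  # assumption: weights and Adam buffers are stored as float32


def checkpoint_size_gb(total_params: int, optimized_params: int) -> float:
    """Estimate checkpoint size in GB (10^9 bytes).

    Adam keeps two buffers (m and v) for every parameter it optimizes.
    """
    weights = total_params * BYTES_PER_FP32
    adam_buffers = 2 * optimized_params * BYTES_PER_FP32
    return (weights + adam_buffers) / 1e9


TOTAL_PARAMS = 518_000_000  # figure quoted in this model card

weights_only = checkpoint_size_gb(TOTAL_PARAMS, 0)
all_optimized = checkpoint_size_gb(TOTAL_PARAMS, TOTAL_PARAMS)

print(f"weights only:              {weights_only:.2f} GB")
print(f"every parameter optimized: {all_optimized:.2f} GB")
```

The weights-only estimate agrees with the ~2 GB figure. The published 5.3 GB file is smaller than the estimate with every parameter optimized, which would be consistent with part of the model being frozen during training, though that is an inference and not something documented here.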

Files

| File | Size | Description |
| --- | --- | --- |
| best_model.pth | 5.3 GB | Full training checkpoint (weights + optimizer state) |
| config.json | 6 KB | Training config (required for inference and resuming training) |

Additional files required for inference (not stored here; the TTS library downloads them automatically on first use):

| File | Source |
| --- | --- |
| vocab.json | Downloaded from Coqui's servers automatically |
| dvae.pth | Downloaded from Coqui's servers automatically |
| mel_stats.pth | Downloaded from Coqui's servers automatically |

Inference

Install

# Quote the version specifiers so the shell does not treat ">" as output redirection
pip install "TTS>=0.22.0" "torch>=2.1" "torchaudio>=2.1"

Basic inference (voice cloning)

from TTS.api import TTS

tts = TTS(model_path="best_model.pth", config_path="config.json").to("cuda")

tts.tts_to_file(
    text="The meeting is on 23rd April at 4:30 PM IST. Dr. Narayanan from Chennai will present.",
    speaker_wav="reference.wav",   # 5-10 sec clean WAV of the target speaker at 24 kHz
    language="en",
    file_path="output.wav"
)

Inference with the XTTS model directly

import soundfile as sf
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config and model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="best_model.pth", eval=True)
model.cuda()

# Get speaker conditioning from reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# Synthesize
out = model.inference(
    text="Dr. Narayanan Subramanian from Chennai will meet Aravind Sridhar tomorrow.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
    repetition_penalty=2.5,
    top_k=50,
    top_p=0.85,
)

# Write the waveform to disk at 24 kHz
sf.write("output.wav", out["wav"], 24000)

Reference audio requirements

  • Format: WAV, mono
  • Sample rate: 24 kHz (22,050 Hz also works; the TTS library resamples)
  • Duration: 5–10 seconds (longer does not help)
  • Quality: Clean, no background noise, no music
  • Same speaker as your target voice
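
The measurable requirements in this list (channel count, sample rate, duration) can be checked before synthesis. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files; the function name `check_reference_wav` and its default thresholds are assumptions taken from the list above, and it cannot judge noise or speaker identity.

```python
import wave


def check_reference_wav(path, min_sec=5.0, max_sec=10.0,
                        accepted_rates=(24000, 22050)):
    """Return a list of problems found; an empty list means the clip looks fine."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate not in accepted_rates:
        problems.append(f"unexpected sample rate {rate} Hz")
    if not (min_sec <= duration <= max_sec):
        problems.append(
            f"duration {duration:.1f} s is outside {min_sec}-{max_sec} s"
        )
    return problems
```

A call such as `check_reference_wav("reference.wav")` returns an empty list when the clip meets all three checks.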

What it handles well

  • Indian names: Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar
  • Indian cities: Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
  • Indian number system: Rs.7,25,000 is read as "seven lakh twenty five thousand rupees"; crore is handled as well
  • Tech acronyms: DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
  • Natural Indian-English prosody: the accent and rhythm of Indian English speech
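
For readers unfamiliar with the lakh/crore system, the sketch below shows how a number grouped in the Indian style maps to the words the model is expected to speak. It is an illustration only, handling non-negative integers; the function names are hypothetical and this is not the text front end used by XTTS or by this fine-tune.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Words for 0..99 (empty string for 0)."""
    if n < 20:
        return ONES[n]
    return (TENS[n // 10] + " " + ONES[n % 10]).strip()


def three_digits(n):
    """Words for 0..999 (empty string for 0)."""
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
    if n % 100:
        parts.append(two_digits(n % 100))
    return " ".join(parts)


def indian_number_to_words(n):
    """Spell a non-negative integer using crore / lakh / thousand groups."""
    if n == 0:
        return "zero"
    crore, rest = divmod(n, 10_000_000)   # 1 crore = 1,00,00,000
    lakh, rest = divmod(rest, 100_000)    # 1 lakh  = 1,00,000
    thousand, rest = divmod(rest, 1_000)
    parts = []
    if crore:
        parts.append(indian_number_to_words(crore) + " crore")
    if lakh:
        parts.append(two_digits(lakh) + " lakh")
    if thousand:
        parts.append(two_digits(thousand) + " thousand")
    if rest:
        parts.append(three_digits(rest))
    return " ".join(parts)


def parse_indian_grouped(text):
    """Turn a string such as '7,25,000' into an int by dropping the commas."""
    return int(text.replace(",", ""))


print(indian_number_to_words(parse_indian_grouped("7,25,000")))
# seven lakh twenty five thousand
```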

Resuming Training / Fine-Tuning Further

The checkpoint includes full optimizer state, so training can be resumed exactly where it left off.

Dataset format

Pipe-delimited CSV, no header, 2 columns:

clip_0001|The total cost is Rs.7,25,000 for the Bengaluru office setup.
clip_0002|Dr. Narayanan will present the quarterly results tomorrow at 4 PM IST.
clip_0003|Please forward the report to admin at example dot com by Friday.

Audio files: wavs/clip_0001.wav, wavs/clip_0002.wav, etc.

  • Format: mono WAV, 22050 Hz, 1–24 seconds per clip
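
Before training, it can help to confirm that every metadata row has exactly two columns and a matching audio file. The following is a minimal standard-library sketch; the function name `validate_metadata` and the file name `metadata.csv` in the usage note are assumptions for illustration, while the `wavs/<id>.wav` layout follows the description above.

```python
import os


def validate_metadata(metadata_path, wav_dir):
    """Check a 2-column id|text file.

    Returns (entries, problems): entries is a list of (clip_id, text) tuples
    for well-formed rows, problems is a list of human-readable messages.
    """
    entries, problems = [], []
    with open(metadata_path, encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            line = raw.rstrip("\n")
            if not line.strip():
                continue  # ignore blank lines
            cols = line.split("|")
            if len(cols) != 2 or not cols[0].strip() or not cols[1].strip():
                problems.append(f"line {line_no}: expected 'id|text', got {line!r}")
                continue
            clip_id, text = cols[0].strip(), cols[1].strip()
            wav_path = os.path.join(wav_dir, clip_id + ".wav")
            if not os.path.isfile(wav_path):
                problems.append(f"line {line_no}: missing audio file {wav_path}")
            entries.append((clip_id, text))
    return entries, problems
```

For example, `entries, problems = validate_metadata("metadata.csv", "wavs")` and then printing `problems` gives a quick report; text containing a literal `|` would be flagged as malformed.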

Resume training from this checkpoint

# In your train script, point the base model at this checkpoint:
XTTS_CHECKPOINT = "best_model.pth"     # this file
XTTS_CONFIG     = "config.json"        # this file

# Then run the standard XTTS v2 training recipe:
# TTS/recipes/ljspeech/xtts_v2/train_gpt_xtts.py
# with formatter="thorsten" for 2-column id|text metadata

Key config values used during training (from config.json):

| Parameter | Value |
| --- | --- |
| lr | 5e-6 |
| batch_size | 2 (per GPU, 2 GPUs) |
| eval_split_size | 0.1 (10% held out) |
| eval_split_max_size | 80 clips |
| save_step | 500 |
| lr_scheduler | MultiStepLR |
| lr_scheduler milestones | [3500, 7000, 10000] |
| lr_scheduler gamma | 0.5 (halved at each milestone) |
| gpt_cond_len | 12 seconds |
| gpt_cond_chunk_len | 4 seconds |
| gpt_max_audio_tokens | 605 |
| gpt_max_text_tokens | 402 |
| distributed_backend | nccl (multi-GPU DDP) |
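
The learning-rate values above imply a simple step function. The sketch below reproduces the closed form used by PyTorch's `MultiStepLR` in plain Python, assuming the scheduler is advanced once per training step (the milestones are interpreted as step counts, which is an assumption); `lr_at_step` is a hypothetical helper name.

```python
from bisect import bisect_right

BASE_LR = 5e-6
MILESTONES = [3500, 7000, 10000]
GAMMA = 0.5


def lr_at_step(step):
    """Learning rate after `step` scheduler steps.

    The rate is multiplied by GAMMA once for every milestone that is
    less than or equal to `step`.
    """
    return BASE_LR * GAMMA ** bisect_right(MILESTONES, step)


for step in (0, 3500, 7000, 10000, 11074):
    print(f"step {step:>6}: lr = {lr_at_step(step):.3e}")
```

Under these assumptions the learning rate at the best step (11,074) is 6.25e-7, one eighth of the starting value.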

Tips for further fine-tuning

  • Lower the LR when resuming: the model has already converged, so use lr: 1e-6 or 2e-6
  • More data is better: 1000+ clips is the sweet spot, with diminishing returns above 5000
  • Watch eval mel loss: it should stay below 2.7; if it climbs, you're overfitting
  • Reference audio matters more than the model: use a very clean 8-second clip for best voice cloning results
  • OOM during training: reduce max_wav_len to 220500 (10 sec) or lower batch_size to 1
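
The max_wav_len value in the last tip is a sample count, not a duration. A small sketch of the conversion, assuming the 22,050 Hz training sample rate stated in the dataset format section (the helper names are illustrative):

```python
TRAIN_SAMPLE_RATE = 22050  # Hz, from the dataset format described above


def seconds_to_samples(seconds, sample_rate=TRAIN_SAMPLE_RATE):
    """Convert a clip length in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))


def samples_to_seconds(samples, sample_rate=TRAIN_SAMPLE_RATE):
    """Convert a sample count back to seconds."""
    return samples / sample_rate


print(seconds_to_samples(10))      # 220500
print(samples_to_seconds(220500))  # 10.0
```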

Training History

| Run | Steps | Best eval loss | Notes |
| --- | --- | --- | --- |
| Run 1 (Mar 17) | ~1,650 | 2.735 | Initial run, still converging |
| Run 2 (Mar 18) | ~5,085 | 2.794 | Different config, higher loss |
| Run 3 (Mar 18) | 11,074 | 2.697 | This checkpoint, best overall |
| Run 4 (Mar 21) | ~6,885 | 2.829 | Interrupted, worse result |

Run 3 achieved the lowest eval loss across all experiments and is the checkpoint published here.


Credits


License

The fine-tuned weights follow the Coqui Public Model License (CPML) of the base XTTS v2 model.
