Mongolian VITS — My-Voice Fine-Tune

Speaker-adapted fine-tune of Bokhbat/mongolian-vits-tts: the multi-speaker Mongolian VITS model with one new voice (speaker01) added, without degrading the original Mongolian ability.

  • Base: multi-speaker VITS, 78 Mongolian speakers
  • This model: 79 speakers = original 78 (ids 0–77, unchanged) + speaker01 (id 78, the new voice)
  • Adaptation data: ~3.7 min (57 clips), single speaker
  • Best checkpoint: epoch 93 / step 609 (eval-loss best, early-stopped at plateau)
  • Sample rate: 22050 Hz

How Mongolian ability was protected (Strategy A)

  • Original 78 speaker ids preserved; new voice appended as id 78 (so the speaker embedding table was expanded, not overwritten).
  • text_encoder (phonetics/text) and duration_predictor (rhythm/prosody) were frozen, so the model's learned Mongolian pronunciation and timing cannot drift on the small dataset.
  • Low learning rate of 2e-5 (the base model used 2e-4), plus best-model selection on eval loss.
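
The three measures above can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the training script used for this model: ToyVits is a hypothetical stand-in for the real network, the attribute name emb_g for the speaker table, the mean-voice initialisation of the new row, and the AdamW optimizer are all assumptions. Only the frozen module names, the 78 to 79 expansion, and the 2e-5 learning rate come from the description above.

```python
import torch
from torch import nn


class ToyVits(nn.Module):
    """Hypothetical stand-in holding only the parts discussed here (not real VITS)."""

    def __init__(self, num_speakers, dim=8):
        super().__init__()
        self.text_encoder = nn.Linear(dim, dim)       # placeholder module
        self.duration_predictor = nn.Linear(dim, 1)   # placeholder module
        self.emb_g = nn.Embedding(num_speakers, dim)  # speaker embedding table


def add_speaker(model):
    """Append one row to the speaker table and return the new speaker id."""
    old = model.emb_g
    new = nn.Embedding(old.num_embeddings + 1, old.embedding_dim)
    with torch.no_grad():
        # Copy the existing rows so the original ids keep their voices.
        new.weight[: old.num_embeddings] = old.weight
        # Start the new row at the mean voice (an arbitrary choice for this sketch).
        new.weight[-1] = old.weight.mean(dim=0)
    model.emb_g = new
    return old.num_embeddings


def freeze(model, names):
    """Disable gradients for the named submodules."""
    for name in names:
        for p in getattr(model, name).parameters():
            p.requires_grad_(False)


model = ToyVits(num_speakers=78)
before = model.emb_g.weight.detach().clone()

new_id = add_speaker(model)  # ids 0..77 stay, the new voice gets id 78
freeze(model, ("text_encoder", "duration_predictor"))

# Only parameters that still require gradients go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```

Note that this sketch only copies the original 78 embedding rows; it does not freeze them, so they could still move during training unless their gradients are masked separately.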

The original 78 voices still synthesize natural Mongolian; speaker01 is the newly learned voice. Note: 3.7 min is very little data, so speaker01 is recognizable but rough; more data would sharpen it.

Files

File                      Description
best_model.pth            Fine-tuned VITS checkpoint (79 speakers)
config.json               Coqui TTS config
speakers.pth              79-speaker name→id map (speaker01 = 78)
tensorboard/              Fine-tune training curves
ft_yourvoice_spk01.wav    Sample: new voice (speaker01)
ft_original_spk0053.wav   Sample: an original voice (spk_0053), Mongolian-ability check

Usage

from huggingface_hub import hf_hub_download
from TTS.utils.synthesizer import Synthesizer

repo = "Bokhbat/mongolian-vits-myvoice"
ckpt  = hf_hub_download(repo, "best_model.pth")
cfg   = hf_hub_download(repo, "config.json")
spk   = hf_hub_download(repo, "speakers.pth")

# CPU inference; pass use_cuda=True if a GPU is available
syn = Synthesizer(tts_checkpoint=ckpt, tts_config_path=cfg, tts_speakers_file=spk, use_cuda=False)
# the new voice:
wav = syn.tts("Сайн байна уу?", speaker_name="speaker01")
syn.save_wav(wav, "myvoice.wav")
# an original Mongolian voice still works:
wav = syn.tts("Сайн байна уу?", speaker_name="spk_0053")
syn.save_wav(wav, "original.wav")
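
To confirm that a generated file matches the 22050 Hz sample rate listed above, a small standard-library check can help. This is a sketch that assumes save_wav wrote an ordinary PCM WAV file; wav_info and check_sample_rate are hypothetical helper names introduced here, not part of Coqui TTS.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        return rate, f.getnchannels(), f.getnframes() / rate


def check_sample_rate(path, expected_hz=22050):
    """Raise ValueError if the file's sample rate differs from expected_hz."""
    rate, _, _ = wav_info(path)
    if rate != expected_hz:
        raise ValueError(f"{path}: expected {expected_hz} Hz, got {rate} Hz")
    return True
```

After running the usage snippet, check_sample_rate("myvoice.wav") should return True if the file was written at the model's sample rate.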