Nepali Voices v0 (Piper-VITS, 419 speakers)

A multi-speaker Nepali text-to-speech model in the Piper / piper-plus format. It provides 419 speaker embeddings, was trained on ~22 hours of audio, and uses a custom 65-phone Nepali inventory (NOT eSpeak phonemes).

At a glance

| Field | Value |
|---|---|
| Architecture | VITS (multi-speaker), 77.5 M parameters |
| Phoneme inventory | Project-internal 65-phone Nepali (Khatiwada 2009) |
| Speaker count | 419 |
| Sample rate | 22 050 Hz |
| Audio quality | 22 kHz medium |
| Base | ayousanz/piper-plus-base (multilingual, 6 languages) |
| Fine-tune steps | ~130 800 (v2 600 epochs + v3b 200 epochs) |
| License | CC-BY-SA-4.0 (forced by training-data licenses) |

Recommended speakers for production inference

| speaker_id | label | training utterances | notes |
|---|---|---|---|
| 399 | slr143_F | 554 | Cleanest studio female. Default recommendation. |
| 403 | slr43_0546 | 505 | Alternate clean female (different timbre). |
| 406 | slr43_2099 | 275 | Alternate clean female. |
| 400 | slr143_M | 108 | Male reference. Smaller training set, voice less stable. |
| 398 | algenib | 1984 | Synthetic teacher (Gemini-Flash). Under-trained at this checkpoint. |

For other speaker IDs (crowdsourced IndicVoices-R voices, additional SLR43 voices), see dataset.jsonl for the full mapping. Quality varies; the IDs listed above are the curated production set.
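The lookup in dataset.jsonl can be sketched as below. This is a minimal sketch, not the project's tooling: it assumes each line is a JSON object carrying `speaker_id` and `speaker` fields (as in upstream Piper's preprocessing output), which is not confirmed for this file. The helper name `speaker_table` is hypothetical.

```python
import json
from collections import Counter


def speaker_table(lines):
    """Map speaker_id -> (label, utterance count) from JSONL lines."""
    labels = {}
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        sid = row["speaker_id"]                           # assumed field name
        labels.setdefault(sid, row.get("speaker", "?"))   # assumed field name
        counts[sid] += 1
    return {sid: (labels[sid], counts[sid]) for sid in sorted(counts)}


# Tiny in-memory sample standing in for dataset.jsonl
sample = [
    '{"speaker_id": 399, "speaker": "slr143_F"}',
    '{"speaker_id": 399, "speaker": "slr143_F"}',
    '{"speaker_id": 400, "speaker": "slr143_M"}',
]
table = speaker_table(sample)
```

With the real file, pass an open handle instead of the sample list, e.g. `speaker_table(open("dataset.jsonl", encoding="utf-8"))`, and adjust the field names if the schema differs.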

Quick start

# 1. Install piper-plus (the trainer/inference fork we used)
# 2. Install our G2P frontend (the phoneme producer; required, because eSpeak is NOT compatible)
import json, torch
from piper_train.vits import VitsModel
from nepali_frontend.g2p import phonemizer as ph

model = VitsModel.load_from_checkpoint("model.ckpt", dataset=None).cuda().eval()
with open("config.json", encoding="utf-8") as f:  # close the file handle after reading
    config = json.load(f)
PIM = config["phoneme_id_map"]

def to_ids(sentence: str) -> list[int]:
    out = [1]  # BOS
    for w in ph.phonemize_text(sentence):
        for p in w.phones:
            if p == "|":
                continue
            out.extend(PIM.get(p, []))
    out.append(2)  # EOS
    return out

text = "नेपाल हाम्रो देश हो।"
ids = torch.LongTensor(to_ids(text)).unsqueeze(0).cuda()
text_lengths = torch.LongTensor([ids.size(1)]).cuda()
sid = torch.LongTensor([399]).cuda()  # slr143_F
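# scales = [noise_scale, length_scale, noise_w]
with torch.no_grad():  # inference only; without this, .numpy() fails on a tensor that requires grad
    audio = model(ids, text_lengths, scales=[0.667, 1.0, 0.0], sid=sid).cpu().numpy()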
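To listen to the result, the waveform has to be written to a file. The sketch below uses only the Python standard library and assumes the model output is a mono float waveform roughly in [-1, 1] at the 22 050 Hz sample rate listed above. Values outside that range are clipped. The helper names `float_to_pcm16` and `write_wav` are hypothetical and not part of piper-plus.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and scale them to 16-bit signed integers."""
    return [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]


def write_wav(path, samples, sample_rate=22050):
    """Write a mono 16-bit PCM WAV file from a flat sequence of floats."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))  # little-endian int16
```

Assuming `audio` is the NumPy array from the quick start, a call would look like `write_wav("out.wav", audio.squeeze().tolist())`.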

Training data

| source | hours | utterances | speakers | license |
|---|---|---|---|---|
| AI4Bharat IndicVoices-R Nepali | 13.74 | 5598 | 401 | CC-BY-4.0 |
| OpenSLR SLR143 (M+F TTS) | 1.24 | 662 | 2 | CC-BY-SA-4.0 |
| OpenSLR SLR43 (multi-speaker female TTS) | 2.80 | 2064 | 18 | CC-BY-SA-4.0 |
| Gemini-Flash Algenib (synthetic teacher) | 4.47 | 1984 | 1 | Synthetic, public-release consent |
| Total | ~22 | ~10 200 | 419 | CC-BY-SA-4.0 (most restrictive) |

What this model does well

  • Renders common Nepali phonotactic patterns cleanly across the production speakers.
  • Distinguishes Nepali-specific contrasts: aspiration, retroflex/dental, oral/nasal vowels, gemination.
  • Produces natural-sounding prosody on Wikipedia-style and conversational sentences.

Known limitations

  • Rare phoneme contexts (e.g. ts i n, p ax s . ts i m, and final r after h ax) are underlearned: the model fumbles certain words such as चीन, पश्चिम, and सहर. These contexts appear only ~120-170 times in the training data, which is in the marginal zone for VITS articulation learning.
  • /ts/ vs /tʃ/ for च: this model follows Khatiwada 2009 (/ts/). Native speakers may perceive Devanagari च as the more familiar /tʃ/ ("ch") sound; this is a transcription-policy decision baked into the phoneme inventory, not a model defect.
  • No phonemic vowel length: ि and ी both map to i per Khatiwada policy.
  • English / mixed-script input is not supported. The G2P drops Latin-script tokens silently.
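Because Latin-script tokens are dropped without warning, it can help to check input before synthesis. The following is a minimal sketch of such a pre-check. The function `find_latin_tokens` and its token pattern are hypothetical and not part of the G2P frontend; they only flag ASCII-letter words so the caller can decide whether to transliterate, reject, or log them.

```python
import re

# A token that starts with an ASCII letter, optionally followed by
# letters, digits, apostrophes, or hyphens (e.g. "Wi-Fi").
LATIN_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9'\-]*")


def find_latin_tokens(text):
    """Return the Latin-script tokens in text, in order of appearance."""
    return LATIN_TOKEN.findall(text)


mixed = "मेरो laptop र Wi-Fi"
dropped = find_latin_tokens(mixed)
if dropped:
    print("Latin-script tokens that the G2P would skip:", dropped)
```

A pure-Devanagari sentence such as the quick-start example yields an empty list, so the check can gate calls to the synthesizer.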

License

The model is released under CC-BY-SA-4.0 (Attribution-ShareAlike 4.0 International), the most restrictive license among the training datasets. If you redistribute or build on this model, you must give attribution and release your work under the same or a compatible ShareAlike license.

Citation

@misc{nepali_voices_v0_2026,
  title  = {Nepali Voices v0: Multi-speaker Piper-VITS for Nepali},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/nepali-voices-v0}
}
