Nepali Voices v0 (Piper-VITS, 419 speakers)

A multi-speaker Nepali text-to-speech model in the Piper / piper-plus format. It provides 419 speaker embeddings, was trained on ~22 hours of audio, and uses a custom 65-phone Nepali inventory (NOT eSpeak's).

At a glance


Architecture	VITS (multi-speaker), 77.5 M parameters
Phoneme inventory	Project-internal 65-phone Nepali (Khatiwada 2009)
Speaker count	419
Sample rate	22 050 Hz
Audio quality	22 kHz medium
Base	`ayousanz/piper-plus-base` (multilingual, 6 languages)
Fine-tune steps	~130 800 (v2 600 epochs + v3b 200 epochs)
License	CC-BY-SA-4.0 (forced by training-data licenses)
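
The figures in this table can be checked against the shipped config.json before loading the checkpoint. The following is a minimal sketch, assuming the file follows the usual Piper layout (`audio.sample_rate`, `num_speakers`, `phoneme_id_map`). Only `phoneme_id_map` is confirmed by the quick start; the other key names are assumptions, and `check_config` is a hypothetical helper.

```python
EXPECTED_SAMPLE_RATE = 22050  # "Sample rate" row above
EXPECTED_SPEAKERS = 419       # "Speaker count" row above


def check_config(config: dict) -> list[str]:
    """Return human-readable mismatches; an empty list means all checks passed."""
    problems = []

    # Assumed Piper-style key: config["audio"]["sample_rate"]
    sample_rate = config.get("audio", {}).get("sample_rate")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"sample_rate is {sample_rate}, expected {EXPECTED_SAMPLE_RATE}")

    # Assumed Piper-style key: config["num_speakers"]
    num_speakers = config.get("num_speakers")
    if num_speakers != EXPECTED_SPEAKERS:
        problems.append(f"num_speakers is {num_speakers}, expected {EXPECTED_SPEAKERS}")

    # The quick start reads this key, so it must be present and non-empty.
    if not config.get("phoneme_id_map"):
        problems.append("phoneme_id_map is missing or empty")

    return problems


# In-memory stand-in for json.load(open("config.json")); values are illustrative.
sample = {"audio": {"sample_rate": 22050}, "num_speakers": 419, "phoneme_id_map": {"a": [5]}}
print(check_config(sample))  # prints []
```

The phone count is deliberately not checked, since the ID map may also contain padding, boundary, and punctuation symbols beyond the 65 phones.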

Recommended speakers for production inference

speaker_id	label	training utterances	notes
399	`slr143_F`	554	Cleanest studio female. Default recommendation.
403	`slr43_0546`	505	Alternate clean female (different timbre).
406	`slr43_2099`	275	Alternate clean female.
400	`slr143_M`	108	Male reference. Smaller training set, voice less stable.
398	`algenib`	1984	Synthetic teacher (Gemini-Flash). Under-trained at this checkpoint.

For other speaker IDs (IndicVoices-R crowdsourced speakers, additional SLR43 voices), see `dataset.jsonl` for the full mapping. Quality varies across those speakers; the IDs in the table above are the curated production set.
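
To pick a voice by label rather than by number, the mapping can be rebuilt from dataset.jsonl. The sketch below assumes each line is a JSON object carrying `speaker` and `speaker_id` fields, as in Piper's preprocessing output. Those field names are an assumption and should be checked against the actual file; `speaker_table` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable


def speaker_table(lines: Iterable[str]) -> dict[int, tuple[str, int]]:
    """Map speaker_id -> (label, utterance count) from JSONL lines."""
    labels: dict[int, str] = {}
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        row = json.loads(line)
        sid = int(row["speaker_id"])      # assumed field name
        labels[sid] = str(row["speaker"])  # assumed field name
        counts[sid] += 1
    return {sid: (labels[sid], counts[sid]) for sid in sorted(labels)}


# Tiny in-memory stand-in for dataset.jsonl: made-up rows, labels taken from the table above.
demo = [
    '{"speaker": "slr143_F", "speaker_id": 399, "text": "..."}',
    '{"speaker": "slr143_F", "speaker_id": 399, "text": "..."}',
    '{"speaker": "slr143_M", "speaker_id": 400, "text": "..."}',
]
table = speaker_table(demo)
print(table)  # prints {399: ('slr143_F', 2), 400: ('slr143_M', 1)}
```

With the real file, pass the open file object instead of `demo`, for example `with open("dataset.jsonl", encoding="utf-8") as f: table = speaker_table(f)`.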

Quick start

# 1. Install piper-plus (the trainer/inference fork we used).
# 2. Install our G2P frontend (the phoneme producer; required, because eSpeak is NOT compatible).
import json

import torch
from piper_train.vits import VitsModel
from nepali_frontend.g2p import phonemizer as ph

model = VitsModel.load_from_checkpoint("model.ckpt", dataset=None).cuda().eval()
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)
PIM = config["phoneme_id_map"]  # phone symbol -> list of integer IDs

def to_ids(sentence: str) -> list[int]:
    out = [1]  # BOS
    for w in ph.phonemize_text(sentence):
        for p in w.phones:
            if p == "|":  # "|" markers are not synthesized
                continue
            out.extend(PIM.get(p, []))  # phones missing from the map are dropped silently
    out.append(2)  # EOS
    return out

text = "नेपाल हाम्रो देश हो।"
ids = torch.LongTensor(to_ids(text)).unsqueeze(0).cuda()
text_lengths = torch.LongTensor([ids.size(1)]).cuda()
sid = torch.LongTensor([399]).cuda()  # slr143_F
with torch.no_grad():  # inference only; no gradients needed
    audio = model(ids, text_lengths, scales=[0.667, 1.0, 0.0], sid=sid).cpu().numpy()
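
The quick start stops at a NumPy array. One way to turn it into a playable file using only the standard library is sketched below. It assumes the model returns float samples in roughly [-1, 1] at the 22 050 Hz rate listed under At a glance; that scaling is an assumption to verify against the real output, and `write_wav` is a hypothetical helper.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # from the "Sample rate" row


def write_wav(samples, target, sample_rate: int = SAMPLE_RATE) -> int:
    """Write mono float samples as 16-bit PCM WAV to a path or file object.

    Returns the number of frames written.
    """
    pcm = []
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip to the valid range
        pcm.append(int(round(x * 32767)))  # scale to signed 16-bit
    data = struct.pack(f"<{len(pcm)}h", *pcm)  # little-endian, as WAV requires
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(data)
    return len(pcm)


# Demo with a 0.1 s, 440 Hz test tone instead of real model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(2205)]
buf = io.BytesIO()
frames = write_wav(tone, buf)
```

With the quick-start result, something like `write_wav(audio.squeeze().tolist(), "out.wav")` should work, provided the array flattens to a single mono waveform.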

Training data

source	hours	utterances	speakers	license
AI4Bharat IndicVoices-R Nepali	13.74h	5598	401	CC-BY-4.0
OpenSLR SLR143 (M+F TTS)	1.24h	662	2	CC-BY-SA-4.0
OpenSLR SLR43 (multi-speaker female TTS)	2.80h	2064	18	CC-BY-SA-4.0
Gemini-Flash Algenib (synthetic teacher)	4.47h	1984	1	Synthetic, public-release consent
Total	~22h	~10 300	419	CC-BY-SA-4.0 (most restrictive)
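
The per-source rows can be turned into shares of total audio and mean utterance length, which helps when judging how much each corpus contributes. The hours and utterance counts below are copied from the table; the derived numbers are plain arithmetic on those rows, not values reported by the authors.

```python
# (source, hours, utterances) copied from the table above
SOURCES = [
    ("AI4Bharat IndicVoices-R Nepali", 13.74, 5598),
    ("OpenSLR SLR143", 1.24, 662),
    ("OpenSLR SLR43", 2.80, 2064),
    ("Gemini-Flash Algenib", 4.47, 1984),
]

total_hours = sum(hours for _, hours, _ in SOURCES)
total_utts = sum(utts for _, _, utts in SOURCES)

for name, hours, utts in SOURCES:
    share = 100 * hours / total_hours  # percent of all training audio
    mean_sec = hours * 3600 / utts     # average utterance duration in seconds
    print(f"{name:32s} {share:5.1f}% of audio, {mean_sec:4.1f} s per utterance")

print(f"total: {total_hours:.2f} h, {total_utts} utterances")
```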

What this model does well

Renders common Nepali phonotactic patterns cleanly across the production speakers.
Distinguishes Nepali-specific contrasts: aspiration, retroflex/dental, oral/nasal vowels, gemination.
Produces natural-sounding prosody on Wikipedia-style and conversational sentences.

Known limitations

Rare phoneme contexts (e.g. ts i n / p ax s . ts i m / final r after h ax) are underlearned: the model mispronounces some words, such as चीन, पश्चिम, सहर. Each of these contexts appears only ~120-170 times in the training data, which is marginal for VITS to learn the articulation reliably.
/ts/ vs /tʃ/ for च: this model follows Khatiwada 2009 (/ts/). Native speakers may expect the more familiar /tʃ/ ("ch") sound for Devanagari च; this is a transcription-policy decision baked into the phoneme inventory, not a model defect.
No phonemic vowel length: ि and ी both map to i, per the Khatiwada policy.
English / mixed-script input is not supported. The G2P drops Latin-script tokens silently.
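
Because Latin-script tokens are dropped without warning, it can help to detect them before synthesis and decide whether to transliterate, skip, or reject the input. The following is a minimal sketch of such a pre-check; `latin_tokens` is a hypothetical helper that is not part of the G2P frontend, and it only inspects the raw text.

```python
import unicodedata


def latin_tokens(text: str) -> list[str]:
    """Return whitespace-separated tokens that contain at least one Latin letter."""

    def has_latin(token: str) -> bool:
        # Unicode character names for Latin letters start with "LATIN".
        return any(
            ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
            for ch in token
        )

    return [tok for tok in text.split() if has_latin(tok)]


print(latin_tokens("नेपाल हाम्रो देश हो।"))   # prints []
print(latin_tokens("म Kathmandu मा बस्छु।"))  # prints ['Kathmandu']
```

A caller could log or raise on a non-empty result before passing the sentence to `to_ids`, so that missing words in the audio do not go unnoticed.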

License

The model is released under CC-BY-SA-4.0 (Attribution-ShareAlike 4.0 International), the most restrictive license among the training datasets. If you redistribute this model or build on it, you must give attribution and release your derivative under the same license or a compatible one.

Citation

@misc{nepali_voices_v0_2026,
  title  = {Nepali Voices v0: Multi-speaker Piper-VITS for Nepali},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/nepali-voices-v0}
}

Acknowledgements

piper-plus (training stack)
ayousanz/piper-plus-base (multilingual base)
OpenSLR SLR43, SLR143 (audio corpora)
AI4Bharat IndicVoices-R (audio corpus)
Khatiwada, R. (2009). Nepali. Journal of the International Phonetic Association, 39(3), 373-380. (phonology source)
