A multi-speaker Nepali text-to-speech model in the Piper / piper-plus format. 419 speaker embeddings, ~22 hours of training audio, custom 65-phone Nepali inventory (NOT eSpeak).
| Property | Value |
|---|---|
| Architecture | VITS (multi-speaker), 77.5 M parameters |
| Phoneme inventory | Project-internal 65-phone Nepali (Khatiwada 2009) |
| Speaker count | 419 |
| Sample rate | 22 050 Hz |
| Audio quality | 22 kHz medium |
| Base | ayousanz/piper-plus-base (multilingual, 6 languages) |
| Fine-tune steps | ~130 800 (v2 600 epochs + v3b 200 epochs) |
| License | CC-BY-SA-4.0 (forced by training-data licenses) |
| speaker_id | label | training utterances | notes |
|---|---|---|---|
| 399 | slr143_F | 554 | Cleanest studio female. Default recommendation. |
| 403 | slr43_0546 | 505 | Alternate clean female (different timbre). |
| 406 | slr43_2099 | 275 | Alternate clean female. |
| 400 | slr143_M | 108 | Male reference. Smaller training set, voice less stable. |
| 398 | algenib | 1984 | Synthetic teacher (Gemini-Flash). Under-trained at this checkpoint. |
For other speaker IDs (IV-R crowdsourced, additional SLR43 voices), see dataset.jsonl for the full
mapping. Quality varies; the four IDs above are the curated production set.
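
As a small convenience, the speaker IDs from the table above can be kept in a lookup so that inference code refers to voices by label rather than by number. This is only a sketch: `CURATED_SPEAKERS` and `speaker_id` are hypothetical names introduced here, not part of piper-plus or of this repository.

```python
# Speaker labels and IDs copied from the table above.
# The dict and helper names are illustrative, not part of any library.
CURATED_SPEAKERS = {
    "slr143_F": 399,    # cleanest studio female, default recommendation
    "slr43_0546": 403,  # alternate clean female (different timbre)
    "slr43_2099": 406,  # alternate clean female
    "slr143_M": 400,    # male reference, less stable
    "algenib": 398,     # synthetic teacher, under-trained at this checkpoint
}
DEFAULT_SPEAKER = "slr143_F"


def speaker_id(label: str = DEFAULT_SPEAKER) -> int:
    """Return the numeric speaker_id for one of the labels listed above."""
    try:
        return CURATED_SPEAKERS[label]
    except KeyError:
        known = ", ".join(sorted(CURATED_SPEAKERS))
        raise ValueError(f"unknown speaker {label!r}; expected one of: {known}") from None
```

The returned integer is what gets wrapped in a tensor and passed as `sid` at inference time.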
1. Install piper-plus (the trainer/inference fork we used).
2. Install our G2P frontend (the phoneme producer; required, because eSpeak is NOT compatible).
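```python
import json

import torch
from piper_train.vits import VitsModel
from nepali_frontend.g2p import phonemizer as ph

# Load the checkpoint for inference only (no dataset needed).
model = VitsModel.load_from_checkpoint("model.ckpt", dataset=None).cuda().eval()

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)
PIM = config["phoneme_id_map"]  # phone -> list of integer IDs


def to_ids(sentence: str) -> list[int]:
    """Convert Nepali text to the model's phoneme-ID sequence."""
    out = [1]  # BOS
    for w in ph.phonemize_text(sentence):
        for p in w.phones:
            if p == "|":  # skip "|" separator tokens
                continue
            out.extend(PIM.get(p, []))  # phones missing from the map are dropped
    out.append(2)  # EOS
    return out


text = "नेपाल हाम्रो देश हो।"
ids = torch.LongTensor(to_ids(text)).unsqueeze(0).cuda()
text_lengths = torch.LongTensor([ids.size(1)]).cuda()
sid = torch.LongTensor([399]).cuda()  # slr143_F

# Disable autograd: without it, .numpy() fails on an output that requires grad.
with torch.no_grad():
    audio = model(ids, text_lengths, scales=[0.667, 1.0, 0.0], sid=sid).cpu().numpy()
```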
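
The snippet above stops at a NumPy array. One way to turn it into a playable file is sketched below, using only the Python standard library and the 22 050 Hz sample rate from the spec table. `save_wav` is a hypothetical helper, and the sketch assumes the samples are mono floats roughly in [-1, 1]; values outside that range are clipped rather than normalized.

```python
import struct
import wave

SAMPLE_RATE = 22050  # Hz, from the model spec table


def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write mono float samples as 16-bit PCM WAV; return the frame count."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))         # clip to the valid range
        pcm += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return len(pcm) // 2
```

With the `audio` array from the inference example, a call might look like `save_wav(audio.squeeze().tolist(), "out.wav")`, assuming the output squeezes down to a single 1-D waveform.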
| source | hours | utterances | speakers | license |
|---|---|---|---|---|
| AI4Bharat IndicVoices-R Nepali | 13.74h | 5598 | 401 | CC-BY-4.0 |
| OpenSLR SLR143 (M+F TTS) | 1.24h | 662 | 2 | CC-BY-SA-4.0 |
| OpenSLR SLR43 (multi-speaker female TTS) | 2.80h | 2064 | 18 | CC-BY-SA-4.0 |
| Gemini-Flash Algenib (synthetic teacher) | 4.47h | 1984 | 1 | Synthetic, public-release consent |
| Total | ~22h | ~10 200 | 419 | CC-BY-SA-4.0 (most restrictive) |
- Certain phoneme contexts (`ts i n` / `p ax s . ts i m` / final `r` after `h ax`) are underlearned: the model fumbles certain words like चीन, पश्चिम, सहर. These contexts appear ~120-170 times in training, which is in the marginal zone for VITS articulation learning.
- Devanagari च is transcribed as the affricate /ts/. Native speakers may perceive च as the more familiar /tʃ/ ("ch") sound; this is a transcription-policy decision baked into the phoneme inventory, not a model defect.
- ि and ी both map to `i` per Khatiwada policy.

The model is released under CC-BY-SA-4.0 (Attribution-ShareAlike 4.0 International), the most restrictive license among the training datasets. If you redistribute or build on this model, your work must also be ShareAlike-licensed.
@misc{nepali_voices_v0_2026,
title = {Nepali Voices v0: Multi-speaker Piper-VITS for Nepali},
author = {Ampixa},
year = {2026},
url = {https://huggingface.co/ampixa/nepali-voices-v0}
}