This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Khasi OmniVoice TTS 2

Khasi OmniVoice TTS 2 (toiar/Khasi-OmniVoice-TTS-2) is a Text-to-Speech (TTS) model designed to synthesize high-quality, natural-sounding speech in the Khasi language. It is specifically trained to handle realistic, modern Khasi speech patterns, including natural Khasi-English code-switching.

Model Details

Model Type: Text-to-Speech (TTS)
Language(s): Khasi (kha), English (eng)
Developer: toiar

Training Data

This model was fine-tuned exclusively on the Khasi Omni Voice TTS Dataset.

Total Audio Duration: ~50 hours (49h 53m)
Sample Count: 18,874 distinct utterances
Demographics: Voices representing teenagers and young adults (ages 13 to 25).
Audio Quality: The training data underwent rigorous pre-processing to remove background noise, clicks, and environmental artifacts, followed by vocal enhancement for optimal clarity.
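
As a rough sanity check on the figures above, the mean utterance length follows from dividing the total duration by the sample count. The snippet below is illustrative arithmetic only; the variable names are hypothetical, and it assumes the listed 49h 53m and 18,874 values are exact.

```python
# Illustrative arithmetic from the dataset figures listed above.
total_seconds = 49 * 3600 + 53 * 60  # 49h 53m expressed in seconds
sample_count = 18_874

mean_utterance_seconds = total_seconds / sample_count
print(f"Mean utterance length: {mean_utterance_seconds:.1f} s")
```

Under those assumptions, the average works out to roughly 9.5 seconds per utterance.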

Uses

Direct Use

This model is intended for researchers, developers, and educators working on accessibility tools, educational software, or conversational AI for the Khasi-speaking population of Meghalaya and beyond.

Limitations and Bias

Khasi Text Normalization: The underlying model architecture may support automatic text normalization for higher-resource languages like English, but it lacks this capability for Khasi due to limited training data. As a result, digits and symbols (for example, "1") are not synthesized correctly and must be written out as words in the input text.
Code-Switching Limitations: The model natively handles Khasi-English code-switching, but its capabilities are limited. Complex code-switched sentences may sound less natural than the output of dedicated English TTS models.
Voice Cloning Constraints: If the model is used for voice cloning, the generated audio may drift toward the voice characteristics of the training dataset and fail to fully capture target voices outside its training distribution.
Low-Resource Constraints: As Khasi is a low-resource language, the model might occasionally mispronounce rare dialectal words or highly specific localized terminology.
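
Because digits and symbols need to be spelled out before synthesis, a pre-check on the input text can catch them early. The following is a minimal sketch and not part of the model's tooling: `find_unnormalized` is a hypothetical helper, and the set of allowed punctuation is an assumption that should be adjusted to what the model handles in practice.

```python
import unicodedata

# Assumption: punctuation treated as safe to pass through to the model.
ALLOWED_PUNCTUATION = set(".,!?;:'\"-()")


def find_unnormalized(text: str) -> list[tuple[int, str]]:
    """Return (index, character) pairs for digits, symbols, and punctuation
    outside the allowed set, which should be written out as words."""
    flagged = []
    for index, char in enumerate(text):
        if char.isspace() or char.isalpha() or char in ALLOWED_PUNCTUATION:
            continue
        # Unicode general categories: N* numbers, S* symbols, P* punctuation.
        if unicodedata.category(char)[0] in ("N", "S", "P"):
            flagged.append((index, char))
    return flagged


# Example: report the characters that still need to be spelled out.
print([char for _, char in find_unnormalized("Call 2 friends & wait")])
# prints ['2', '&']
```

An empty result means the text contains only letters, whitespace, and the allowed punctuation. The helper only reports positions; writing the flagged items out as Khasi words is left to the caller.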

Acknowledgements

This model is a fine-tuned version of the OmniVoice model. The original OmniVoice work introduced a massively multilingual zero-shot TTS framework supporting over 600 languages. This work adapts OmniVoice to Khasi speech synthesis by fine-tuning it on a dedicated Khasi speech corpus.

We gratefully acknowledge the authors of OmniVoice for releasing the base model and research that made this work possible.

Base Model

Base Model: OmniVoice
Original Repository: OmniVoice GitHub Repository
Original Paper: OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Citation

If you use the original OmniVoice architecture, please cite:

@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}

If you use Khasi OmniVoice TTS 2, please cite:

@misc{toiar2026khasiomnivoicetts2,
  title={Khasi OmniVoice TTS 2},
  author={Toiarbor Mawlieh},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/toiar/Khasi-OmniVoice-TTS-2}}
}

Model size: 0.6B params
Tensor type: I64, F32
Format: Safetensors
