🗣️ Sinhala TTS VITS 🇱🇰

Sinhala Text-to-Speech: a Coqui TTS VITS model that generates natural Sinhala speech from text, with 16 distinct voices to choose from.

🎯 Model Details

| Attribute | Value |
| --- | --- |
| Architecture | VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) |
| Language | 🇱🇰 Sinhala (සිංහල) |
| Speakers | 16 voices |
| Sample Rate | 16 kHz |
| Parameters | ~30M |
| Vocab | 97 symbols (74 Sinhala Unicode characters + 19 punctuation marks + 4 special tokens) |
| Framework | Coqui TTS 0.27.x |
| License | Apache 2.0 |
| Model Format | SafeTensors (.safetensors) |

🗣️ Available Speakers

| ID | Speaker Name | Description |
| --- | --- | --- |
| 0 | mettananda | Male voice 1 |
| 1 | oshadi | Female voice 1 |
| 2 | pn_sin_01 | Voice 3 |
| 3 | sin_01 | Voice 4 |
| 4 | sin_2241 | Voice 5 |
| 5 | sin_2282 | Voice 6 |
| 6 | sin_3531 | Voice 7 |
| 7 | sin_3688 | Voice 8 |
| 8 | sin_3976 | Voice 9 |
| 9 | sin_4191 | Voice 10 |
| 10 | sin_4499 | Voice 11 |
| 11 | sin_5681 | Voice 12 |
| 12 | sin_6314 | Voice 13 |
| 13 | sin_6897 | Voice 14 |
| 14 | sin_7183 | Voice 15 |
| 15 | sin_9228 | Voice 16 |
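
The IDs and names in the table can be mapped onto each other with a few lines of Python. The following is a minimal sketch: the `SPEAKERS` dictionary is copied from the table, while `load_speakers` and `resolve_speaker` are hypothetical helpers, and the flat name-to-ID layout assumed for speakers.json has not been confirmed against the actual file.

```python
import json
from pathlib import Path

# Speaker names and IDs copied from the table above.
SPEAKERS = {
    "mettananda": 0, "oshadi": 1, "pn_sin_01": 2, "sin_01": 3,
    "sin_2241": 4, "sin_2282": 5, "sin_3531": 6, "sin_3688": 7,
    "sin_3976": 8, "sin_4191": 9, "sin_4499": 10, "sin_5681": 11,
    "sin_6314": 12, "sin_6897": 13, "sin_7183": 14, "sin_9228": 15,
}


def load_speakers(path="speakers.json"):
    """Read a speaker mapping from disk.

    Assumption: the file is a flat JSON object of the form {"name": id}.
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))


def resolve_speaker(speaker, speakers=SPEAKERS):
    """Return the speaker name for either a name or a numeric ID."""
    if isinstance(speaker, int):
        for name, speaker_id in speakers.items():
            if speaker_id == speaker:
                return name
        raise KeyError(f"Unknown speaker ID: {speaker}")
    if speaker not in speakers:
        raise KeyError(f"Unknown speaker name: {speaker!r}")
    return speaker
```

Validating the name before synthesis gives a clear error for a typo instead of a failure deeper inside the model.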

🚀 Usage

Option 1: Coqui TTS (Recommended)

import soundfile as sf
import torch
from safetensors.torch import load_file
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# Load the model configuration shipped with the weights
config = VitsConfig()
config.load_json("config.json")

# Initialize the audio processor, tokenizer, and speaker manager
ap = AudioProcessor.init_from_config(config)
tokenizer, new_config = TTSTokenizer.init_from_config(config)
speaker_manager = SpeakerManager()
speaker_manager.load_ids_from_file("speakers.json")

# Create the model and load the SafeTensors weights
model = Vits(new_config, ap, tokenizer, speaker_manager)
state_dict = load_file("sinhala_tts_vits_model.safetensors")
# strict=False skips missing or unexpected keys instead of raising an error
model.load_state_dict(state_dict, strict=False)
model.eval()

# Synthesize (inference only, so gradient tracking is disabled)
text = "ආයුබෝවන්! ඔබට කොහොමද?"
with torch.no_grad():
    outputs = model.synthesize(text, config=new_config, speaker="mettananda")

# Save audio at the model's 16 kHz sample rate
sf.write("output.wav", outputs["wav"], 16000)

Option 2: REST API (with included server.py)

# Start the server
python server.py

# Generate speech
curl -X POST http://localhost:8081/tts \
  -H "Content-Type: application/json" \
  -d '{
    "text": "ආයුබෝවන්!",
    "speaker": "mettananda",
    "emotion": "neutral"
  }' \
  --output output.wav

# Health check
curl http://localhost:8081/health

# List speakers
curl http://localhost:8081/speakers
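
The same `/tts` request can be issued from Python using only the standard library. This is a sketch rather than an official client: `build_tts_request` and `synthesize_to_file` are hypothetical helper names, and the code assumes the server returns raw WAV bytes in the response body, as the `--output output.wav` flag in the curl example suggests.

```python
import json
import urllib.request


def build_tts_request(text, speaker="mettananda", emotion="neutral",
                      base_url="http://localhost:8081"):
    """Build the POST request for /tts without sending it."""
    payload = json.dumps(
        {"text": text, "speaker": speaker, "emotion": emotion},
        ensure_ascii=False,  # keep Sinhala characters readable in the body
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="output.wav", **kwargs):
    """Send the request and write the response body to out_path.

    Assumption: the server replies with WAV audio bytes.
    """
    request = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as wav_file:
        wav_file.write(audio)
    return out_path


if __name__ == "__main__":
    # Requires server.py to be running on localhost:8081.
    print(synthesize_to_file("ආයුබෝවන්!", speaker="mettananda"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.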

Option 3: HuggingFace Inference API

⚠️ This model uses Coqui TTS (not Transformers) and cannot be used via the standard HF Inference API. Use Coqui TTS directly or the included REST API server.

Option 4: Docker Deployment

docker build -t sinhala-tts-server .
docker run -p 8081:8081 sinhala-tts-server
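
After `docker run`, the container may need some time to load the model before it accepts requests. One way to handle this is to poll the `/health` endpoint until it responds. The sketch below assumes the endpoint returns HTTP 200 once the server is ready; `wait_for_health` and its `probe` parameter are hypothetical names introduced for illustration.

```python
import time
import urllib.error
import urllib.request


def _http_ok(url):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or reset while the container is still starting.
        return False


def wait_for_health(url="http://localhost:8081/health", timeout_s=60.0,
                    interval_s=2.0, probe=None):
    """Poll url until the probe succeeds or timeout_s elapses.

    probe is injectable so the loop can be exercised without a server.
    """
    probe = probe or _http_ok
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    print("ready" if wait_for_health() else "server did not become healthy")
```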

🛠️ Development Platforms

| Platform | GPU | Cost | Best For |
| --- | --- | --- | --- |
| Kaggle | P100/T4 | Free (~30 hrs/week) | Quick experiments |
| Colab | T4/A100 | Free / $10/mo Pro | Training runs |
| Modal | A100 80GB | $20 free credit | Full training |
| RunPod | RTX 4090/A100 | $0.34–$2.00/hr | Production |

📦 Files

| File | Description | Size |
| --- | --- | --- |
| sinhala_tts_vits_model.safetensors | Model weights (SafeTensors) | 316 MB |
| config.json | Model configuration | 8 KB |
| speakers.json | Speaker ID mapping | 300 B |
| server.py | FastAPI REST inference server | 6 KB |
| Dockerfile | Docker build for production | 2 KB |
| DEVELOPER_GUIDE.md | Training & development guide | 15 KB |

🎓 Training & Fine-Tuning

For detailed instructions, see DEVELOPER_GUIDE.md, which covers:

  • Setup: Environment configuration and dependency installation
  • Training from scratch: Full training pipeline with the Sinhala dataset
  • Fine-tuning: Adapting the model to new voices or domains
  • Dataset preparation: Preprocessing Sinhala audio data
  • Export to SafeTensors: Converting PyTorch checkpoints to SafeTensors format
  • Cloud GPU training: Step-by-step guides for Kaggle, Colab, and Modal

🌐 Deployment Options

| Method | Description | Best For |
| --- | --- | --- |
| HuggingFace Space | Gradio web UI (live demo) | Quick testing |
| FastAPI Server | REST API with Docker | Production APIs |
| Local Python | Direct model loading | Development |
| Kubernetes | Docker container in K8s | Scalable deployment |

⚠️ Limitations

  • Audio quality: Trained on a limited dataset (~200 samples × 16 speakers), so quality may vary
  • Inference speed: CPU inference is slower; GPU recommended for production
  • Emotion control: Basic emotion prefixes are supported but effects are subtle
  • Proper nouns: May struggle with non-Sinhala words or names
  • Out-of-vocabulary characters: Limited to the 97-symbol vocabulary (93 characters plus 4 special tokens)
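
Because of the vocabulary limit, it can help to screen input text before synthesis. The sketch below is only an approximation: the authoritative character list lives in config.json, whereas this check accepts any code point in the Unicode Sinhala block (U+0D80 to U+0DFF), which is broader than the model's 74 Sinhala characters, and `ASSUMED_PUNCTUATION` is a hypothetical stand-in for the model's 19 punctuation marks.

```python
# Unicode Sinhala block; broader than the model's actual 74 characters.
SINHALA_BLOCK = range(0x0D80, 0x0E00)

# Hypothetical punctuation set; the real 19 marks are defined in config.json.
ASSUMED_PUNCTUATION = set(" !?,.;:'\"-()")


def find_unsupported(text):
    """Return the sorted, de-duplicated characters that are likely out of vocabulary."""
    return sorted(
        {
            ch
            for ch in text
            if ord(ch) not in SINHALA_BLOCK and ch not in ASSUMED_PUNCTUATION
        }
    )


if __name__ == "__main__":
    # Latin letters and digits are flagged; Sinhala text passes.
    print(find_unsupported("ආයුබෝවන් TTS 5"))
```

Flagged characters could be transliterated or removed before the text is passed to the model; which strategy works best for this model has not been evaluated here.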

πŸ“ License

This model is released under the Apache 2.0 License.

πŸ™ Maintainer

Death Legion Team (🤗 HuggingFace)


🎧 Try the Live Demo • 📖 Developer Guide • 🏠 Death Legion Team
