Sourashtra VITS TTS Models
VITS text-to-speech models for the Sourashtra language (ISO 639-3: saz), a minority Indo-Aryan language spoken primarily in Tamil Nadu, India. The models were trained with Coqui TTS on a custom annotated speech corpus.
Four variants: 2 speakers (male, female) × 2 input scripts (Tamil script, Sourashtra script).
Models
| Folder | Speaker | Input Script | Training Steps |
|---|---|---|---|
| Sourashtra-Male_Script-tamil | Male | Tamil (தமிழ்) | 300,000 |
| Sourashtra-Male_Script-sourashtra | Male | Sourashtra (ꢪꢾꢥꢶꢒ) | 300,000 |
| Sourashtra-Female_Script-tamil | Female | Tamil (தமிழ்) | 340,000 |
| Sourashtra-Female_Script-sourashtra | Female | Sourashtra (ꢪꢾꢥꢶꢒ) | 340,000 |
Each folder contains best_model.pth, config.json, inference.py, and requirements.txt.
Setup
pip install -r requirements.txt
For GPU inference, install the CUDA-enabled PyTorch build matching your driver first — see pytorch.org.
Usage
Run inference.py from inside the model folder:
# Male — Tamil script
cd Sourashtra-Male_Script-tamil
python inference.py "சொராஷ்ட்ர மொழி" -o output.wav
# Male — Sourashtra script
cd Sourashtra-Male_Script-sourashtra
python inference.py "ꢪꢾꢥꢶꢒ ꢪꢒꢡ" -o output.wav
# Female — Tamil script
cd Sourashtra-Female_Script-tamil
python inference.py "சொராஷ்ட்ர மொழி" -o output.wav
# Female — Sourashtra script
cd Sourashtra-Female_Script-sourashtra
python inference.py "ꢪꢾꢥꢶꢒ ꢪꢒꢡ" -o output.wav
Use --gpu <id> to select a GPU, or --cpu to force CPU inference.
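To synthesize several sentences in one go, the command line above can be wrapped in a small Python helper. The following is a minimal sketch, not part of the released code: `build_command` and `synthesize_all` are hypothetical names, the `utt_000.wav` naming scheme is made up for illustration, and the sketch assumes that `--gpu` and `--cpu` are mutually exclusive and that a relative `-o` path is resolved against the model folder.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def build_command(text: str, out_path: str,
                  gpu: Optional[int] = None, cpu: bool = False) -> List[str]:
    """Build the argument list for one inference.py call (hypothetical helper)."""
    if cpu and gpu is not None:
        # Assumption: the two device flags are not meant to be combined.
        raise ValueError("choose either cpu=True or a gpu id, not both")
    cmd = [sys.executable, "inference.py", text, "-o", out_path]
    if cpu:
        cmd.append("--cpu")
    elif gpu is not None:
        cmd += ["--gpu", str(gpu)]
    return cmd


def synthesize_all(model_dir: str, sentences: List[str],
                   cpu: bool = False) -> List[Path]:
    """Run inference.py once per sentence, from inside model_dir."""
    outputs = []
    for i, sentence in enumerate(sentences):
        out_name = f"utt_{i:03d}.wav"  # illustrative naming scheme
        # cwd=model_dir mirrors the "cd <folder>" step in the usage examples.
        subprocess.run(build_command(sentence, out_name, cpu=cpu),
                       cwd=model_dir, check=True)
        outputs.append(Path(model_dir) / out_name)
    return outputs
```

For example, `synthesize_all("Sourashtra-Male_Script-tamil", ["சொராஷ்ட்ர மொழி"], cpu=True)` would run one CPU synthesis in that folder. Each call starts a new process and therefore presumably reloads the model, so this is convenient rather than fast.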
Script Notes
For a given speaker, the Tamil-script and Sourashtra-script models produce the same voice; only the input orthography differs. Choose based on the script of your source text.
- Tamil script models: strip `:`, `.`, and `'` and apply NFC normalization automatically
- Sourashtra script models: strip Sourashtra Danda (꣎) and Double Danda (꣏) automatically
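The cleanup described in the two bullets above can be approximated as follows, which may help when preparing text outside `inference.py`. This is a sketch under assumptions rather than the scripts' actual code: the function names are hypothetical, the stripped Tamil-script characters are taken to be the colon, period, and apostrophe, and the characters are deleted outright rather than replaced with spaces.

```python
import unicodedata

# Assumed character sets, read off the notes above.
TAMIL_STRIP = ":.'"
SOURASHTRA_STRIP = "\ua8ce\ua8cf"  # Saurashtra Danda, Double Danda

_TAMIL_TABLE = str.maketrans("", "", TAMIL_STRIP)
_SOURASHTRA_TABLE = str.maketrans("", "", SOURASHTRA_STRIP)


def normalize_tamil(text: str) -> str:
    """Delete the assumed punctuation, then apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text.translate(_TAMIL_TABLE))


def normalize_sourashtra(text: str) -> str:
    """Delete the Danda and Double Danda marks."""
    return text.translate(_SOURASHTRA_TABLE)
```

NFC matters for Tamil input because some vowel signs can be encoded either as one code point or as a sequence of two; normalizing maps both spellings to the same characters before they reach a character-level model.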
Training
| Parameter | Value |
|---|---|
| Architecture | VITS (end-to-end, flow-based) |
| Sample rate | 22050 Hz |
| Mel bins | 80 |
| Batch size | 16 |
| Mixed precision | Yes |
| Phonemes | No (character-level) |
Male training data: ~9,800–10,000 utterances. Female training data: ~11,400 utterances.
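As a quick sanity check, a synthesized file can be compared against the 22050 Hz sample rate listed in the table above. The sketch below uses only the Python standard library; `wav_info` is a hypothetical helper, and it assumes the output is an uncompressed PCM WAV file, which is the only kind the `wave` module reads.

```python
import wave

EXPECTED_SAMPLE_RATE = 22050  # from the training table


def wav_info(path: str):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / rate


def matches_training_rate(path: str) -> bool:
    """True if the file's sample rate equals the models' training rate."""
    return wav_info(path)[0] == EXPECTED_SAMPLE_RATE
```

Calling `wav_info("output.wav")` after running one of the usage commands gives the rate and length of the generated audio.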