suleiman2003/final_corrected_hausa_dataset
A multi-speaker text-to-speech model for the Hausa language, based on the VITS architecture (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech).
Built on the VITS architecture with multi-speaker conditioning via learned speaker embeddings (256-dim).
Speaker embedding: `nn.Embedding(n_speakers, 256)` (learned lookup)

Dataset columns: `audio`, `text`, `speaker_id`, `gender`

Data split: the `train` split is automatically divided 95/5 into train/val

| Component | Details |
|---|---|
| Architecture | VITS (multi-speaker) |
| Language | Hausa (hau) |
| Sampling Rate | 16,000 Hz |
| Hidden Size | 192 |
| FFN Dim | 768 |
| Attention Heads | 2 |
| Transformer Layers | 6 |
| Speaker Embedding Dim | 256 |
| Generator Params | ~40M |
| Discriminator Params | ~47M |
| Upsample Rates | [8, 8, 2, 2] |
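
As a rough illustration of the speaker conditioning listed in the table, the sketch below looks up one 256-dimensional vector per speaker and adds it to a feature map at every time step. `N_SPEAKERS`, the batch contents, and the random feature map are hypothetical. Adding the vector directly is a simplification, since VITS-style models usually project it with a 1x1 convolution inside each conditioned module.

```python
import torch
import torch.nn as nn

N_SPEAKERS = 4       # hypothetical; the training script detects this from the dataset
GIN_CHANNELS = 256   # speaker embedding dimension from the table

# Learned lookup table: one trainable 256-dim vector per speaker ID
emb_g = nn.Embedding(N_SPEAKERS, GIN_CHANNELS)

speaker_ids = torch.LongTensor([0, 3])   # one speaker ID per utterance in the batch
g = emb_g(speaker_ids).unsqueeze(-1)     # shape [batch, 256, 1]

# Stand-in for a [batch, channels, time] feature map inside the model
hidden = torch.randn(2, GIN_CHANNELS, 50)

# Broadcasting adds the same speaker vector to every frame of each utterance
conditioned = hidden + g
print(g.shape, conditioned.shape)
```

Because the lookup is indexed by integer ID, speaker IDs must lie in the range `0 .. n_speakers - 1`.
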

| Hyperparameter | Value |
|---|---|
| Learning Rate | 2e-4 |
| Optimizer | AdamW (β₁=0.8, β₂=0.99) |
| LR Schedule | ExponentialLR (γ=0.999875) |
| Batch Size | 16 |
| Epochs | 200 |
| Mel Loss Weight | 45 |
| KL Loss Weight | 1.0 |
| FP16 | ✅ |
| Segment Size | 8192 samples |
| Max Audio Length | 10 seconds |
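
A few derived quantities follow from the two tables above and may help when checking a config. The sketch rests on two assumptions that the card does not state: that the hop length equals the product of the decoder upsample rates, and that the learning-rate scheduler steps once per epoch.

```python
import math

SAMPLE_RATE = 16_000            # Hz
UPSAMPLE_RATES = [8, 8, 2, 2]   # decoder upsample rates
SEGMENT_SIZE = 8192             # training segment, in samples
MAX_AUDIO_SECONDS = 10
BASE_LR = 2e-4
GAMMA = 0.999875
EPOCHS = 200

# Assumption: hop length equals the total decoder upsampling factor
hop_length = math.prod(UPSAMPLE_RATES)                       # 8 * 8 * 2 * 2 = 256
frames_per_second = SAMPLE_RATE / hop_length                 # 62.5
segment_frames = SEGMENT_SIZE // hop_length                  # 32
segment_seconds = SEGMENT_SIZE / SAMPLE_RATE                 # 0.512
max_frames = MAX_AUDIO_SECONDS * SAMPLE_RATE // hop_length   # 625

# Assumption: ExponentialLR multiplies the rate by GAMMA once per epoch
final_lr = BASE_LR * GAMMA ** EPOCHS                         # roughly 1.95e-4

print(hop_length, frames_per_second, segment_frames, segment_seconds, max_frames)
print(f"{final_lr:.3e}")
```

Under these assumptions the learning rate decays by only about 2.5% over the full run, so the schedule is close to constant.
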
The model is trained with five losses, following the original VITS paper.
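
For reference, the generator objective in the VITS paper sums five terms: mel reconstruction, KL divergence, duration, adversarial, and feature matching. Assuming the training script follows that formulation with the mel and KL weights from the hyperparameter table, the total would be:

```latex
\mathcal{L}_{G} = \lambda_{\text{mel}}\,\mathcal{L}_{\text{recon}}
               + \lambda_{\text{kl}}\,\mathcal{L}_{\text{kl}}
               + \mathcal{L}_{\text{dur}}
               + \mathcal{L}_{\text{adv}}(G)
               + \mathcal{L}_{\text{fm}}(G),
\qquad \lambda_{\text{mel}} = 45,\quad \lambda_{\text{kl}} = 1.0
```

The discriminator is trained separately with its own adversarial loss $\mathcal{L}_{\text{adv}}(D)$; whether the count of five in this card includes that term is not stated.
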
```bash
pip install -r requirements.txt
```
```bash
# Single GPU
python train_hausa_tts.py

# The script will:
# 1. Load the dataset from the Hugging Face Hub
# 2. Auto-detect speakers and remap IDs
# 3. Split into train/val (95/5)
# 4. Load text encoder from facebook/mms-tts-hau
# 5. Train with full VITS losses
# 6. Push final model to Hub
```
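
Steps 2 and 3 of the script can be sketched in plain Python. This is an illustration under stated assumptions, not the script's actual code: the helper names, the seed, and the sample speaker labels are hypothetical, and the real script may rely on the `datasets` library's own split utilities instead.

```python
import random

def remap_speaker_ids(raw_ids):
    """Map arbitrary speaker labels to contiguous integers 0..n_speakers-1."""
    mapping = {raw: idx for idx, raw in enumerate(sorted(set(raw_ids)))}
    return [mapping[r] for r in raw_ids], mapping

def split_indices(n_items, val_fraction=0.05, seed=42):
    """Shuffle indices deterministically and hold out val_fraction for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_items * val_fraction))
    return indices[n_val:], indices[:n_val]

# Hypothetical raw speaker labels for 40 clips
raw_speaker_ids = [17, 3, 17, 42, 3, 42, 17, 3] * 5

speaker_ids, mapping = remap_speaker_ids(raw_speaker_ids)
train_idx, val_idx = split_indices(len(speaker_ids))

print(mapping)                        # {3: 0, 17: 1, 42: 2}
print(len(train_idx), len(val_idx))   # 38 2
```

Remapping to a contiguous range matters because the speaker embedding table is indexed directly by speaker ID.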
```bash
# Recommended: A10G (24 GB VRAM) or better
huggingface-cli jobs run train_hausa_tts.py \
    --hardware a10g-large \
    --timeout 8h \
    --dependencies torch torchaudio transformers datasets librosa trackio huggingface_hub soundfile numpy scipy
```
```python
import json

import soundfile as sf
import torch

# SynthesizerTrn is defined in train_hausa_tts.py; import it from there
# (or copy the class into this file).
from train_hausa_tts import SynthesizerTrn

# Load the trained model
checkpoint = torch.load("generator.pth", map_location="cpu")
with open("config.json") as f:
    config = json.load(f)

# Rebuild the model with the hyperparameters stored in config.json
model = SynthesizerTrn(
    n_vocab=config["vocab_size"],
    spec_channels=config["n_fft"] // 2 + 1,
    segment_size=config["segment_size"] // config["hop_length"],
    inter_channels=config["inter_channels"],
    hidden_channels=config["hidden_size"],
    filter_channels=config["filter_channels"],
    n_heads=config["n_heads"],
    n_layers=config["n_layers"],
    kernel_size=config["kernel_size"],
    p_dropout=config["p_dropout"],
    resblock_kernel_sizes=config["resblock_kernel_sizes"],
    resblock_dilation_sizes=config["resblock_dilation_sizes"],
    upsample_rates=config["upsample_rates"],
    upsample_initial_channel=config["upsample_initial_channel"],
    upsample_kernel_sizes=config["upsample_kernel_sizes"],
    n_speakers=config["n_speakers"],
    gin_channels=config["gin_channels"],
    use_sdp=True,
)
model.load_state_dict(checkpoint["model"])
model.eval()

# Tokenize text
vocab = config["vocab"]


def text_to_ids(text, vocab, add_blank=True):
    """Map characters to vocabulary IDs, skipping characters not in the vocab."""
    text = text.lower().strip()
    ids = [vocab[c] for c in text if c in vocab]
    if add_blank:
        # Intersperse ID 0 between and around all symbols
        new_ids = [0] * (len(ids) * 2 + 1)
        new_ids[1::2] = ids
        ids = new_ids
    return ids


# Generate speech
text = "Sannu da zuwa"
text_ids = torch.LongTensor([text_to_ids(text, vocab)])
text_lengths = torch.LongTensor([text_ids.size(1)])
speaker_id = torch.LongTensor([0])  # Choose speaker

with torch.no_grad():
    audio, _, _, _ = model.infer(text_ids, text_lengths, speaker_id)

audio = audio.squeeze().cpu().numpy()

# Save at the model's 16 kHz sampling rate
sf.write("output.wav", audio, 16000)
```
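
To make the tokenization step concrete, the snippet below runs the same `text_to_ids` logic on a toy vocabulary. The vocabulary is hypothetical and much smaller than the one stored in `config.json`; it only serves to show how ID 0 is interleaved and how unknown characters are dropped.

```python
def text_to_ids(text, vocab, add_blank=True):
    """Map characters to vocabulary IDs, skipping characters not in the vocab."""
    text = text.lower().strip()
    ids = [vocab[c] for c in text if c in vocab]
    if add_blank:
        # Intersperse ID 0 between and around all symbols
        new_ids = [0] * (len(ids) * 2 + 1)
        new_ids[1::2] = ids
        ids = new_ids
    return ids

# Hypothetical toy vocabulary; ID 0 plays the role of the blank token
toy_vocab = {"_": 0, "a": 1, "n": 2, "s": 3, "u": 4}

plain = text_to_ids("Sannu", toy_vocab, add_blank=False)
blanked = text_to_ids("Sannu", toy_vocab)
dropped = text_to_ids("Sannu!", toy_vocab, add_blank=False)

print(plain)     # [3, 1, 2, 2, 4]
print(blanked)   # [0, 3, 0, 1, 0, 2, 0, 2, 0, 4, 0]
print(dropped)   # [3, 1, 2, 2, 4]  ("!" is not in the vocabulary)
```

With `add_blank=True`, a sequence of `n` symbols becomes `2n + 1` IDs. Characters missing from the vocabulary are dropped silently, so it can be worth checking the input text against the vocabulary before synthesis.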
```bibtex
@inproceedings{kim2021conditional,
  title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  year={2021}
}

@article{pratap2023scaling,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}
```
CC-BY-NC-4.0 (following the base model license)
Base model: facebook/mms-tts-hau