Hausa Multi-Speaker VITS TTS

Multi-speaker text-to-speech model for the Hausa language, based on the VITS architecture (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech).

Architecture

Built on the VITS architecture with multi-speaker conditioning via learned speaker embeddings (256-dim).

What's transferred from facebook/mms-tts-hau:

✅ Text encoder (6-layer Transformer): Hausa phoneme/character knowledge
✅ Vocabulary (34 characters): Hausa-specific character set, including ƙ, ɓ, ɗ, and ā

What's trained from scratch (NO voice leakage):

🆕 Speaker embedding table: nn.Embedding(n_speakers, 256) — learned lookup
🆕 HiFi-GAN decoder: Waveform generator with speaker conditioning
🆕 Posterior encoder: 16-layer WaveNet with speaker conditioning
🆕 Normalizing flow: 4 coupling layers with speaker conditioning
🆕 Stochastic duration predictor: With speaker conditioning
🆕 Multi-Period Discriminator: 6 sub-discriminators for adversarial training

Key Design Decisions:

No original voice leakage: Only the text encoder (speaker-independent by design in VITS) is loaded from the pretrained model. All voice-producing components are trained from scratch.
Speaker conditioning everywhere: Speaker embeddings are injected into decoder, posterior encoder, flow, and duration predictor via global conditioning.
Text encoder frozen: Preserves Hausa linguistic knowledge learned from MMS pretraining.
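
The transfer rule above can be sketched as a filter over checkpoint keys. This is a minimal illustration, not the training script's actual code: the `text_encoder.` key prefix and the helper name `select_transferred_weights` are assumptions, and plain strings stand in for tensors so the snippet runs without PyTorch.

```python
def select_transferred_weights(pretrained_state, prefix="text_encoder."):
    """Keep only entries whose key starts with `prefix` (hypothetical key layout)."""
    return {k: v for k, v in pretrained_state.items() if k.startswith(prefix)}


# Toy stand-in for a pretrained checkpoint; real values would be tensors.
pretrained = {
    "text_encoder.embed_tokens.weight": "tensor A",
    "text_encoder.encoder.layers.0.attention.q_proj.weight": "tensor B",
    "decoder.conv_pre.weight": "tensor C",
    "flow.flows.0.conv_pre.weight": "tensor D",
}

transferred = select_transferred_weights(pretrained)
skipped = sorted(set(pretrained) - set(transferred))

print(sorted(transferred))  # text-encoder keys only
print(skipped)              # voice-producing components, left to train from scratch
```

In a real run, the selected entries would be loaded into the new model's text encoder and those parameters marked as non-trainable, while everything in `skipped` keeps its fresh initialization.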

Training Data

Dataset: suleiman2003/final_corrected_hausa_dataset
Columns: audio, text, speaker_id, gender
Split handling: The original dataset has only a train split, so it is automatically split 95/5 into train and validation sets.
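
A rough sketch of the two preprocessing steps implied here: mapping `speaker_id` values to contiguous indices for the embedding table, and holding out 5% of rows for validation. The helper names, the seed, and the toy speaker labels are all assumptions for illustration; the training script may do this differently (for example with the `datasets` library's own split utilities).

```python
import random


def remap_speakers(speaker_ids):
    """Map arbitrary speaker labels to contiguous indices 0..n_speakers-1."""
    unique = sorted(set(speaker_ids))
    return {spk: idx for idx, spk in enumerate(unique)}


def split_indices(n_examples, val_fraction=0.05, seed=42):
    """Shuffle example indices and hold out `val_fraction` of them for validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_examples * val_fraction))
    return indices[n_val:], indices[:n_val]


# Toy speaker_id column (hypothetical labels); the real one comes from the dataset.
speaker_column = ["spk_07", "spk_02", "spk_07", "spk_11"] * 25  # 100 rows
speaker_map = remap_speakers(speaker_column)
train_idx, val_idx = split_indices(len(speaker_column))

print(speaker_map)                   # {'spk_02': 0, 'spk_07': 1, 'spk_11': 2}
print(len(train_idx), len(val_idx))  # 95 5
```

The number of distinct keys in `speaker_map` is what `n_speakers` would be set to when building the speaker embedding table.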

Model Specs

Component	Details
Architecture	VITS (multi-speaker)
Language	Hausa (hau)
Sampling Rate	16,000 Hz
Hidden Size	192
FFN Dim	768
Attention Heads	2
Transformer Layers	6
Speaker Embedding Dim	256
Generator Params	~40M
Discriminator Params	~47M
Upsample Rates	[8, 8, 2, 2]
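
Several of these numbers are tied together. The product of the upsample rates is the number of audio samples the decoder emits per latent frame; in standard VITS setups this equals the STFT hop length, which is assumed here because the table does not list it. The short calculation below combines it with the sampling rate above and the 8192-sample segment size from Training Configuration.

```python
import math

upsample_rates = [8, 8, 2, 2]
sampling_rate = 16_000  # Hz
segment_size = 8_192    # samples per training segment

# Each latent frame is expanded by every upsampling stage in turn.
samples_per_frame = math.prod(upsample_rates)            # 8 * 8 * 2 * 2 = 256
frames_per_segment = segment_size // samples_per_frame   # 8192 // 256 = 32

frame_shift_ms = 1000 * samples_per_frame / sampling_rate  # 16.0 ms per frame
segment_ms = 1000 * segment_size / sampling_rate           # 512.0 ms per segment

print(samples_per_frame, frames_per_segment, frame_shift_ms, segment_ms)
```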

Training Configuration

Hyperparameter	Value
Learning Rate	2e-4
Optimizer	AdamW (β₁=0.8, β₂=0.99)
LR Schedule	ExponentialLR (γ=0.999875)
Batch Size	16
Epochs	200
Mel Loss Weight	45
KL Loss Weight	1.0
FP16	✅
Segment Size	8192 samples
Max Audio Length	10 seconds
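
To get a feel for the LR schedule, the decay can be computed by hand. The sketch assumes the scheduler is stepped once per epoch, as in the reference VITS implementation; if the training script steps it per batch instead, the decay would be far steeper.

```python
base_lr = 2e-4
gamma = 0.999875
epochs = 200


def lr_after(steps, base=base_lr, decay=gamma):
    """Learning rate after `steps` scheduler steps of exponential decay."""
    return base * decay ** steps


final_lr = lr_after(epochs)
print(f"LR after {epochs} epochs: {final_lr:.3e}")
print(f"Fraction of initial LR: {final_lr / base_lr:.4f}")
```

Under the per-epoch assumption the learning rate drops by only about 2.5% over the whole run, so the schedule acts as a very gentle decay rather than an aggressive anneal.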

Losses

The model is trained with 5 losses following the original VITS paper:

Mel reconstruction loss (L1, weight=45): Ensures spectral fidelity
KL divergence loss (weight=1.0): Aligns prior and posterior distributions
Adversarial loss: Least-squares GAN for realistic waveforms
Feature matching loss: Matches discriminator intermediate features
Duration loss: Stochastic duration predictor negative log-likelihood
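
How these terms combine can be sketched with plain floats. The weights for the mel and KL terms come from the list above; leaving the adversarial, feature-matching, and duration terms unweighted follows the reference VITS implementation and is an assumption about this repository. The function names are illustrative only.

```python
def lsgan_generator_loss(fake_scores):
    """Least-squares generator loss: push discriminator outputs on fakes toward 1."""
    return sum((1.0 - s) ** 2 for s in fake_scores) / len(fake_scores)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real outputs toward 1, fake outputs toward 0."""
    real = sum((1.0 - s) ** 2 for s in real_scores) / len(real_scores)
    fake = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real + fake


def generator_total(adv, feat, mel, kl, dur, mel_weight=45.0, kl_weight=1.0):
    """Weighted sum of the five generator-side loss terms."""
    return adv + feat + mel_weight * mel + kl_weight * kl + dur


# Toy values, chosen only to show the weighting.
total = generator_total(adv=2.0, feat=3.0, mel=0.5, kl=1.5, dur=1.0)
print(total)  # 30.0
```

With a weight of 45, even a small mel reconstruction error dominates the sum, which is why spectral fidelity tends to drive early training.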

How to Train

Prerequisites

pip install -r requirements.txt

Run Training

# Single GPU
python train_hausa_tts.py

# The script will:
# 1. Load the dataset from HuggingFace Hub
# 2. Auto-detect speakers and remap IDs
# 3. Split into train/val (95/5)
# 4. Load text encoder from facebook/mms-tts-hau
# 5. Train with full VITS losses
# 6. Push final model to Hub

Run on HuggingFace Jobs

# Recommended: A10G (24GB VRAM) or better
huggingface-cli jobs run train_hausa_tts.py \
  --hardware a10g-large \
  --timeout 8h \
  --dependencies torch torchaudio transformers datasets librosa trackio huggingface_hub soundfile numpy scipy

Recommended Hardware

Minimum: T4 (16GB) with batch_size=8
Recommended: A10G (24GB) with batch_size=16
Optimal: A100 (80GB) with batch_size=64

Inference

import json

import soundfile as sf
import torch

# SynthesizerTrn is defined in train_hausa_tts.py. Importing it assumes the
# script can be imported without starting a training run; otherwise copy the
# class definition into this file instead.
from train_hausa_tts import SynthesizerTrn

# Load the trained generator weights and the model config
checkpoint = torch.load("generator.pth", map_location="cpu")
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

# Rebuild the model with the same hyperparameters used for training
model = SynthesizerTrn(
    n_vocab=config["vocab_size"],
    spec_channels=config["n_fft"] // 2 + 1,
    segment_size=config["segment_size"] // config["hop_length"],
    inter_channels=config["inter_channels"],
    hidden_channels=config["hidden_size"],
    filter_channels=config["filter_channels"],
    n_heads=config["n_heads"],
    n_layers=config["n_layers"],
    kernel_size=config["kernel_size"],
    p_dropout=config["p_dropout"],
    resblock_kernel_sizes=config["resblock_kernel_sizes"],
    resblock_dilation_sizes=config["resblock_dilation_sizes"],
    upsample_rates=config["upsample_rates"],
    upsample_initial_channel=config["upsample_initial_channel"],
    upsample_kernel_sizes=config["upsample_kernel_sizes"],
    n_speakers=config["n_speakers"],
    gin_channels=config["gin_channels"],
    use_sdp=True,
)
model.load_state_dict(checkpoint["model"])
model.eval()

# Tokenize text
vocab = config["vocab"]

def text_to_ids(text, vocab, add_blank=True):
    """Map characters to vocabulary IDs; characters not in the vocab are dropped."""
    text = text.lower().strip()
    ids = [vocab[c] for c in text if c in vocab]
    if add_blank:
        # Intersperse the blank token (ID 0) before, between, and after all IDs
        new_ids = [0] * (len(ids) * 2 + 1)
        new_ids[1::2] = ids
        ids = new_ids
    return ids

# Generate speech
text = "Sannu da zuwa"
text_ids = torch.LongTensor([text_to_ids(text, vocab)])
text_lengths = torch.LongTensor([text_ids.size(1)])
speaker_id = torch.LongTensor([0])  # Choose a speaker index in [0, n_speakers)

with torch.no_grad():
    audio, _, _, _ = model.infer(text_ids, text_lengths, speaker_id)
    audio = audio.squeeze().cpu().numpy()

# Save at the model's 16 kHz sampling rate
sf.write("output.wav", audio, 16000)
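
To make the `add_blank` step concrete, here is the tokenizer run on a toy vocabulary. The vocabulary below is hypothetical (the real one is read from `config.json`), and the function is repeated so the snippet runs on its own without the model.

```python
def text_to_ids(text, vocab, add_blank=True):
    """Same logic as the inference snippet above, repeated to run standalone."""
    text = text.lower().strip()
    ids = [vocab[c] for c in text if c in vocab]
    if add_blank:
        new_ids = [0] * (len(ids) * 2 + 1)
        new_ids[1::2] = ids
        ids = new_ids
    return ids


# Hypothetical five-symbol vocabulary; ID 0 plays the role of the blank token.
toy_vocab = {"_": 0, "a": 1, "n": 2, "s": 3, "u": 4}

ids = text_to_ids("Sannu!", toy_vocab)
print(ids)  # [0, 3, 0, 1, 0, 2, 0, 2, 0, 4, 0]
```

Two things are worth noticing: the input is lowercased before lookup, and the `!` is dropped silently because it is not in the vocabulary. When synthesizing real text, it may be worth checking for unexpectedly dropped characters.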

Citation

@inproceedings{kim2021conditional,
  title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  year={2021}
}

@article{pratap2023scaling,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

License

CC-BY-NC-4.0 (following the base model license)
