Hausa Multi-Speaker VITS TTS

Multi-speaker text-to-speech model for the Hausa language, based on the VITS architecture (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech).

Architecture

Built on the VITS architecture with multi-speaker conditioning via learned speaker embeddings (256-dim).

What's transferred from facebook/mms-tts-hau:

  • Text encoder (6-layer Transformer): Hausa phoneme/character knowledge
  • Vocabulary (34 characters): Hausa-specific character set including ƙ, ɓ, ɗ, ā, etc.

What's trained from scratch (NO voice leakage):

  • 🆕 Speaker embedding table: nn.Embedding(n_speakers, 256) — learned lookup
  • 🆕 HiFi-GAN decoder: Waveform generator with speaker conditioning
  • 🆕 Posterior encoder: 16-layer WaveNet with speaker conditioning
  • 🆕 Normalizing flow: 4 coupling layers with speaker conditioning
  • 🆕 Stochastic duration predictor: With speaker conditioning
  • 🆕 Multi-Period Discriminator: 6 sub-discriminators for adversarial training

Key Design Decisions:

  1. No original voice leakage: Only the text encoder (speaker-independent by design in VITS) and its vocabulary are loaded from the pretrained model. All voice-producing components are trained from scratch.
  2. Speaker conditioning everywhere: Speaker embeddings are injected into decoder, posterior encoder, flow, and duration predictor via global conditioning.
  3. Text encoder frozen: Preserves Hausa linguistic knowledge learned from MMS pretraining.
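The design decisions above can be illustrated with a minimal PyTorch sketch. It is not the code in train_hausa_tts.py: `ConditionedBlock` and the `nn.Linear` stand-in for the text encoder are hypothetical toy modules, used only to show how a 256-dim speaker embedding can be broadcast over time as global conditioning and how a pretrained module can be frozen.

```python
import torch
from torch import nn


class ConditionedBlock(nn.Module):
    """Toy stand-in for a VITS sub-module that accepts global conditioning."""

    def __init__(self, channels: int, gin_channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # 1x1 conv projects the speaker vector to the block's channel count
        self.cond = nn.Conv1d(gin_channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, time]; g: [batch, gin_channels, 1]
        # cond(g) has a time axis of 1 and broadcasts over every frame of x
        return self.conv(x + self.cond(g))


n_speakers, gin_channels, hidden = 4, 256, 192  # n_speakers is made up
emb_g = nn.Embedding(n_speakers, gin_channels)  # learned speaker lookup
block = ConditionedBlock(hidden, gin_channels)

# Placeholder for the pretrained text encoder (hypothetical, not the real one)
text_encoder = nn.Linear(hidden, hidden)
for p in text_encoder.parameters():
    p.requires_grad = False  # frozen: excluded from gradient updates
text_encoder.eval()

sid = torch.LongTensor([0, 3])       # one speaker id per batch item
g = emb_g(sid).unsqueeze(-1)         # [2, 256, 1]
x = torch.randn(2, hidden, 50)       # fake hidden sequence, 50 frames
y = block(x, g)                      # [2, 192, 50]

# Only parameters that still require gradients would go to the optimizer
trainable = [p for m in (emb_g, block, text_encoder)
             for p in m.parameters() if p.requires_grad]
```

Passing only `trainable` to the optimizer is one common way to keep a frozen module untouched; whether the training script does exactly this is not stated in this card.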

Training Data

Model Specs

| Component | Details |
|---|---|
| Architecture | VITS (multi-speaker) |
| Language | Hausa (hau) |
| Sampling Rate | 16,000 Hz |
| Hidden Size | 192 |
| FFN Dim | 768 |
| Attention Heads | 2 |
| Transformer Layers | 6 |
| Speaker Embedding Dim | 256 |
| Generator Params | ~40M |
| Discriminator Params | ~47M |
| Upsample Rates | [8, 8, 2, 2] |
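
A few derived quantities follow from these specs. The sketch below assumes, as is usual for HiFi-GAN-style decoders, that the product of the upsample rates equals the spectrogram hop length; this card does not state that relationship explicitly, so treat it as an assumption.

```python
from math import prod

sampling_rate = 16_000         # Hz, from the specs
upsample_rates = [8, 8, 2, 2]  # from the specs
hidden_size = 192
attention_heads = 2

# Assumption: the decoder expands one latent frame into `hop_length` samples
hop_length = prod(upsample_rates)
frames_per_second = sampling_rate / hop_length
head_dim = hidden_size // attention_heads

print(hop_length, frames_per_second, head_dim)  # 256 62.5 96
```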

Training Configuration

| Hyperparameter | Value |
|---|---|
| Learning Rate | 2e-4 |
| Optimizer | AdamW (β₁=0.8, β₂=0.99) |
| LR Schedule | ExponentialLR (γ=0.999875) |
| Batch Size | 16 |
| Epochs | 200 |
| Mel Loss Weight | 45 |
| KL Loss Weight | 1.0 |
| FP16 | |
| Segment Size | 8192 samples |
| Max Audio Length | 10 seconds |
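
To get a feel for these values, the sketch below computes the learning rate under exponential decay and the duration of one training segment. It assumes the scheduler is stepped once per epoch, as in the reference VITS training loop; this card does not say how often it is stepped.

```python
base_lr = 2e-4
gamma = 0.999875
epochs = 200
segment_size = 8192      # samples
sampling_rate = 16_000   # Hz


def lr_at_epoch(epoch: int) -> float:
    """Learning rate after `epoch` scheduler steps of exponential decay."""
    return base_lr * gamma ** epoch


# Assumption: one scheduler step per epoch
final_lr = lr_at_epoch(epochs)
segment_seconds = segment_size / sampling_rate

print(f"final lr: {final_lr:.3e}, segment: {segment_seconds} s")
```

With per-epoch stepping the decay is gentle: after 200 epochs the rate is still roughly 1.95e-4. Each random segment fed to the decoder covers about half a second of audio.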

Losses

The model is trained with 5 losses following the original VITS paper:

  1. Mel reconstruction loss (L1, weight=45): Ensures spectral fidelity
  2. KL divergence loss (weight=1.0): Aligns prior and posterior distributions
  3. Adversarial loss: Least-squares GAN for realistic waveforms
  4. Feature matching loss: Matches discriminator intermediate features
  5. Duration loss: Stochastic duration predictor negative log-likelihood
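
How these terms might combine on the generator side can be sketched in plain Python. The weights 45 and 1.0 come from this card; the unit weights on the adversarial, feature matching, and duration terms follow the reference VITS implementation and are an assumption here. The function names are illustrative and are not taken from train_hausa_tts.py.

```python
def lsgan_generator_loss(fake_scores):
    """Least-squares generator loss: push discriminator scores on fakes towards 1."""
    return sum((1.0 - s) ** 2 for s in fake_scores) / len(fake_scores)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real scores towards 1, fake towards 0."""
    real = sum((1.0 - s) ** 2 for s in real_scores) / len(real_scores)
    fake = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real + fake


def generator_total(mel_l1, kl, adv, feat_match, duration, c_mel=45.0, c_kl=1.0):
    """Weighted sum of the five generator-side losses."""
    # c_mel and c_kl are from the card; the other weights (1.0) are assumed
    return c_mel * mel_l1 + c_kl * kl + adv + feat_match + duration


adv = lsgan_generator_loss([0.5, 1.0])  # 0.125
total = generator_total(mel_l1=0.5, kl=2.0, adv=adv, feat_match=1.0, duration=0.5)
print(total)  # 26.125
```

With a weight of 45, the mel term dominates the sum early in training, which is why spectral fidelity tends to improve before adversarial sharpness does.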

How to Train

Prerequisites

pip install -r requirements.txt

Run Training

# Single GPU
python train_hausa_tts.py

# The script will:
# 1. Load the dataset from HuggingFace Hub
# 2. Auto-detect speakers and remap IDs
# 3. Split into train/val (95/5)
# 4. Load text encoder from facebook/mms-tts-hau
# 5. Train with full VITS losses
# 6. Push final model to Hub
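
Steps 2 and 3 can be pictured with a short standard-library sketch. The helpers `remap_speakers` and `train_val_split` and the speaker labels are hypothetical; the actual script may detect speakers and split the data differently.

```python
import random


def remap_speakers(raw_ids):
    """Map arbitrary speaker labels to contiguous indices 0..n_speakers-1."""
    mapping = {spk: idx for idx, spk in enumerate(sorted(set(raw_ids)))}
    return [mapping[s] for s in raw_ids], mapping


def train_val_split(n_items, val_fraction=0.05, seed=0):
    """Shuffle indices deterministically and hold out `val_fraction` for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_items * val_fraction))
    return indices[n_val:], indices[:n_val]


# 100 made-up utterances from three made-up speakers
raw = ["spk_17", "spk_03", "spk_17", "spk_42"] * 25
speaker_ids, mapping = remap_speakers(raw)
train_idx, val_idx = train_val_split(len(raw))

print(mapping)                        # {'spk_03': 0, 'spk_17': 1, 'spk_42': 2}
print(len(train_idx), len(val_idx))   # 95 5
```

Contiguous IDs matter because the speaker embedding table is indexed directly by speaker ID, so `n_speakers` must equal the number of distinct remapped values.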

Run on HuggingFace Jobs

# Recommended: A10G (24GB VRAM) or better
huggingface-cli jobs run train_hausa_tts.py \
  --hardware a10g-large \
  --timeout 8h \
  --dependencies torch torchaudio transformers datasets librosa trackio huggingface_hub soundfile numpy scipy

Recommended Hardware

  • Minimum: T4 (16GB) with batch_size=8
  • Recommended: A10G (24GB) with batch_size=16
  • Optimal: A100 (80GB) with batch_size=64

Inference

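import json

import torch

# SynthesizerTrn is defined in train_hausa_tts.py. Importing it assumes the
# script guards its training entry point with `if __name__ == "__main__":`;
# otherwise, copy the class into your own module.
from train_hausa_tts import SynthesizerTrn

# Load the trained generator weights and the model config
checkpoint = torch.load("generator.pth", map_location="cpu")
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)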

model = SynthesizerTrn(
    n_vocab=config["vocab_size"],
    spec_channels=config["n_fft"] // 2 + 1,
    segment_size=config["segment_size"] // config["hop_length"],
    inter_channels=config["inter_channels"],
    hidden_channels=config["hidden_size"],
    filter_channels=config["filter_channels"],
    n_heads=config["n_heads"],
    n_layers=config["n_layers"],
    kernel_size=config["kernel_size"],
    p_dropout=config["p_dropout"],
    resblock_kernel_sizes=config["resblock_kernel_sizes"],
    resblock_dilation_sizes=config["resblock_dilation_sizes"],
    upsample_rates=config["upsample_rates"],
    upsample_initial_channel=config["upsample_initial_channel"],
    upsample_kernel_sizes=config["upsample_kernel_sizes"],
    n_speakers=config["n_speakers"],
    gin_channels=config["gin_channels"],
    use_sdp=True,
)
model.load_state_dict(checkpoint["model"])
model.eval()

# Tokenize text
vocab = config["vocab"]
def text_to_ids(text, vocab, add_blank=True):
    text = text.lower().strip()
    ids = [vocab[c] for c in text if c in vocab]
    if add_blank:
        new_ids = [0] * (len(ids) * 2 + 1)
        new_ids[1::2] = ids
        ids = new_ids
    return ids

# Generate speech
text = "Sannu da zuwa"
text_ids = torch.LongTensor([text_to_ids(text, vocab)])
text_lengths = torch.LongTensor([text_ids.size(1)])
speaker_id = torch.LongTensor([0])  # Choose speaker

with torch.no_grad():
    audio, _, _, _ = model.infer(text_ids, text_lengths, speaker_id)
    audio = audio.squeeze().cpu().numpy()

# Save
import soundfile as sf
sf.write("output.wav", audio, 16000)
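
Note that text_to_ids silently skips any character that is not in the vocabulary, so digits or punctuation can disappear from the input without warning. The sketch below shows one hypothetical way to surface such characters before synthesis, and isolates the blank-interleaving step for clarity. `toy_vocab` is made up for the example and is not the model's real 34-character vocabulary.

```python
def find_unknown_chars(text, vocab):
    """Return the distinct characters of `text` (lowercased, stripped) missing from `vocab`."""
    unknown = []
    for ch in text.lower().strip():
        if ch not in vocab and ch not in unknown:
            unknown.append(ch)
    return unknown


def intersperse_blank(ids, blank_id=0):
    """Insert `blank_id` before, between, and after every token id."""
    out = [blank_id] * (len(ids) * 2 + 1)
    out[1::2] = ids
    return out


# Made-up vocabulary for illustration only
toy_vocab = {"_": 0, "a": 1, "n": 2, "s": 3, "u": 4}

print(find_unknown_chars("Sannu!", toy_vocab))  # ['!']
print(intersperse_blank([3, 1, 2]))             # [0, 3, 0, 1, 0, 2, 0]
```

Running a check like this on new input text makes it easier to decide whether to normalize it (for example, spelling out numbers) before passing it to the model.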

Citation

@inproceedings{kim2021conditional,
  title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  year={2021}
}

@article{pratap2023scaling,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

License

CC-BY-NC-4.0 (following the base model license)
