StyleTTS2 — Basque Multispeaker TTS

This is a Basque text-to-speech (TTS) model based on the StyleTTS2 architecture and adapted for Basque, where it produces good-quality speech. It was trained from scratch on the multispeaker Sonora Basque speech corpus.

Examples (playable):

  • Sample 1 — "Cesare Pavese XXI. mendeko idazle italiar esanguratzuenetakoa da."

  • Sample 2 — "Herriko errekan bakarrik korrika."

Main modifications:

  • PL-BERT-eu: a PL-BERT model trained with a WordPiece tokenizer on phonemized Basque text.
  • ASR-eu: an ASR model (text aligner) trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original ASR model from StyleTTS2.
  • Phonemizer: we used code developed by Aholab to generate the IPA phonemes for training. A demo of the Basque phonemizer is available at arrandi/phonemizer-eus-esp, and the code used to generate the IPA phonemes is in the phonemizer directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.
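
The collapsing step can be illustrated with a minimal sketch. The mapping below is hypothetical and is not the table used by the Aholab phonemizer; it only shows the idea of replacing each multi-character IPA sequence with one placeholder symbol so that every phoneme occupies exactly one character.

```python
# Hypothetical mapping (an assumption for illustration, not the Aholab table):
# multi-character IPA sequence -> single placeholder character.
COLLAPSE_MAP = {
    "tʃ": "C",
    "dʒ": "J",
    "ts": "Z",
}


def collapse_phonemes(ipa, mapping=COLLAPSE_MAP):
    """Replace every multi-character phoneme in `ipa` with its one-character symbol."""
    # Replace longer sequences first so a short key never splits a longer one.
    for seq in sorted(mapping, key=len, reverse=True):
        ipa = ipa.replace(seq, mapping[seq])
    return ipa


collapsed = collapse_phonemes("etʃe")  # the two characters "tʃ" become one "C"
```

With one character per phoneme, the phoneme count of an utterance equals its string length, which keeps the text side of the alignment simple.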

Model details

  • Architecture: StyleTTS2 (from scratch)
  • Language: Basque (eu)
  • Speakers: Multispeaker (two speakers)
  • Text input: Basque IPA phonemes
  • Speech LM: WavLM-Base-Plus
  • Sample rate: 24 000 Hz
  • Decoder: HiFiGAN

Training dataset

Sonora multispeaker Basque speech dataset.

  • Number of speakers: two speakers
  • Audio: 13,500 utterances per speaker, totalling 34 hours and 18 minutes.
  • Dataset split: We used 100 samples for validation and 500 for testing.
  • OOD dataset: We used a separate text dataset as the out-of-distribution (OOD) dataset.

Training

Brief summary of training parameters used (from config_basque_multispeaker_phoneme_wavlm_800.yml):

  • Device: cuda
  • Stages: 1st-stage epochs = 50; 2nd-stage epochs = 30
  • Batch: batch_size = 2
  • Max length: max_len = 500
  • Learning rates: lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
  • Audio / features: sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
  • Model: multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
  • Diffusion / schedule: diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
  • Loss highlights: lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
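
The audio settings above imply a fixed frame rate. The short calculation below works it out. It assumes that max_len counts mel frames, as the comment in the upstream StyleTTS2 config suggests; that is not confirmed for this repository.

```python
# Values copied from the training summary above.
SR = 24000         # sample rate in Hz
HOP_LENGTH = 300   # samples between consecutive frames
WIN_LENGTH = 1200  # samples per analysis window
MAX_LEN = 500      # assumption: measured in mel frames

frame_shift_ms = 1000 * HOP_LENGTH / SR       # 12.5 ms between frames
frames_per_second = SR / HOP_LENGTH           # 80 frames per second
window_ms = 1000 * WIN_LENGTH / SR            # 50 ms analysis window
max_clip_seconds = MAX_LEN * HOP_LENGTH / SR  # 6.25 s per training clip

print(frame_shift_ms, frames_per_second, window_ms, max_clip_seconds)
```

Under that assumption, each training clip is cut to at most 6.25 seconds of audio.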

Files in this repository

  • config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml: training and model config → place at Models/Basque_Multispeaker_Phoneme_wavlm_normal/
  • epoch_2nd_00030.pth: main TTS checkpoint → place at Models/Basque_Multispeaker_Phoneme_wavlm_normal/
  • epoch_00200.pth: Basque ASR / text aligner → place at Utils/ASR_basque/
  • step_4000000.t7: phoneme PL-BERT → place at Utils/PLBERT_phoneme/

Note: The JDC F0 extractor (Utils/JDC/bst.t7) is not Basque-specific — download it from the original StyleTTS2 repository and place it at Utils/JDC/bst.t7.

Setup

# 1. Clone the code repository
git clone https://github.com/AArriandiaga/StyleTTS2_basque
cd StyleTTS2_basque

# 2. Install dependencies
pip install -r requirements.txt

# 3. Create the directories that will hold the model weights
mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_normal Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC

# 4. Download bst.t7 from the original StyleTTS2 repo (not Basque-specific)
wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7

# 5. Download the Basque weights from this HF repo with huggingface_hub
python - <<'EOF'
from huggingface_hub import hf_hub_download
import shutil

repo = "HiTZ/styletts2-basque"
files = {
    "config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "epoch_00200.pth":     "Utils/ASR_basque/epoch_00200.pth",
    "step_4000000.t7":     "Utils/PLBERT_phoneme/step_4000000.t7",
}
# bst.t7 comes from the original StyleTTS2 repo — download separately:
# https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
for hf_name, local_path in files.items():
    src = hf_hub_download(repo_id=repo, filename=hf_name)
    shutil.copy(src, local_path)
    print(f"✓ {local_path}")
EOF
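
After the setup steps, a quick check that every file sits where the code expects it can save a confusing load error later. The sketch below is not part of the repository; the function name missing_files is made up for illustration, and the paths are the ones listed in this README.

```python
from pathlib import Path

# Paths taken from the file list and setup steps in this README.
EXPECTED_FILES = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_4000000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root=".", expected=EXPECTED_FILES):
    """Return the expected paths that are not regular files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the top of the cloned StyleTTS2_basque checkout.
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All model files are in place.")
```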

Inference

CLI:

python inference.py \
    --config  Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml \
    --model   Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth \
    --ref     Demo/ref_antton.wav \
    --text    "Kaixo, zelan zaude?" \
    --output  output/kaixo.wav
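
To synthesize several sentences with the CLI, the command can be assembled in a small script. This is a sketch that uses only the flags shown above; build_command and the sentences dictionary are illustrative names, not part of the repository.

```python
import shlex

MODEL_DIR = "Models/Basque_Multispeaker_Phoneme_wavlm_normal"
CONFIG = f"{MODEL_DIR}/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml"
MODEL = f"{MODEL_DIR}/epoch_2nd_00030.pth"


def build_command(text, output, ref="Demo/ref_antton.wav"):
    """Build the argument list for one inference.py call."""
    return [
        "python", "inference.py",
        "--config", CONFIG,
        "--model", MODEL,
        "--ref", ref,
        "--text", text,
        "--output", output,
    ]


# Illustrative batch: output file stem -> sentence to synthesize.
sentences = {
    "kaixo": "Kaixo, zelan zaude?",
    "arratsalde": "Arratsalde on!",
}

for name, text in sentences.items():
    cmd = build_command(text, f"output/{name}.wav")
    # Print the shell-quoted command; to execute it instead, pass `cmd`
    # to subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with punctuation in the text.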

Python API:

from inference import Synthesizer

synth = Synthesizer(
    config='Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml',
    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth',
    default_ref='Demo/ref_antton.wav',
)

wav = synth.run("Kaixo, zelan zaude?")
synth.save(wav, "output/kaixo.wav")

# Different speaker
wav2 = synth.run("Arratsalde on!", ref='Demo/ref_maider.wav')
synth.save(wav2, "output/arratsalde.wav")

Key parameters for run():

  • ref (default: the constructor's default_ref): reference WAV for speaker style
  • alpha (default: 0.3): timbre mixing (0 = reference, 1 = sampled)
  • beta (default: 0.7): prosody mixing (0 = reference, 1 = sampled)
  • diffusion_steps (default: 5): quality vs. speed trade-off
  • embedding_scale (default: 1.0): expressiveness (>1 = more expressive)
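
One way to keep these settings consistent across calls is to collect them in a small helper before passing them to run(). The sketch below is not part of the repository: run_kwargs is a hypothetical helper, and the 0 to 1 range check for alpha and beta is an assumption based on the descriptions above.

```python
# Defaults as documented for run() in this README.
DEFAULTS = {
    "alpha": 0.3,
    "beta": 0.7,
    "diffusion_steps": 5,
    "embedding_scale": 1.0,
}


def run_kwargs(**overrides):
    """Merge overrides into the documented defaults and sanity-check them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    # Assumption: alpha and beta are mixing weights in the range [0, 1].
    for name in ("alpha", "beta"):
        if not 0.0 <= params[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    if params["diffusion_steps"] < 1:
        raise ValueError("diffusion_steps must be at least 1")
    return params


# Stay as close as possible to the reference recording in timbre and prosody.
close_to_reference = run_kwargs(alpha=0.0, beta=0.0)

# With a Synthesizer instance `synth` created as shown earlier:
# wav = synth.run("Kaixo, zelan zaude?", **close_to_reference)
```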

Reference speakers

Two reference audios are included in the repo under Demo/:

  • ref_antton.wav — male speaker
  • ref_maider.wav — female speaker

All credit goes to the authors of StyleTTS2.

Citation

@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023},
}

Additional Information

Author

Author: Ander Arriandiaga, Aholab (HiTZ), EHU

Contact

For further information, please send an email to inma.hernaez@ehu.eus.

Copyright

Copyright (c) 2026 by Aholab, HiTZ.

License

Apache-2.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.
