Unit HiFi-GAN Model Card

Model Overview

This checkpoint contains a SpeechBrain Unit HiFi-GAN vocoder. It converts discrete speech units into waveform audio and uses the speechbrain.lobes.models.HifiGAN.UnitHifiganGenerator architecture together with a HiFi-GAN discriminator.

The saved hyperparameters indicate a multi-speaker discrete-unit setup with the following key settings:

vocab_size: 1001
embedding_dim: 1024
in_channels: 1216
out_channels: 1
resblock_type: 1
upsample_factors: [5, 4, 4, 2, 2]
upsample_kernel_sizes: [11, 8, 8, 4, 4]
duration_predictor: False
multi_speaker: True
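
The settings above imply some simple arithmetic, sketched below. The product of upsample_factors gives the number of waveform samples generated per input unit. The difference between in_channels and embedding_dim is 192, which would be consistent with a 192-dimensional speaker embedding concatenated to each unit embedding. That reading is an inference from the numbers, not something stated in the saved hyperparameters, and the helper name expected_num_samples is hypothetical.

```python
from math import prod

# Values copied from the hyperparameters listed above.
upsample_factors = [5, 4, 4, 2, 2]
embedding_dim = 1024
in_channels = 1216

# Waveform samples produced for each input unit (total upsampling factor).
samples_per_unit = prod(upsample_factors)

# Channels left over after the unit embedding. Assumption: these carry the
# speaker embedding; the hyperparameters do not say so explicitly.
extra_channels = in_channels - embedding_dim


def expected_num_samples(num_units):
    """Nominal output length, in samples, for a sequence of num_units units."""
    return num_units * samples_per_unit


print(samples_per_unit)          # 320
print(extra_channels)            # 192
print(expected_num_samples(50))  # 16000
```

If the audio is sampled at 16 kHz (an assumption; the sampling rate is not stated in this README), 320 samples per unit corresponds to one unit every 20 ms.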

The speaker encoder used for training and speaker conditioning was speechbrain/spkrec-ecapa-voxceleb-mel-spec. During inference, the vocoder is driven by precomputed speaker embeddings, and the provided script maps speaker names such as miren, nerea, and jon to their corresponding embedding files. The following voices are supported, and the index of each voice is its speaker ID. Each speaker's embedding vector is stored in ./speaker_embeddings/{idx}_XXXXX.npy. For example, the klara_eu speaker embedding is stored in ./speaker_embedding/9_*.npy.

["aintzane_eu", "alex", "amaia_eu", "andrea_eu", "inaki_eu", "jaione_eu", "jon", "karolina_eu", "kepa_eu", "kiko_eu", "klara_eu", "Maider", "miren", "monika_eu", "nerea", "pello2004_eu", "pello_eu", "xabier_eu"]
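
A minimal sketch of locating an embedding file by speaker ID, assuming only the {idx}_*.npy naming pattern described above. The helper name find_speaker_embedding is hypothetical and not part of the provided script. The directory is passed in by the caller, so check the actual folder name in the checkpoint.

```python
from pathlib import Path


def find_speaker_embedding(emb_dir, speaker_idx):
    """Return the single .npy file in emb_dir whose name starts with '<speaker_idx>_'."""
    matches = sorted(Path(emb_dir).glob(f"{speaker_idx}_*.npy"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one '{speaker_idx}_*.npy' in {emb_dir}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

The returned path can then be loaded with numpy.load to obtain the embedding vector.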

The discrete input tokens are extracted with the K-means model stored in the local kmeans/ folder for this experiment.

Intended Use

This model is intended for research and inference workflows that need waveform synthesis from discrete speech units. It is suitable for unit-based TTS or speech-to-speech pipelines when the unit extractor, tokenization, and sampling settings match the training setup.

Model Inputs and Outputs

Input:

A sequence of discrete speech units.
Optional speaker conditioning, when used by the surrounding pipeline.

Speaker conditioning is supplied through the speaker embedding extracted by the ECAPA-TDNN speaker encoder above, rather than by raw speaker IDs.

Output:

A generated waveform with one audio channel.

Training and Checkpoint Notes

This folder stores the checkpoint state at epoch 500, along with the generator, discriminator, optimizer, and scheduler states used during training.

The exact corpus used for this run is not documented in this README. Use the matching experiment configuration or recipe alongside this checkpoint if you need the original data provenance, preprocessing, or evaluation protocol.

Limitations

Output quality depends on using the same or compatible unit extractor and preprocessing pipeline used during training.
This checkpoint is not guaranteed to generalize well to out-of-domain speakers, recording conditions, or unit tokenizers.
The README does not report a formal benchmark table, so treat this as a model artifact description rather than an evaluation report.

Loading

In SpeechBrain, this checkpoint is typically loaded with speechbrain.inference.vocoders.UnitHIFIGAN using the checkpoint directory as the source.
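
A hedged sketch of that loading path follows. The function name synthesize and its arguments are placeholders. It assumes the checkpoint directory contains the hyperparameter file that from_hparams expects, and that the installed SpeechBrain release accepts a speaker embedding through the spk argument of decode_unit. Check both against your installed version.

```python
def synthesize(checkpoint_dir, units, spk_emb=None):
    """Decode a sequence of discrete units into a waveform.

    checkpoint_dir: local path to this checkpoint (placeholder).
    units: tensor of discrete unit IDs prepared by the caller (placeholder).
    spk_emb: optional precomputed speaker embedding tensor (placeholder).
    """
    # Imported lazily because SpeechBrain and PyTorch are heavy dependencies.
    from speechbrain.inference.vocoders import UnitHIFIGAN

    vocoder = UnitHIFIGAN.from_hparams(source=checkpoint_dir)
    # Assumption: decode_unit takes the speaker embedding as `spk`.
    return vocoder.decode_unit(units, spk=spk_emb)
```

The recipe's own infer.py script remains the reference for how units and speaker embeddings are prepared for this checkpoint.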

Training command used by the recipe (example only):

python ./recipes/Euskara/TTS/vocoder/hifigan_discrete/train_spk.py \
    ./recipes/Euskara/TTS/vocoder/hifigan_discrete/hparams/train_spk_sonora_2.yaml \
    --data_folder=/data/aholab/tts/eu/hifigan_spk/

Inference command used by the recipe:

python ./recipes/Euskara/TTS/vocoder/hifigan_discrete/infer.py \
    --input_path /path/to/audio_or_folder \
    --hubert_repo utter-project/mHuBERT-147 \
    --vocoder_repo /path/to/this/checkpoint \
    --kmeans_path kmeans/basque_hubert_k1000_L9.pt \
    --spk miren nerea jon

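Example of K-means code extraction (features is the array of frame-level encoder features, one row per frame):

    import joblib
    import numpy as np
    kmeans = joblib.load("/scratch/mriyadh/speechbrain/models/kmeans/kmeans__utter-project_mhubert-147__K1000__L9.pt")  # fitted K-means model
    discrete_codes = np.array(kmeans.predict(features))  # one cluster ID per frame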

Re-training Steps

Organize speaker audio files into the following directory structure:

data/
    speaker1/
        audio1.wav
        audio2.wav
    speakerN/
        audio1.wav
        audio2.wav

Configure hyperparameters by modifying the YAML files in recipes/Euskara/TTS/vocoder/hifigan_discrete/hparams/.
Start training using the training command shown in the Loading section above.
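
Before training, the directory layout described above can be checked with a short script. This is a sketch only: the helper name list_speaker_audio is hypothetical and is not part of the recipe, and it assumes one sub-directory per speaker containing .wav files.

```python
from pathlib import Path


def list_speaker_audio(data_root):
    """Map each speaker sub-directory name to its sorted list of .wav files."""
    speakers = {}
    for spk_dir in sorted(Path(data_root).iterdir()):
        if spk_dir.is_dir():
            speakers[spk_dir.name] = sorted(spk_dir.glob("*.wav"))
    return speakers


def report(data_root):
    """Print the number of .wav files found for each speaker."""
    for name, wavs in list_speaker_audio(data_root).items():
        print(f"{name}: {len(wavs)} file(s)")
```

Calling report("data") lists each speaker folder with its file count, which makes empty or misplaced speaker directories easy to spot.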

Training K-means

To train the K-means model, refer to the kmeans/train.py script in the riyadhrazzaq/llama_omni_asr_tts repository.

Citation

If you use this checkpoint, please cite the relevant SpeechBrain and HiFi-GAN work, and the Unit HiFi-GAN variant if applicable to your experiment.

Suggested references:

SpeechBrain: https://arxiv.org/abs/2106.04624
HiFi-GAN: https://arxiv.org/abs/2010.05646
Unit HiFi-GAN / scalable unit vocoder variant referenced in the implementation: https://arxiv.org/abs/2406.10735

Note: This README was generated using AI.

Evaluation results

MCD on Euskara TTS from Aholab (self-reported): 4.850