Unit HiFi-GAN Model Card
Model Overview
This checkpoint contains a SpeechBrain Unit HiFi-GAN vocoder. It converts discrete speech units into waveform audio and uses the speechbrain.lobes.models.HifiGAN.UnitHifiganGenerator architecture together with a HiFi-GAN discriminator.
The saved hyperparameters indicate a multi-speaker discrete-unit setup with the following key settings:
- vocab_size: 1001
- embedding_dim: 1024
- in_channels: 1216
- out_channels: 1
- resblock_type: 1
- upsample_factors: [5, 4, 4, 2, 2]
- upsample_kernel_sizes: [11, 8, 8, 4, 4]
- duration_predictor: False
- multi_speaker: True
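Two quantities follow from these settings by simple arithmetic. The sketch below computes how many waveform samples the generator emits per input unit (the product of the upsampling factors) and how many input channels remain beyond the unit embedding. Reading that remainder as the speaker-embedding width is an assumption, as is the 16 kHz sampling rate used for the duration estimate; neither is stated in the hyperparameters listed here.

```python
import math

# Values copied from the saved hyperparameters above.
upsample_factors = [5, 4, 4, 2, 2]
embedding_dim = 1024
in_channels = 1216

# Each upsampling stage multiplies the time resolution, so one input
# frame expands to the product of all factors.
samples_per_unit = math.prod(upsample_factors)

# Channels left over after the unit embedding. Assumption: these carry
# the speaker embedding that is attached to every frame.
extra_channels = in_channels - embedding_dim

# Assumption: 16 kHz audio (not stated in the hyperparameters above).
assumed_sample_rate = 16_000
ms_per_unit = 1000 * samples_per_unit / assumed_sample_rate

print(samples_per_unit, extra_channels, ms_per_unit)  # 320 192 20.0
```

Because duration_predictor is False, the output length is simply the number of input units times samples_per_unit under this reading.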
The speaker encoder used for training and speaker conditioning was speechbrain/spkrec-ecapa-voxceleb-mel-spec. During inference, the vocoder is driven by precomputed speaker embeddings, with the provided script mapping speaker names such as miren, nerea, and jon to their corresponding embedding files.
The following voices are supported; each voice's index is its speaker ID.
The corresponding speaker embedding vectors are stored in ./speaker_embeddings/{idx}_XXXXX.npy.
For example, the klara_eu speaker embedding is stored in ./speaker_embeddings/9_*.npy.
["aintzane_eu", "alex", "amaia_eu", "andrea_eu", "inaki_eu", "jaione_eu", "jon", "karolina_eu", "kepa_eu", "kiko_eu", "klara_eu", "Maider", "miren", "monika_eu", "nerea", "pello2004_eu", "pello_eu", "xabier_eu"]
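Since the part of each file name after the index is not fixed, a lookup by index prefix is one way to locate an embedding. The helper below is a minimal sketch; the function name is hypothetical and the directory layout is assumed to match the {idx}_XXXXX.npy pattern described above.

```python
from pathlib import Path


def find_speaker_embedding(root, speaker_id):
    """Return the single .npy file whose name starts with '<speaker_id>_'."""
    # The pattern starts with the id followed by an underscore, so id 9
    # does not match files such as 19_*.npy or 90_*.npy.
    matches = sorted(Path(root).glob(f"{speaker_id}_*.npy"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one embedding for speaker id {speaker_id} "
            f"in {root}, found {len(matches)}"
        )
    return matches[0]
```

The returned path can then be read with numpy.load, for example numpy.load(find_speaker_embedding("./speaker_embeddings", 9)).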
The discrete input tokens are extracted with the K-means model stored in the local kmeans/ folder for this experiment.
Intended Use
This model is intended for research and inference workflows that need waveform synthesis from discrete speech units. It is suitable for unit-based TTS or speech-to-speech pipelines when the unit extractor, tokenization, and sampling settings match the training setup.
Model Inputs and Outputs
Input:
- A sequence of discrete speech units.
- Optional speaker conditioning, when used by the surrounding pipeline.
Speaker conditioning is supplied through the speaker embedding extracted by the ECAPA-TDNN speaker encoder above, rather than by raw speaker IDs.
Output:
- A generated waveform with one audio channel.
Training and Checkpoint Notes
This folder stores the checkpoint state at epoch 500, along with the generator, discriminator, optimizer, and scheduler states used during training.
The exact corpus used for this run is not documented in this README. Use the matching experiment configuration or recipe alongside this checkpoint if you need the original data provenance, preprocessing, or evaluation protocol.
Limitations
- Output quality depends on using the same or compatible unit extractor and preprocessing pipeline used during training.
- This checkpoint is not guaranteed to generalize well to out-of-domain speakers, recording conditions, or unit tokenizers.
- The README does not report a formal benchmark table, so treat this as a model artifact description rather than an evaluation report.
Loading
In SpeechBrain, this checkpoint is typically loaded with speechbrain.inference.vocoders.UnitHIFIGAN using the checkpoint directory as the source.
Training command used by the recipe (example only):
python ./recipes/Euskara/TTS/vocoder/hifigan_discrete/train_spk.py \
./recipes/Euskara/TTS/vocoder/hifigan_discrete/hparams/train_spk_sonora_2.yaml \
--data_folder=/data/aholab/tts/eu/hifigan_spk/
Inference command used by the recipe:
python ./recipes/Euskara/TTS/vocoder/hifigan_discrete/infer.py \
--input_path /path/to/audio_or_folder \
--hubert_repo utter-project/mHuBERT-147 \
--vocoder_repo /path/to/this/checkpoint \
--kmeans_path kmeans/basque_hubert_k1000_L9.pt \
--spk miren nerea jon
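The --spk flag above takes several speaker names at once. As a rough illustration of how a script could accept this set of flags, here is a hypothetical argparse sketch. It does not reproduce the recipe's infer.py; which flags are required and which have defaults are assumptions made for the example.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown above (not infer.py itself)."""
    parser = argparse.ArgumentParser(description="Unit vocoder inference (sketch)")
    parser.add_argument("--input_path", required=True, help="audio file or folder")
    # Assumption: the unit extractor repo falls back to the one named above.
    parser.add_argument("--hubert_repo", default="utter-project/mHuBERT-147")
    parser.add_argument("--vocoder_repo", required=True, help="checkpoint directory")
    parser.add_argument("--kmeans_path", required=True)
    # nargs="+" lets several speaker names follow a single --spk flag.
    parser.add_argument("--spk", nargs="+", required=True)
    return parser


args = build_parser().parse_args(
    [
        "--input_path", "/path/to/audio_or_folder",
        "--vocoder_repo", "/path/to/this/checkpoint",
        "--kmeans_path", "kmeans/basque_hubert_k1000_L9.pt",
        "--spk", "miren", "nerea", "jon",
    ]
)
print(args.spk)  # ['miren', 'nerea', 'jon']
```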
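Example of extracting discrete codes with the K-means model:
import joblib
import numpy as np
kmeans = joblib.load("/scratch/mriyadh/speechbrain/models/kmeans/kmeans__utter-project_mhubert-147__K1000__L9.pt")
discrete_codes = np.array(kmeans.predict(features))  # features: 2-D array, one row per frame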
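Conceptually, the predict call assigns each feature frame to its nearest centroid, and the centroid index becomes the discrete unit. The pure-Python sketch below shows that assignment on toy data. It assumes Euclidean distance, as in standard K-means, and the function name and toy values are illustrative only.

```python
def assign_units(features, centroids):
    """Map each feature vector to the index of its closest centroid."""
    units = []
    for vec in features:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [
            sum((x - c) ** 2 for x, c in zip(vec, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


# Toy 2-D data; real inputs would be high-dimensional frame features
# and a much larger codebook.
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_features = [[1.0, 1.0], [9.0, 9.0], [0.0, 0.5]]
print(assign_units(toy_features, toy_centroids))  # [0, 1, 0]
```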
Re-training Steps
- Organize speaker audio files into the following directory structure:
data/
  speaker1/
    audio1.wav
    audio2.wav
  speakerN/
    audio1.wav
    audio2.wav
- Configure hyperparameters by modifying the YAML files in recipes/Euskara/TTS/vocoder/hifigan_discrete/hparams/.
- Start training using the training command shown in the Loading section above.
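Before launching training, it can help to check that the data folder matches the one-directory-per-speaker layout described in the steps above. The following is a minimal sketch with a hypothetical helper name; it is not part of the recipe and only lists .wav files directly inside each speaker folder.

```python
from pathlib import Path


def collect_speaker_wavs(data_root):
    """Return {speaker_name: sorted list of .wav paths} for a data/ tree."""
    manifest = {}
    for speaker_dir in sorted(Path(data_root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # ignore stray files at the top level
        wavs = sorted(speaker_dir.glob("*.wav"))
        if wavs:  # speakers without any .wav file are left out
            manifest[speaker_dir.name] = wavs
    return manifest
```

Printing the number of files per speaker from the returned dictionary is a quick way to spot empty or misnamed folders.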
Training K-means
To train the K-means model, refer to the kmeans/train.py script in the riyadhrazzaq/llama_omni_asr_tts repository.
Citation
If you use this checkpoint, please cite the relevant SpeechBrain and HiFi-GAN work, and the Unit HiFi-GAN variant if applicable to your experiment.
Suggested references:
- SpeechBrain: https://arxiv.org/abs/2106.04624
- HiFi-GAN: https://arxiv.org/abs/2010.05646
- Unit HiFi-GAN / scalable unit vocoder variant referenced in the implementation: https://arxiv.org/abs/2406.10735
Note: This README was generated using AI.
Evaluation results
- MCD on Euskara TTS from Aholab (self-reported): 4.850