Hands-on exercise

In this unit, we have explored the text-to-speech task, talked about existing datasets and pretrained models, and covered the nuances of fine-tuning SpeechT5 for a new language.

As you’ve seen, fine-tuning models for the text-to-speech task can be challenging in low-resource scenarios. At the same time, evaluating text-to-speech models isn’t easy either.

For these reasons, this hands-on exercise will focus on practicing the skills rather than achieving a certain metric value.

Your objective for this task is to fine-tune SpeechT5 on a dataset of your choosing. You are free to select another language from the same VoxPopuli dataset, or you can pick any other dataset listed in this unit.

Be mindful of the training data size! For training on a free tier GPU from Google Colab, we recommend limiting the training data to about 10-15 hours.
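If you are unsure how much audio your chosen split contains, you can estimate its duration before training and trim it down. Below is a minimal sketch, assuming the Dutch ("nl") configuration of VoxPopuli purely as an illustration; the dataset name, language code, and the 15-hour budget are placeholders for whatever you actually picked.

```python
from datasets import Audio, load_dataset

# Example only: the Dutch ("nl") configuration of VoxPopuli. Replace the
# dataset name and language code with the ones you chose.
dataset = load_dataset("facebook/voxpopuli", "nl", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Keep roughly the first ~15 hours of audio. Note that this decodes every
# clip it visits, so it can take a while on a large split.
budget_seconds = 15 * 3600
total_seconds = 0.0
keep = 0
for example in dataset:
    audio = example["audio"]
    total_seconds += len(audio["array"]) / audio["sampling_rate"]
    keep += 1
    if total_seconds >= budget_seconds:
        break

dataset = dataset.select(range(keep))
print(f"Kept {keep} examples, ~{total_seconds / 3600:.1f} hours of audio")
```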

Once you have completed the fine-tuning process, share your model by uploading it to the Hub. Make sure to tag your model as a text-to-speech model, either by passing the appropriate kwargs when pushing, or in the Hub UI.
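If you trained with a `Seq2SeqTrainer` as in this unit, one way to add the tag is to pass model-card metadata kwargs when pushing. The sketch below assumes a trainer object called `trainer`; the model and dataset names are hypothetical placeholders for your own run.

```python
# Hypothetical example: adjust the names to match your own model and data.
kwargs = {
    "dataset_tags": "facebook/voxpopuli",
    "dataset": "VoxPopuli",
    "language": "nl",
    "model_name": "speecht5_finetuned_voxpopuli_nl",
    "finetuned_from": "microsoft/speecht5_tts",
    "tasks": "text-to-speech",
    "tags": "text-to-speech",
}

# `trainer` is the Seq2SeqTrainer used for fine-tuning; the kwargs end up in
# the auto-generated model card, including the text-to-speech tag.
trainer.push_to_hub(**kwargs)
```

Alternatively, you can open the model card on the Hub after pushing and add the `text-to-speech` tag to its metadata by hand.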

Remember, the primary aim of this exercise is to provide you with ample practice, allowing you to refine your skills and gain a deeper understanding of text-to-speech audio tasks.