In this unit, we explored the challenges of fine-tuning ASR models, acknowledging the time and resources required to fine-tune a model like Whisper (even a small checkpoint) on a new language. To provide hands-on experience, we have designed an exercise that takes you through the process of fine-tuning an ASR model on a smaller dataset. The main goal of this exercise is to familiarize you with the process; production-level results are not expected. We have intentionally set a lenient target so that you should be able to reach it even with limited resources.

Here are the instructions:

  • Fine-tune the “openai/whisper-tiny” model using the American English (“en-US”) subset of the “PolyAI/minds14” dataset.
  • Use the first 450 examples for training, and the rest for evaluation. Ensure you set num_proc=1 when pre-processing the dataset using the .map method (this will ensure your model is submitted correctly for assessment).
  • To evaluate the model, use the wer and wer_ortho metrics as described in this unit. However, do not convert the metrics into percentages by multiplying by 100 (e.g. if WER is 42%, we’ll expect to see the value 0.42 in this exercise).
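
The data-handling steps above can be sketched as follows. This is a minimal sketch, not the official solution: the helper names `load_minds14_en_us`, `split_first_n` and `preprocess` are hypothetical, and `prepare_fn` stands in for whatever feature-extraction function you write. It assumes the 🤗 Datasets calls `load_dataset`, `Dataset.select` and `Dataset.map`; depending on your installed `datasets` version, loading this dataset may need extra arguments.

```python
def load_minds14_en_us():
    """Load the en-US subset of MINDS-14 (needs the `datasets` package and network access)."""
    # Imported lazily so the helpers below can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("PolyAI/minds14", name="en-US", split="train")


def split_first_n(dataset, n_train=450):
    """Return (train, evaluation): the first n_train rows, then every row after them."""
    train = dataset.select(range(n_train))
    evaluation = dataset.select(range(n_train, len(dataset)))
    return train, evaluation


def preprocess(dataset, prepare_fn):
    """Apply prepare_fn to every example in a single process (num_proc=1)."""
    return dataset.map(prepare_fn, num_proc=1)
```

A typical use would be `train, evaluation = split_first_n(load_minds14_en_us())`, followed by `preprocess(train, prepare_fn)` and `preprocess(evaluation, prepare_fn)`. Selecting by index range keeps the original order, so "the first 450 examples" means exactly rows 0 to 449.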

Once you have fine-tuned a model, make sure to upload it to the 🤗 Hub with the following kwargs:

kwargs = {
    "dataset_tags": "PolyAI/minds14",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
}

You will pass this assignment if your model’s normalised WER (wer) is lower than 0.37.
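
To make the scale of the threshold concrete, here is a small pure-Python sketch of WER reported as a fraction and compared against 0.37. It is a stand-in for illustration only: in the exercise you would use the wer metric described in this unit, and the names `word_error_rate`, `passes` and `PASS_THRESHOLD` are hypothetical. The sketch assumes the texts have already been normalised and simply splits them on whitespace.

```python
PASS_THRESHOLD = 0.37  # a fraction, not a percentage


def word_error_rate(references, predictions):
    """Total word-level edit distance divided by the total number of reference words."""
    total_errors = 0
    total_words = 0
    for ref, pred in zip(references, predictions):
        ref_words, pred_words = ref.split(), pred.split()
        # Dynamic-programming edit distance, keeping only the previous row.
        prev = list(range(len(pred_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            cur = [i]
            for j, pred_word in enumerate(pred_words, start=1):
                cost = 0 if ref_word == pred_word else 1
                cur.append(min(prev[j] + 1,          # deletion
                               cur[j - 1] + 1,       # insertion
                               prev[j - 1] + cost))  # substitution or match
            prev = cur
        total_errors += prev[-1]
        total_words += len(ref_words)
    return total_errors / total_words


def passes(wer):
    """True when the fractional WER is strictly below the pass threshold."""
    return wer < PASS_THRESHOLD


wer = word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
print(round(wer, 3), passes(wer))  # 2 errors over 6 reference words
```

Here one substitution and one deletion over six reference words give a WER of about 0.333, which is below 0.37. The same result expressed as a percentage (33.3) would fail a naive comparison, which is why the value must stay unscaled.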

Feel free to build a demo of your model, and share it on Discord! If you have questions, post them in the #audio-study-group channel.