Audio Course documentation

Supplemental reading and resources

This unit provided a hands-on introduction to speech recognition, one of the most popular tasks in the audio domain. Want to learn more? The resources below will help you deepen your understanding of the topics covered.

  • Whisper Talk by Jong Wook Kim: a presentation on the Whisper model by one of its authors, covering the motivation, architecture, training and results
  • End-to-End Speech Benchmark (ESB): a paper that argues for evaluating ASR systems with the orthographic WER rather than the normalised WER, and presents an accompanying benchmark
  • Fine-Tuning Whisper for Multilingual ASR: an in-depth blog post that explains in more detail how the Whisper model works, along with the pre- and post-processing steps performed by the feature extractor and tokenizer
  • Fine-tuning MMS Adapter Models for Multi-Lingual ASR: an end-to-end guide for fine-tuning Meta AI’s new MMS speech recognition models, freezing the base model weights and only fine-tuning a small number of adapter layers
  • Boosting Wav2Vec2 with n-grams in 🤗 Transformers: a blog post on combining CTC models with external language models (LMs) to combat spelling and punctuation errors
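
To make the distinction between orthographic and normalised WER in the ESB entry concrete, here is a minimal, dependency-free sketch. It is not taken from the ESB paper or from any library: `word_error_rate` is a plain word-level edit distance, and `simple_normalise` is a toy normaliser assumed purely for illustration (lowercasing and removing ASCII punctuation), which is far simpler than the normalisers used in practice.

```python
import string


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing from hypothesis
                    curr[j - 1] + 1,     # insertion: extra word in hypothesis
                    prev[j - 1] + cost,  # substitution, or a match when cost is 0
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def simple_normalise(text: str) -> str:
    """Toy normaliser (an assumption for this sketch): lowercase, drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


reference = "Hello, how are you today?"
hypothesis = "hello how are you to day"

# Orthographic WER: casing and punctuation count as errors.
orthographic_wer = word_error_rate(reference, hypothesis)

# Normalised WER: both strings are normalised before scoring.
normalised_wer = word_error_rate(
    simple_normalise(reference), simple_normalise(hypothesis)
)

print(f"orthographic WER: {orthographic_wer:.2f}")  # 0.60
print(f"normalised WER:   {normalised_wer:.2f}")    # 0.40
```

In this example the orthographic score also penalises the missing capital letter and punctuation, whereas the normalised score only reflects the split word. In practice you would more likely use an established WER implementation and a proper text normaliser than hand-rolled functions like these.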