#!/usr/bin/env python
# coding: utf-8
# # Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
# In this Colab, we present a step-by-step guide on how to fine-tune Whisper
# for any multilingual ASR dataset using Hugging Face 🤗 Transformers. This is a
# more "hands-on" version of the accompanying [blog post](https://huggingface.co/blog/fine-tune-whisper).
# For a more in-depth explanation of Whisper, the Common Voice dataset and the theory behind fine-tuning, the reader is advised to refer to the blog post.
# ## Introduction
# Whisper is a pre-trained model for automatic speech recognition (ASR)
# published in [September 2022](https://openai.com/blog/whisper/) by the authors
# Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as
# [Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained
# on un-labelled audio data, Whisper is pre-trained on a vast quantity of
# **labelled** audio-transcription data, 680,000 hours to be precise.
# This is an order of magnitude more data than the un-labelled audio data used
# to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this
# pre-training data is multilingual ASR data. This results in checkpoints
# that can be applied to over 96 languages, many of which are considered
# _low-resource_.
#
# When scaled to 680,000 hours of labelled pre-training data, Whisper models
# demonstrate a strong ability to generalise to many datasets and domains.
# The pre-trained checkpoints achieve competitive results to state-of-the-art
# ASR systems, with approximately 3% word error rate (WER) on the test-clean subset of
# LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_c.f._
# Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).
# The extensive multilingual ASR knowledge acquired by Whisper during pre-training
# can be leveraged for other low-resource languages; through fine-tuning, the
# pre-trained checkpoints can be adapted for specific datasets and languages
# to further improve upon these results. We'll show just how Whisper can be fine-tuned
# for low-resource languages in this Colab.
#
# The Whisper checkpoints come in five configurations of varying model sizes.
# The smallest four are trained on either English-only or multilingual data.
# The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints
# are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
# checkpoints are summarised in the following table with links to the models on the Hub:
#
# | Size | Layers | Width | Heads | Parameters | English-only | Multilingual |
# |--------|--------|-------|-------|------------|------------------------------------------------------|---------------------------------------------------|
# | tiny   | 4      | 384   | 6     | 39 M       | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny)    |
# | base   | 6      | 512   | 8     | 74 M       | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)    |
# | small  | 12     | 768   | 12    | 244 M      | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)   |
# | medium | 24     | 1024  | 16    | 769 M      | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium)  |
# | large  | 32     | 1280  | 20    | 1550 M     | x                                                     | [✓](https://huggingface.co/openai/whisper-large)   |
#
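# As a quick, optional sanity check, we can load one of these checkpoints and count its
# parameters; the total should roughly match the table above. The snippet below is purely
# illustrative (and downloads roughly 1 GB of weights for the `small` checkpoint):
# In[ ]:
from transformers import WhisperForConditionalGeneration

# Illustrative only: load the `small` checkpoint and count its parameters.
sanity_check_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
print(f"{sum(p.numel() for p in sanity_check_model.parameters()) / 1e6:.0f}M parameters")
del sanity_check_model  # free the memory again; the model is set up properly further below
#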
# For demonstration purposes, we'll fine-tune the multilingual version of the
# [`"small"`](https://huggingface.co/openai/whisper-small) checkpoint with 244M params (~= 1GB).
# As for our data, we'll train and evaluate our system on a low-resource language
# taken from the [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
# dataset. We'll show that with as little as 8 hours of fine-tuning data, we can achieve
# strong performance in this language.
# ------------------------------------------------------------------------
#
# \\({}^1\\) The name Whisper follows from the acronym “WSPSR”, which stands for “Web-scale Supervised Pre-training for Speech Recognition”.
# ## Load Dataset
# Using 🤗 Datasets, downloading and preparing data is extremely simple.
# We can download and prepare the Common Voice splits in just one line of code.
#
# First, ensure you have accepted the terms of use on the Hugging Face Hub: [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally.
#
# Since Hindi is very low-resource, we'll combine the `train` and `validation`
# splits to give approximately 8 hours of training data. We'll use the 4 hours
# of `test` data as our held-out test set:
# In[1]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# "hi" selects the Hindi subset; `token=True` authenticates with your Hugging Face Hub
# credentials, which are required to download this gated dataset.
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation", token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test", token=True)
print(common_voice)
# Most ASR datasets only provide input audio samples (`audio`) and the
# corresponding transcribed text (`sentence`). Common Voice contains additional
# metadata information, such as `accent` and `locale`, which we can disregard for ASR.
# Keeping the notebook as general as possible, we only consider the input audio and
# transcribed text for fine-tuning, discarding the additional metadata information:
# In[2]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
print(common_voice)
# ## Prepare Feature Extractor, Tokenizer and Data
# The ASR pipeline can be de-composed into three stages:
# 1) A feature extractor which pre-processes the raw audio-inputs
# 2) The model which performs the sequence-to-sequence mapping
# 3) A tokenizer which post-processes the model outputs to text format
#
# In 🤗 Transformers, the Whisper model has an associated feature extractor and tokenizer,
# called [WhisperFeatureExtractor](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperFeatureExtractor)
# and [WhisperTokenizer](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperTokenizer)
# respectively.
#
# We'll go through details for setting-up the feature extractor and tokenizer one-by-one!
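#
# As a rough preview of how the three stages fit together at inference time, here is a
# minimal sketch. It is only illustrative: `waveform` is a stand-in 16 kHz mono input, and
# the proper set-up of each component for fine-tuning is covered step-by-step below.
# In[ ]:
import numpy as np
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperForConditionalGeneration

waveform = np.zeros(16000)  # stand-in input: 1 second of silence sampled at 16 kHz

# Illustrative preview objects; the components used for fine-tuning are configured later.
preview_feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
preview_tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
preview_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# 1) feature extractor: raw audio -> log-Mel spectrogram input features
input_features = preview_feature_extractor(waveform, sampling_rate=16000, return_tensors="pt").input_features
# 2) model: sequence-to-sequence mapping from input features to predicted token ids
predicted_ids = preview_model.generate(input_features)
# 3) tokenizer: post-process the predicted token ids back to text
print(preview_tokenizer.batch_decode(predicted_ids, skip_special_tokens=True))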
# ### Load WhisperFeatureExtractor
# The Whisper feature extractor performs two operations:
# 1. Pads / truncates the audio inputs to 30s: any audio inputs shorter than 30s are padded to 30s with silence (zeros), and those longer than 30s are truncated to 30s
# 2. Converts the audio inputs to _log-Mel spectrogram_ input features, a visual representation of the audio and the form of the input expected by the Whisper model
#
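# To make this concrete, here is a small illustrative check (assuming the feature extractor
# of the `small` checkpoint): whether the input is 1 second or 2 minutes long, the returned
# log-Mel spectrogram always covers a fixed 30s window.
# In[ ]:
import numpy as np
from transformers import WhisperFeatureExtractor

# Illustrative only: demonstrate the pad/truncate behaviour on dummy audio.
demo_feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

short_audio = np.zeros(1 * 16000)    # 1 second of silence at 16 kHz -> padded up to 30s
long_audio = np.zeros(120 * 16000)   # 2 minutes of silence at 16 kHz -> truncated down to 30s

for audio in (short_audio, long_audio):
    features = demo_feature_extractor(audio, sampling_rate=16000).input_features[0]
    print(features.shape)  # (80, 3000): 80 Mel bins x 3000 frames, i.e. 30s at 10 ms per frame
#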