#!/usr/bin/env python
# coding: utf-8
# # Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
# In this Colab, we present a step-by-step guide on how to fine-tune Whisper
# for any multilingual ASR dataset using Hugging Face 🤗 Transformers. This is a
# more "hands-on" version of the accompanying [blog post](https://huggingface.co/blog/fine-tune-whisper).
# For a more in-depth explanation of Whisper, the Common Voice dataset and the theory behind fine-tuning, the reader is advised to refer to the blog post.
# ## Introduction
# Whisper is a pre-trained model for automatic speech recognition (ASR)
# published in [September 2022](https://openai.com/blog/whisper/) by the authors
# Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as
# [Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained
# on un-labelled audio data, Whisper is pre-trained on a vast quantity of
# **labelled** audio-transcription data, 680,000 hours to be precise.
# This is an order of magnitude more data than the un-labelled audio data used
# to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this
# pre-training data is multilingual ASR data. This results in checkpoints
# that can be applied to over 96 languages, many of which are considered
# _low-resource_.
#
# When scaled to 680,000 hours of labelled pre-training data, Whisper models
# demonstrate a strong ability to generalise to many datasets and domains.
# The pre-trained checkpoints achieve competitive results to state-of-the-art
# ASR systems, with approximately 3% word error rate (WER) on the test-clean subset of
# LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_c.f._
# Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).
# The extensive multilingual ASR knowledge acquired by Whisper during pre-training
# can be leveraged for other low-resource languages; through fine-tuning, the
# pre-trained checkpoints can be adapted for specific datasets and languages
# to further improve upon these results. We'll show just how Whisper can be fine-tuned
# for low-resource languages in this Colab.
#
# The Whisper checkpoints come in five configurations of varying model sizes.
# The smallest four are trained on either English-only or multilingual data.
# The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints
# are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
# checkpoints are summarised in the following table with links to the models on the Hub:
#
# | Size | Layers | Width | Heads | Parameters | English-only | Multilingual |
# |--------|--------|-------|-------|------------|------------------------------------------------------|---------------------------------------------------|
# | tiny   | 4      | 384   | 6     | 39 M       | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny)    |
# | base   | 6      | 512   | 8     | 74 M       | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)    |
# | small  | 12     | 768   | 12    | 244 M      | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)   |
# | medium | 24     | 1024  | 16    | 769 M      | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium)  |
# | large  | 32     | 1280  | 20    | 1550 M     | x                                                     | [✓](https://huggingface.co/openai/whisper-large)   |
#
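# As a quick, optional sanity check, we can load one of these checkpoints and count its
# parameters; the total should roughly match the table above. The snippet below is purely
# illustrative (and downloads roughly 1 GB of weights for the `small` checkpoint):
# In[ ]:
from transformers import WhisperForConditionalGeneration

# Illustrative only: load the `small` checkpoint and count its parameters.
sanity_check_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
print(f"{sum(p.numel() for p in sanity_check_model.parameters()) / 1e6:.0f}M parameters")
del sanity_check_model  # free the memory again; the model is set up properly further below
#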
# For demonstration purposes, we'll fine-tune the multilingual version of the
# [`"small"`](https://huggingface.co/openai/whisper-small) checkpoint with 244M params (~= 1GB).
# As for our data, we'll train and evaluate our system on a low-resource language
# taken from the [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
# dataset. We'll show that with as little as 8 hours of fine-tuning data, we can achieve
# strong performance in this language.
# ------------------------------------------------------------------------
#
# \\({}^1\\) The name Whisper follows from the acronym “WSPSR”, which stands for “Web-scale Supervised Pre-training for Speech Recognition”.
# ## Load Dataset
# Using 🤗 Datasets, downloading and preparing data is extremely simple.
# We can download and prepare the Common Voice splits in just one line of code.
#
# First, ensure you have accepted the terms of use on the Hugging Face Hub: [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally.
#
# Since Hindi is very low-resource, we'll combine the `train` and `validation`
# splits to give approximately 8 hours of training data. We'll use the 4 hours
# of `test` data as our held-out test set:
# In[1]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# "hi" selects the Hindi subset; `token=True` authenticates with your Hugging Face Hub
# credentials, which are required to download this gated dataset.
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation", token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test", token=True)
print(common_voice)
# Most ASR datasets only provide input audio samples (`audio`) and the
# corresponding transcribed text (`sentence`). Common Voice contains additional
# metadata information, such as `accent` and `locale`, which we can disregard for ASR.
# Keeping the notebook as general as possible, we only consider the input audio and
# transcribed text for fine-tuning, discarding the additional metadata information:
# In[2]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
print(common_voice)
# ## Prepare Feature Extractor, Tokenizer and Data
# The ASR pipeline can be de-composed into three stages:
# 1) A feature extractor which pre-processes the raw audio-inputs
# 2) The model which performs the sequence-to-sequence mapping
# 3) A tokenizer which post-processes the model outputs to text format
#
# In 🤗 Transformers, the Whisper model has an associated feature extractor and tokenizer,
# called [WhisperFeatureExtractor](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperFeatureExtractor)
# and [WhisperTokenizer](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperTokenizer)
# respectively.
#
# We'll go through details for setting-up the feature extractor and tokenizer one-by-one!
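#
# As a rough preview of how the three stages fit together at inference time, here is a
# minimal sketch. It is only illustrative: `waveform` is a stand-in 16 kHz mono input, and
# the proper set-up of each component for fine-tuning is covered step-by-step below.
# In[ ]:
import numpy as np
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperForConditionalGeneration

waveform = np.zeros(16000)  # stand-in input: 1 second of silence sampled at 16 kHz

# Illustrative preview objects; the components used for fine-tuning are configured later.
preview_feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
preview_tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
preview_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# 1) feature extractor: raw audio -> log-Mel spectrogram input features
input_features = preview_feature_extractor(waveform, sampling_rate=16000, return_tensors="pt").input_features
# 2) model: sequence-to-sequence mapping from input features to predicted token ids
predicted_ids = preview_model.generate(input_features)
# 3) tokenizer: post-process the predicted token ids back to text
print(preview_tokenizer.batch_decode(predicted_ids, skip_special_tokens=True))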
# ### Load WhisperFeatureExtractor
# The Whisper feature extractor performs two operations:
# 1. Pads / truncates the audio inputs to 30s: any audio inputs shorter than 30s are padded to 30s with silence (zeros), and those longer than 30s are truncated to 30s
# 2. Converts the audio inputs to _log-Mel spectrogram_ input features, a visual representation of the audio and the form of the input expected by the Whisper model
#
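# To make this concrete, here is a small illustrative check (assuming the feature extractor
# of the `small` checkpoint): whether the input is 1 second or 2 minutes long, the returned
# log-Mel spectrogram always covers a fixed 30s window.
# In[ ]:
import numpy as np
from transformers import WhisperFeatureExtractor

# Illustrative only: demonstrate the pad/truncate behaviour on dummy audio.
demo_feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

short_audio = np.zeros(1 * 16000)    # 1 second of silence at 16 kHz -> padded up to 30s
long_audio = np.zeros(120 * 16000)   # 2 minutes of silence at 16 kHz -> truncated down to 30s

for audio in (short_audio, long_audio):
    features = demo_feature_extractor(audio, sampling_rate=16000).input_features[0]
    print(features.shape)  # (80, 3000): 80 Mel bins x 3000 frames, i.e. 30s at 10 ms per frame
#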