Rakib committed on
Commit
672187c
1 Parent(s): f2c49b1

Training in progress, step 1000

Files changed (30)
  1. .gitignore +1 -0
  2. README.md +1244 -0
  3. added_tokens.json +109 -0
  4. config.json +41 -0
  5. ds_config.json +50 -0
  6. fine-tune-whisper-non-streaming.ipynb +1207 -0
  7. fine-tune-whisper-streaming.ipynb +883 -0
  8. interleave_streaming_datasets.ipynb +358 -0
  9. merges.txt +0 -0
  10. normalizer.json +1742 -0
  11. preprocessor_config.json +0 -0
  12. pytorch_model.bin +3 -0
  13. requirements.txt +9 -0
  14. run.sh +42 -0
  15. run_eval_whisper_streaming.py +162 -0
  16. run_speech_recognition_seq2seq_streaming.py +635 -0
  17. runs/Jan18_12-56-26_mamun-desktop/1674026928.3192365/events.out.tfevents.1674026928.mamun-desktop.5642.1 +3 -0
  18. runs/Jan18_12-56-26_mamun-desktop/events.out.tfevents.1674026928.mamun-desktop.5642.0 +3 -0
  19. runs/Jan18_13-42-16_mamun-desktop/1674028391.1765018/events.out.tfevents.1674028391.mamun-desktop.414182.1 +3 -0
  20. runs/Jan18_13-42-16_mamun-desktop/events.out.tfevents.1674028391.mamun-desktop.414182.0 +3 -0
  21. runs/Jan18_13-54-03_mamun-desktop/1674028489.236082/events.out.tfevents.1674028489.mamun-desktop.550378.1 +3 -0
  22. runs/Jan18_13-54-03_mamun-desktop/events.out.tfevents.1674028489.mamun-desktop.550378.0 +3 -0
  23. runs/Jan18_13-55-30_mamun-desktop/1674028577.9689484/events.out.tfevents.1674028577.mamun-desktop.550597.1 +3 -0
  24. runs/Jan18_13-55-30_mamun-desktop/events.out.tfevents.1674028577.mamun-desktop.550597.0 +3 -0
  25. runs/Jan18_13-56-56_mamun-desktop/1674028663.0261796/events.out.tfevents.1674028663.mamun-desktop.550835.1 +3 -0
  26. runs/Jan18_13-56-56_mamun-desktop/events.out.tfevents.1674028663.mamun-desktop.550835.0 +3 -0
  27. special_tokens_map.json +133 -0
  28. tokenizer_config.json +36 -0
  29. training_args.bin +3 -0
  30. vocab.json +0 -0
.gitignore ADDED
@@ -0,0 +1 @@
1
+ checkpoint-*/
README.md ADDED
@@ -0,0 +1,1244 @@
1
+ # Whisper Fine-Tuning Event 🤗
2
+
3
+ Welcome to the Whisper fine-tuning event 🎙️!
4
+
5
+ For two weeks, we will endeavour to fine-tune the Whisper model to build state-of-the-art speech recognition systems in
6
+ the languages of our choice 🗣. We will work together as a community to achieve this, helping others and learning where
7
+ we can 🤗. If necessary and available, free access to A100 40 GB GPUs will kindly be provided by our cloud compute
8
+ partners, [Lambda](https://lambdalabs.com) 🚀.
9
+
10
+ This document summarises all the relevant information required for the event 📋. Please read it thoroughly
11
+ and make sure to:
12
+ - Sign up using the [Google form](https://forms.gle/F2bpouvhDpKKisM39)
13
+ - Join the [Hugging Face Discord server](https://hf.co/join/discord) and make sure to assign yourself the **@ml-4-audio** role in #role-assignment so that you can access the #events channel.
14
+
15
+ ## Table of Contents
16
+
17
+ - [Introduction](#introduction)
18
+ - [Important Dates](#important-dates)
19
+ - [Launch a Lambda Cloud GPU](#launch-a-lambda-cloud-gpu)
20
+ - [Set Up an Environment](#set-up-an-environment)
21
+ - [Data and Pre-Processing](#data-and-pre-processing)
22
+ - [Fine-Tune a Whisper Model](#fine-tune-whisper)
23
+ - [Evaluation](#evaluation)
24
+ - [Building a Demo](#building-a-demo)
25
+ - [Communication and Problems](#communication-and-problems)
26
+ - [Talks](#talks)
27
+ - [Tips and Tricks](#tips-and-tricks)
28
+ - [Feedback](#feedback)
29
+
30
+ ## Introduction
31
+ Whisper is a pre-trained model for automatic speech recognition (ASR) published in [September 2022](https://openai.com/blog/whisper/)
32
+ by the authors Radford et al. from OpenAI. Pre-trained on 680,000 hours of labelled data, it demonstrates a strong ability
33
+ to generalise to different datasets and domains. Through fine-tuning, the performance of this model can be significantly
34
+ boosted for a given language.
35
+
36
+ In this event, we're bringing the community together to fine-tune Whisper in as many languages as possible. Our aim is
37
+ to achieve state-of-the-art results on the languages spoken by the community. Together, we can democratise speech recognition
38
+ for all.
39
+
40
+ We are providing training scripts, notebooks, blog posts, talks and compute (where available), so you have all the
41
+ resources you need to participate! You are free to choose your level of participation, from using the template script and setting
42
+ it to your language, right the way through to exploring advanced training methods. We encourage you to participate to
43
+ the level that suits you best. We'll be on hand to facilitate this!
44
+
45
+ Participants are allowed to fine-tune their systems on the training data of their choice, including datasets from the
46
+ Hugging Face Hub, web-scraped data from the internet, or private datasets. Whisper models will be evaluated
47
+ on the "test" split of the [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
48
+ dataset for the participant's chosen language.
49
+
50
+ We believe that framing the event as a competition is fun! But at the core, the event is about
51
+ fine-tuning Whisper in as many languages as possible as a community. We want to foster an environment where we
52
+ work together, help each other solve bugs, share important findings and ultimately learn something new.
53
+
54
+ This README contains all the information you need for the event. It is structured such that you can read it sequentially,
55
+ section-by-section. **We recommend that you read the document once from start to finish before running any code.** This will
56
+ give you an idea of where to look for the relevant information and an idea of how the event is going to run.
57
+
58
+ ## Important Dates
59
+
60
+ - *Introduction Talk*: 2nd December 2022
61
+ - *Sprint start*: 5th December 2022
62
+ - *Speaker Events*: 5th December 2022
63
+ - *Sprint end*: 19th December 2022
64
+ - *Results*: 23rd December 2022
65
+
66
+ ## Launch a Lambda Cloud GPU
67
+ Where possible, we encourage you to fine-tune Whisper on a local GPU machine. This will mean a faster set-up and more
68
+ familiarity with your device. If you are running on a local GPU machine, you can skip ahead to the next section: [Set Up an Environment](#set-up-an-environment).
69
+
70
+ The training scripts can also be run as a notebook through Google Colab. We recommend you train on Google Colab if you
71
+ have a "Colab Pro" or "Pro+" subscription. This is to ensure that you receive a sufficiently powerful GPU on your Colab for
72
+ fine-tuning Whisper. If you wish to fine-tune Whisper through Google Colab, you can skip ahead to the section: [Data and Pre-Processing](#data-and-pre-processing).
73
+
74
+ If you do not have access to a local GPU or Colab Pro/Pro+, we'll endeavour to provide you with a cloud GPU instance.
75
+ We've partnered up with Lambda to provide cloud compute for this event. They'll be providing the latest NVIDIA A100
76
+ 40 GB GPUs, so you'll be loaded with some serious firepower! The Lambda API makes it easy to spin up and launch
77
+ a GPU instance. In this section, we'll go through the steps for spinning up an instance one-by-one.
78
+
79
+ <p align="center" width="100%">
80
+ <img width="50%" src="https://raw.githubusercontent.com/sanchit-gandhi/codesnippets/main/hf_x_lambda.jpg">
81
+ </p>
82
+
83
+ This section is split into three parts:
84
+
85
+ 1. [Signing-Up with Lambda](#signing-up-with-lambda)
86
+ 2. [Creating a Cloud Instance](#creating-a-cloud-instance)
87
+ 3. [Deleting a Cloud Instance](#deleting-a-cloud-instance)
88
+
89
+ ### Signing-Up with Lambda
90
+
91
+ 1. Create an account with Lambda using your email address of choice: https://cloud.lambdalabs.com/sign-up. If you already have an account, skip to step 2.
92
+ 2. Using this same email address, email `cloud@lambdal.com` with the Subject line: `Lambda cloud account for HuggingFace Whisper event - payment authentication and credit request`.
93
+ 3. Each user who emails as above will receive $110 in credits (amounting to 100 hours of 1x A100 usage).
94
+ 4. Register a valid payment method with Lambda in order to redeem the credits (see instructions below).
95
+
96
+ 100 hours of 1x A100 usage should enable you to complete 5-10 fine-tuning runs. To redeem these credits, you will need to
97
+ authorise a valid payment method with Lambda. Provided that you remain within $110 of compute spending, your card **will not**
98
+ be charged 💸. Registering your card with Lambda is a mandatory sign-up step that we unfortunately cannot bypass. But we
99
+ reiterate: you will not be charged provided you remain within $110 of compute spending!
100
+
101
+ Follow steps 1-4 in the next section [Creating a Cloud Instance](#creating-a-cloud-instance) to register your
102
+ card. If you experience issues with registering your card, contact the Lambda team on Discord (see [Communication and Problems](#communication-and-problems)).
103
+
104
+ In order to maximise the free GPU hours you have available for training, we advise that you shut down GPUs when you are
105
+ not using them and closely monitor your GPU usage. We've detailed the steps you can follow to achieve this in [Deleting a Cloud Instance](#deleting-a-cloud-instance).
106
+
107
+ ### Creating a Cloud Instance
108
+ Estimated time to complete: 5 mins
109
+
110
+ *You can also follow our video tutorial to set up a cloud instance on Lambda* 👉️ [YouTube Video](https://www.youtube.com/watch?v=Ndm9CROuk5g&list=PLo2EIpI_JMQtncHQHdHq2cinRVk_VZdGW)
111
+
112
+ 1. Click the link: https://cloud.lambdalabs.com/instances
113
+ 2. You'll be asked to sign in to your Lambda account (if you haven't done so already).
114
+ 3. Once on the GPU instance page, click the purple button "Launch instance" in the top right.
115
+ 4. Verify a payment method if you haven't done so already. IMPORTANT: if you have followed the instructions in the previous section, you will have received $110 in GPU credits. Exceeding 100 hours of 1x A100 usage may incur charges on your credit card. Contact the Lambda team on Discord if you have issues authenticating your payment method (see [Communication and Problems](#communication-and-problems)).
116
+ 5. Launching an instance:
117
+     1. In "Instance type", select the instance type "1x A100 (40 GB SXM4)"
118
+     2. In "Select region", select the region with availability closest to you.
119
+     3. In "Select filesystem", select "Don't attach a filesystem".
120
+ 6. You will be asked to provide your public SSH key. This will allow you to SSH into the GPU device from your local machine.
121
+     1. If you’ve not already created an SSH key pair, you can do so with the following command from your local device:
122
+        ```bash
123
+        ssh-keygen
124
+        ```
125
+     2. You can find your public SSH key using the command:
126
+        ```bash
127
+        cat ~/.ssh/id_rsa.pub
128
+        ```
129
+        (Windows: `type C:\Users\USERNAME\.ssh\id_rsa.pub` where `USERNAME` is the name of your user)
130
+     3. Copy and paste the output of this command into the first text box
131
+     4. Give your SSH key a memorable name (e.g. `sanchits-mbp`)
132
+     5. Click "Add SSH Key"
133
+ 7. Select the SSH key from the drop-down menu and click "Launch instance"
134
+ 8. Read the terms of use and agree
135
+ 9. We can now see on the "GPU instances" page that our device is booting up!
136
+ 10. Once the device status changes to "✅ Running", click on the SSH login ("ssh ubuntu@..."). This will copy the SSH login to your clipboard.
137
+ 11. Now open a new command line window, paste the SSH login, and hit Enter.
138
+ 12. If asked "Are you sure you want to continue connecting?", type "yes" and press Enter.
139
+ 13. Great! You're now SSH'd into your A100 device! We're now ready to set up our Python environment!
140
+
141
+ You can see your total GPU usage from the Lambda cloud interface: https://cloud.lambdalabs.com/usage
142
+
143
+ Here, you can see the total charges that you have incurred since the start of the event. We advise that you check your
144
+ total on a daily basis to make sure that it remains below the credit allocation of $110. This ensures that you are
145
+ not inadvertently charged for GPU hours.
146
+
147
+ If you are unable to SSH into your Lambda GPU in step 11, there is a workaround that you can try. On the [GPU instances page](https://cloud.lambdalabs.com/instances),
148
+ under the column "Cloud IDE", click the button "Launch". This will launch a Jupyter Lab on your GPU which will be displayed in your browser. In the
149
+ top left-hand corner, click "File" -> "New" -> "Terminal". This will open up a new terminal window. You can use this
150
+ terminal window to set up your Python environment in the next section [Set Up an Environment](#set-up-an-environment).
151
+
152
+ ### Deleting a Cloud Instance
153
+
154
+ 100 hours of 1x A100 usage should provide you with enough time for 5-10 fine-tuning runs (depending on how long you train for
155
+ and which model sizes you use). To maximise the GPU time you have for training, we advise that you shut down GPUs over prolonged
156
+ periods of time when they are not in use. Leaving a GPU running accidentally over the weekend will waste 48
157
+ GPU hours. That's nearly half of your compute allocation! So be smart and shut down your GPU when you're not training.
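
The credit arithmetic above can be sketched in a few lines of Python. This is a rough budgeting aid, not an official calculator: the hourly rate is inferred from the figures quoted in this README ($110 for 100 hours of 1x A100 usage), and the function names are our own. Always check the Lambda usage page for your actual charges.

```python
# Figures quoted in this README; the hourly rate is derived from them (an assumption).
CREDIT_USD = 110.0
CREDIT_HOURS = 100.0
HOURLY_RATE_USD = CREDIT_USD / CREDIT_HOURS


def remaining_hours(spent_usd):
    """Return the 1x A100 hours left in the credit allocation after `spent_usd`."""
    return max(CREDIT_USD - spent_usd, 0.0) / HOURLY_RATE_USD


def idle_cost_usd(idle_hours):
    """Return the credit consumed by an instance left running for `idle_hours`."""
    return idle_hours * HOURLY_RATE_USD


weekend_cost = idle_cost_usd(48)
print(f"A forgotten weekend costs about ${weekend_cost:.2f}")
print(f"Hours left afterwards: {remaining_hours(weekend_cost):.1f}")
```

Running this with your own spend from the usage page gives a quick estimate of how many training hours you have left.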
158
+
159
+ Creating an instance and setting it up for the first time may take up to 20 minutes. Subsequently, this process will
160
+ be much faster as you gain familiarity with the steps, so you shouldn't worry about deleting a GPU and spinning one
161
+ up the next time you need one. You can expect to spin up and delete 2-3 GPUs over the course of the fine-tuning event.
162
+
163
+
164
+ We'll quickly run through the steps for deleting a Lambda GPU. You can come back to these steps after you've
165
+ performed your first training run and you want to shut down the GPU:
166
+
167
+ 1. Go to the instances page: https://cloud.lambdalabs.com/instances
168
+ 2. Click the checkbox on the left next to the GPU device you want to delete
169
+ 3. Click the button "Terminate" in the top right-hand side of your screen (under the purple button "Launch instance")
170
+ 4. Type "erase data on instance" in the text box and press "ok"
171
+
172
+ Your GPU device is now deleted and will stop consuming GPU credits.
173
+
174
+ ## Set Up an Environment
175
+ Estimated time to complete: 5 mins
176
+
177
+ *Follow along with our video tutorial detailing the set-up* 👉️ [YouTube Video](https://www.youtube.com/playlist?list=PLo2EIpI_JMQtzC5feNpqQL7eToYKcOxYf)
178
+
179
+ The Whisper model should be fine-tuned using **PyTorch**, **🤗 Transformers**, and **🤗 Datasets**. In this
180
+ section, we'll cover how to set up an environment with the required libraries. This section assumes that you are SSH'd
181
+ into your GPU device. This section does not apply if you are fine-tuning the Whisper model in a Google Colab.
182
+
183
+ If you are returning to this section having read through it previously and want to quickly set up an environment, you
184
+ can do so in one call by executing the following code cell. If this is your first time setting up an environment, we
185
+ recommend you read this section to understand the steps involved.
186
+
187
+ ```bash
188
+ sudo add-apt-repository -y ppa:jonathonf/ffmpeg-4
189
+ sudo apt update
190
+ sudo apt install -y ffmpeg
191
+
192
+ sudo apt-get install git-lfs
193
+
194
+ python3 -m venv hf_env
195
+ source hf_env/bin/activate
196
+ echo "source ~/hf_env/bin/activate" >> ~/.bashrc
197
+
198
+ git clone https://github.com/huggingface/community-events.git
199
+ pip install -r community-events/whisper-fine-tuning-event/requirements.txt
200
+
201
+ git config --global credential.helper store
202
+ huggingface-cli login
203
+ ```
204
+
205
+ ### Unix Libraries
206
+
207
+ First, we need to make sure we have the required NVIDIA drivers installed. We can check that we have these drivers
208
+ through the following command:
209
+
210
+ ```bash
211
+ nvidia-smi
212
+ ```
213
+
214
+ This should print a table with our NVIDIA driver version and CUDA version, and should work out of the box for Lambda GPUs!
215
+ If you get an error running this command, refer to your device manual for installing the required NVIDIA driver.
216
+
217
+ Before installing the required libraries, we need to install and update the Unix package `ffmpeg` to version 4:
218
+
219
+ ```bash
220
+ sudo add-apt-repository -y ppa:jonathonf/ffmpeg-4
221
+ sudo apt update
222
+ sudo apt install -y ffmpeg
223
+ ```
224
+
225
+ We'll also need the package `git-lfs` to push large model weights to the Hugging Face Hub. To check whether
226
+ `git-lfs` is installed, simply run:
227
+
228
+ ```bash
229
+ git-lfs -v
230
+ ```
231
+
232
+ The output should show something like `git-lfs/2.13.2 (GitHub; linux amd64; go 1.15.4)`. If your console states that
233
+ the `git-lfs` command was not found, you can install it via:
234
+
235
+ ```bash
236
+ sudo apt-get install git-lfs
237
+ ```
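
If you prefer to check all three command line tools in one go, the sketch below looks each of them up on the `PATH` using Python's standard library. The helper name `missing_tools` is our own, and the check only confirms that each executable can be found: it does not verify the driver, the CUDA version or that `ffmpeg` is at version 4.

```python
import shutil

# Tools mentioned in this section; adjust the tuple for your own set-up.
REQUIRED_TOOLS = ("nvidia-smi", "ffmpeg", "git-lfs")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools were found on the PATH.")
```

Any tool reported as missing can be installed with the corresponding command from this section.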
238
+
239
+ ### Python Libraries
240
+
241
+ We recommend installing the required libraries in a Python virtual environment. If you're unfamiliar with Python virtual
242
+ environments, check out the [official user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
243
+
244
+ Let's define a variable that denotes the name of the environment we're going to create:
245
+
246
+ ```bash
247
+ env_name=<your-venv-name>
248
+ ```
249
+
250
+ We can create a virtual environment (venv) with this name using the following command:
251
+
252
+ ```bash
253
+ python3 -m venv $env_name
254
+ ```
255
+
256
+ We'll instruct our bash shell to activate the venv by default by placing the venv source command in `.bashrc`:
257
+
258
+ ```bash
259
+ echo "source ~/$env_name/bin/activate" >> ~/.bashrc
260
+ ```
261
+
262
+ Re-launching the bash shell will activate the venv:
263
+
264
+ ```bash
265
+ bash
266
+ ```
267
+
268
+ Great! We can see that our venv name is at the start of our command line - this means that we're operating from
269
+ within the venv. We can now go ahead and start installing the required Python packages to our venv.
270
+
271
+ The [`requirements.txt`](https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/requirements.txt)
272
+ file in this directory has all the necessary Python packages we need to fine-tune Whisper, including PyTorch, Transformers
273
+ and Datasets. We'll install all the packages in this file through one `pip install` command.
274
+
275
+ First, let's clone the `community-events` repository to our device:
276
+
277
+ ```bash
278
+ git clone https://github.com/huggingface/community-events.git
279
+ ```
280
+
281
+ Now we can install the packages from the `requirements.txt` file using the following command:
282
+
283
+ ```bash
284
+ pip install -r community-events/whisper-fine-tuning-event/requirements.txt
285
+ ```
286
+
287
+ Note: when installing packages, you might see warnings such as:
288
+
289
+ ```bash
290
+ error: invalid command 'bdist_wheel'
291
+ ----------------------------------------
292
+ ERROR: Failed building wheel for audioread
293
+ ```
294
+
295
+ This is perfectly ok! It does not affect our installation.
296
+
297
+ We can check that the above steps installed the correct version of PyTorch to match our CUDA version. The following command should print `True`:
298
+
299
+ ```bash
300
+ python -c "import torch; print(torch.cuda.is_available())"
301
+ ```
302
+
303
+ If the above command does not return True, refer to the [official instructions](https://pytorch.org/get-started/locally/) for installing PyTorch.
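
If the check fails, a little more detail can help narrow down the cause. The following is a minimal diagnostic sketch (the function name `describe_torch` is our own): it reports whether PyTorch is installed, which CUDA version the build targets, and whether a GPU is visible. It assumes nothing beyond a working Python interpreter and returns early if PyTorch is not installed.

```python
import importlib.util


def describe_torch():
    """Summarise the local PyTorch install without failing if it is absent."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    info = {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built for; None for CPU-only builds
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch())
```

A `built_for_cuda` value of `None` suggests that a CPU-only build of PyTorch was installed, in which case re-installing following the official instructions is the likely fix.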
304
+
305
+ We can now verify that `transformers` and `datasets` have been correctly installed. First, launch a Python shell:
306
+
307
+ ```bash
308
+ python
309
+ ```
310
+
311
+ Running the following code cell will load one sample of the [Common Voice](https://huggingface.co/datasets/common_voice)
312
+ dataset from the Hugging Face Hub and perform a forward pass of the "tiny" Whisper model:
313
+
314
+ ```python
315
+ import torch
316
+ from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration
317
+ from datasets import load_dataset
318
+
319
+ model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
320
+ feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
321
+
322
+ common_voice = load_dataset("common_voice", "en", split="validation", streaming=True)
323
+
324
+ inputs = feature_extractor(next(iter(common_voice))["audio"]["array"], sampling_rate=16000, return_tensors="pt")
325
+ input_features = inputs.input_features
326
+
327
+ decoder_input_ids = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id
328
+ logits = model(input_features, decoder_input_ids=decoder_input_ids).logits
329
+
330
+ print("Environment set up successful?", logits.shape[-1] == 51865)
331
+
332
+ ```
333
+
334
+ If the final check returns True, the libraries have been installed correctly. Finally, exit the Python shell:
335
+
336
+ ```python
337
+ quit()
338
+ ```
339
+
340
+ The last thing we need to do is link our Hugging Face account. Run the command:
341
+
342
+ ```bash
343
+ git config --global credential.helper store
344
+ huggingface-cli login
345
+ ```
346
+
347
+ And then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not have
348
+ one already. You should make sure that this token has "write" privileges.
349
+
350
+ ## Data and Pre-Processing
351
+
352
+ In this section, we will cover how to find suitable training data and the necessary steps to pre-process it.
353
+ If you are new to the 🤗 Datasets library, we highly recommend reading the comprehensive blog post: [A Complete Guide To Audio Datasets](https://huggingface.co/blog/audio-datasets).
354
+ This blog post will tell you everything you need to know about 🤗 Datasets and its one-line API.
355
+
356
+ ### Data
357
+
358
+ Whisper models will be evaluated on the `"test"` split of the [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
359
+ dataset. Any data can be used to fine-tune the Whisper model **except Common Voice's `"test"` split**. This exception
360
+ extends to all Common Voice versions, as the test split of legacy Common Voice releases often overlaps with the
361
+ latest one. For instance, the test split of Common Voice 10 is largely the same as that of Common Voice 11.
362
+
363
+ So, the test data:
364
+
365
+ ```python
366
+ load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", use_auth_token=True)
367
+ ```
368
+
369
+ More or less includes the same data as:
370
+
371
+ ```python
372
+ load_dataset("mozilla-foundation/common_voice_10_0", "en", split="test", use_auth_token=True)
373
+ ```
374
+
375
+ And **neither** is allowed for training purposes. However, we strongly encourage participants to make use of the other
376
+ Common Voice splits for training data, such as the `"train"` and `"validation"` splits:
377
+
378
+ ```python
379
+ load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train", use_auth_token=True)
380
+ ```
381
+
382
+ For most languages, the `"train"` split of the Common Voice 11 dataset offers a reasonable amount of training data.
383
+ For low-resource languages, it is normal procedure to combine the `"train"` and `"validation"` splits to give a larger
384
+ training corpus:
385
+
386
+ ```python
387
+ load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train+validation", use_auth_token=True)
388
+ ```
389
+
390
+ This notation for combining splits (`"split_a+split_b"`) is consistent for all resources in the event. You can combine
391
+ splits in this same way using the fine-tuning scripts in the following section [Fine-Tune Whisper](#fine-tune-whisper).
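
To make the `"split_a+split_b"` notation concrete, here is a small pure-Python sketch. The helper names are hypothetical and are not taken from the event scripts. When streaming, splits are typically loaded one by one and then interleaved (🤗 Datasets provides `interleave_datasets` for this); the round-robin generator below is only a simplified stand-in that works on plain Python iterables, and the library's exact stopping behaviour may differ.

```python
def split_names(split):
    """Turn a string such as "train+validation" into ["train", "validation"]."""
    return [name.strip() for name in split.split("+") if name.strip()]


def interleave_round_robin(*iterables):
    """Yield one item from each iterable in turn, stopping when any runs out."""
    iterators = [iter(iterable) for iterable in iterables]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return


names = split_names("train+validation")
# Toy stand-ins for the samples of each split
toy_splits = {"train": ["t0", "t1", "t2"], "validation": ["v0", "v1"]}
combined = list(interleave_round_robin(*(toy_splits[name] for name in names)))
print(names, combined)
```

With real data, each entry of `toy_splits` would be replaced by the dataset loaded for that split name.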
392
+
393
+ If combining the `"train"` and `"validation"` splits of the Common Voice 11 dataset still gives insufficient training
394
+ data for your language, you can explore using other datasets on the Hub to train your model and try
395
+ [Mixing Datasets](#mixing-datasets-optional) to give larger training splits.
396
+
397
+ ### Streaming Mode
398
+
399
+ Audio datasets are very large. This causes two issues:
400
+ 1. They require a significant amount of **storage space** to download.
401
+ 2. They take a significant amount of **time** to download and process.
402
+
403
+ The storage and time requirements present limitations to most speech researchers. For example, downloading the English
404
+ subset of the Common Voice 11 dataset (2,300 hours) requires upwards of 200GB of disk space and up to several hours
405
+ of download time. For these reasons, we **do not** recommend that you run the following code cell!
406
+ ```python
407
+ from datasets import load_dataset
408
+
409
+ common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "en", use_auth_token=True)
410
+
411
+ # we have to wait several hours until the entire dataset is downloaded before we can access the first sample...
412
+ print(next(iter(common_voice["train"])))
413
+ ```
414
+
415
+ However, both these issues can be solved with 🤗 Datasets. Rather than downloading the whole dataset at once, we
416
+ load individual samples as we cycle over the dataset, in a process called _streaming_. Since the data is loaded
417
+ progressively as we iterate over the dataset, we can get started with a dataset as soon as the first sample is ready.
418
+ This way, we don't have to wait for the entire dataset to download before we can run our code! We are also free of any
419
+ disk space constraints: once we're done with a sample, we discard it and load the next one to memory. This way, we only
420
+ have the data when we need it, and not when we don't!
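
The idea behind streaming can be illustrated without any audio data at all. The toy generator below is purely a conceptual sketch (it is not how 🤗 Datasets is implemented, and the names are our own): it records which samples have been "loaded", showing that only the samples we actually consume are ever touched.

```python
from itertools import islice


def stream_samples(num_samples, loaded):
    """Lazily yield toy samples, recording each index as it is produced."""
    for index in range(num_samples):
        loaded.append(index)  # stands in for downloading one sample
        yield {"id": index, "sentence": f"sample {index}"}


loaded = []
samples = stream_samples(1_000_000, loaded)

first = next(samples)                  # available immediately
next_three = list(islice(samples, 3))  # consume three more samples

print("First sample:", first)
print("Samples loaded so far:", len(loaded))
```

Although the toy dataset nominally contains a million samples, only the handful we iterated over were ever produced.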
421
+
422
+ Streaming is enabled by passing the argument `streaming=True` to the `load_dataset` function. We can then use our
423
+ audio datasets in much the same way as before! For these reasons, **we highly recommend** that you try out the following
424
+ code cell! Just make sure you've accepted the Common Voice 11 [terms of use](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) on the Hugging Face Hub.
425
+
426
+ ```python
427
+ from datasets import load_dataset
428
+
429
+ common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "en", use_auth_token=True, streaming=True)
430
+
431
+ # get the first sample of the dataset straight away!
432
+ print(next(iter(common_voice["train"])))
433
+ ```
434
+
435
+ The examples for this event rely heavily on streaming mode to fine-tune Whisper. With streaming mode, we can use **any
436
+ speech recognition dataset on the Hub with just 20GB of disk space**. As a speech recognition practitioner, this is
437
+ game changing! The largest speech recognition datasets are available to us regardless of our device disk space. We are
438
+ extremely excited to be showcasing streaming mode in this event and hope that you will enjoy using it.
439
+
440
+ There is one caveat to streaming mode. When downloading a dataset to disk, the processed data is saved to our cache. If
441
+ we want to re-use this data, we can directly load the processed data from cache, skipping the download and processing
442
+ steps. Consequently, we only have to perform the downloading and processing operations once. With streaming mode, the
443
+ data is not downloaded to disk. Thus, neither the download nor pre-processing are cached. If we want to re-use the data,
444
+ the streaming steps must be repeated, with the audio files loaded and pre-processed again. Therefore, we recommend not
445
+ using streaming mode if your dataset is small (< 10 hours). In this case, it is faster to download and pre-process the
446
+ dataset in the conventional way once at the start, and then re-use it at each epoch. We provide pointers for disabling
447
+ streaming mode in the section [Fine-Tune Whisper](#fine-tune-whisper).
448
+
449
+ If you want to read more about streaming mode, we
450
+ recommend you check out the aforementioned blog post: [A Complete Guide To Audio Datasets](https://huggingface.co/blog/audio-datasets).
451
+
452
+ ### Pre-Processing
453
+
454
+ Data pre-processing is a very grey area when it comes to speech recognition. In this section, we'll try to make the
455
+ situation as clear as possible for you as participants.
456
+
457
+ The Common Voice dataset is both cased and punctuated:
458
+
459
+ ```python
460
+ print(next(iter(common_voice["train"]))["sentence"])
461
+ ```
462
+ **Print Output:**
463
+ ```
464
+ Why does Melissandre look like she wants to consume Jon Snow on the ride up the wall?
465
+ ```
466
+
467
+ If we train the Whisper model on the raw Common Voice dataset, it will learn to predict casing and punctuation. This is
468
+ great when we want to use our model for actual speech recognition applications, such as transcribing meetings or
469
+ dictation, as the predicted transcriptions will be formatted with casing and punctuation.
470
+
471
+ However, we also have the option of 'normalising' the dataset to remove any casing and punctuation. Normalising the
472
+ dataset makes the speech recognition task easier: the model no longer needs to distinguish between upper and lower case
473
+ characters, or have to predict punctuation from the audio data alone. Because of this, the word error rates are
474
+ naturally lower (meaning the results are better). The Whisper paper demonstrates the drastic effect that normalising
475
+ transcriptions can have on WER results (_c.f._ Section 4.4 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).
476
+ But while we get lower WERs, we can't necessarily use our model in production. The lack of casing and punctuation makes
477
+ the predicted text from the model much harder to read. We would need additional post-processing models to restore casing and
478
+ punctuation in our predictions if we wanted to use it for downstream applications.
479
+
480
+ There is a happy medium between the two: we can train our systems on cased and punctuated transcriptions, and then
481
+ evaluate them on normalised text. This way, we train our systems to predict fully formatted text, but also benefit from
482
+ the WER improvements we get by normalising the transcriptions.
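
To make the effect of normalisation on WER concrete, here is a minimal, standard-library-only sketch. The helper names `word_error_rate` and `simple_normalise` are illustrative assumptions, not part of the event's training scripts, and `simple_normalise` is a simplified stand-in for the Whisper normaliser. The sketch assumes a non-empty reference transcription.

```python
import unicodedata


def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), prediction.split()
    # previous_row[j] holds the edits needed to turn the reference words
    # consumed so far into the first j predicted words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


def simple_normalise(text: str) -> str:
    """Lower-case the text and drop punctuation (simplified, for illustration)."""
    kept = (c for c in text.lower() if not unicodedata.category(c).startswith("P"))
    return " ".join("".join(kept).split())


reference = "Hello, World!"   # cased + punctuated label
prediction = "hello world"    # prediction with the right words, no formatting

raw_wer = word_error_rate(reference, prediction)
normalised_wer = word_error_rate(
    simple_normalise(reference), simple_normalise(prediction)
)
print(f"raw WER: {raw_wer:.2f}, normalised WER: {normalised_wer:.2f}")
```

In this toy case every word counts as an error on the raw text purely because of formatting, while the normalised comparison scores the same prediction as perfect.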
483
+
484
+ The choice of whether you normalise the transcriptions is ultimately down to you. We recommend training on un-normalised
485
+ text and evaluating on normalised text to get the best of both worlds. Since those choices are not always obvious, feel
486
+ free to ask on Discord or (even better) post your question on the [forum](https://discuss.huggingface.co).
487
+
488
+ | Train | Eval | Pros | Cons |
489
+ |---------------|---------------|----------------------------------------------------------------|------------------------------------------|
490
+ | Un-normalised | Un-normalised | * Predict casing + punctuation<br>* One logic for train / eval | * WERs are higher |
491
+ | Un-normalised | Normalised | * Predict casing + punctuation<br>* WERs are lower | * Different logic for train / eval |
492
+ | Normalised | Normalised | * One logic for train / eval<br>* WERs are lower | * No casing / punctuation in predictions |
493
+
494
+ With the provided training scripts, it is trivial to toggle between removing or retaining punctuation and casing,
495
+ requiring changes to at most three lines of code. Switching between the different modes is explained in more detail in the
496
+ following section [Fine-Tune Whisper](#fine-tune-whisper).
497
+
498
+ If you want to find out more about pre- and post-processing for speech recognition, we refer you to
499
+ the paper: [ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition](https://arxiv.org/abs/2210.13352).
500
+
501
+ The following two subsections are optional. They cover how you can mix datasets to form larger training splits and how
502
+ you can use custom data to fine-tune your model. If the Common Voice 11 dataset has sufficient data in your language to
503
+ fine-tune your model, you can skip to the next section [Fine-Tune Whisper](#fine-tune-whisper).
504
+
505
+ ### Mixing Datasets (optional)
506
+
507
+ If the Common Voice 11 dataset contains insufficient training data to fine-tune Whisper in your language, you can explore mixing
508
+ different datasets to create a larger combined training set. Incorporating supplementary training data is almost always beneficial for training.
509
+ The Whisper paper demonstrates the significant effect that increasing the amount of training data can have on downstream
510
+ performance (_c.f._ Section 4.2 of the [paper](https://cdn.openai.com/papers/whisper.pdf)). There are a number of datasets
511
+ that are available on the Hugging Face Hub that can be downloaded via the 🤗 Datasets library in much the same way as
512
+ Common Voice 11.
513
+
514
+ We recommend selecting from the following four datasets on the Hugging Face Hub for multilingual speech recognition:
515
+
516
+ | Dataset | Languages | Casing | Punctuation |
517
+ |-----------------------------------------------------------------------------------------------|-----------|--------|-------------|
518
+ | [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) | 100+ | ✅ | ✅ |
519
+ | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 15 | ❌ | ✅ |
520
+ | [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) | 8         | ❌      | ❌           |
521
+ | [FLEURS](https://huggingface.co/datasets/google/fleurs) | 100+ | ✅ | ✅ |
522
+
523
+
524
+ <!---
525
+ <details>
526
+ <summary>
527
+
528
+ #### Common Voice 11
529
+
530
+ </summary>
531
+
532
+ [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) is a crowd-sourced
533
+ open-licensed speech dataset where speakers record text from Wikipedia in various languages. Since anyone can contribute
534
+ recordings, there is significant variation in both audio quality and speakers. The audio conditions are challenging, with
535
+ recording artefacts, accented speech, hesitations, and the presence of foreign words. The transcriptions are both cased
536
+ and punctuated. As of version 11, there are over 100 languages available, both low and high-resource.
537
+ </details>
538
+ <details>
539
+ <summary>
540
+
541
+ #### VoxPopuli
542
+
543
+ </summary>
544
+
545
+ [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) is a large-scale multilingual speech corpus consisting
546
+ of data sourced from 2009-2020 European Parliament event recordings. Consequently, it occupies the unique domain of
547
+ oratory, political speech, largely sourced from non-native speakers. It contains labelled audio-transcription data for
548
+ 15 European languages. The transcriptions are punctuated but not cased.
549
+ </details>
550
+ <details>
551
+ <summary>
552
+
553
+ #### Multilingual LibriSpeech
554
+
555
+ </summary>
556
+
557
+ [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) is the multilingual
558
+ equivalent of the [LibriSpeech ASR](https://huggingface.co/datasets/librispeech_asr) corpus. It comprises a large corpus
559
+ of read audiobooks taken from the [LibriVox](https://librivox.org/) project, making it a suitable dataset for academic
560
+ research. It contains data split into eight high-resource languages - English, German, Dutch, Spanish, French, Italian,
561
+ Portuguese and Polish. The transcriptions are neither punctuated nor cased.
562
+ </details>
563
+ <details>
564
+ <summary>
565
+
566
+ #### FLEURS
567
+
568
+ </summary>
569
+
570
+ [FLEURS](https://huggingface.co/datasets/google/fleurs) (Few-shot Learning Evaluation of Universal Representations of
571
+ Speech) is a dataset for evaluating speech recognition systems in 102 languages, including many that are classified as
572
+ 'low-resource'. The data is derived from the [FLoRes-101](https://arxiv.org/abs/2106.03193) dataset, a machine
573
+ translation corpus with 3001 sentence translations from English to 101 other languages. Native speakers are recorded
574
+ narrating the sentence transcriptions in their native language. The recorded audio data is paired with the sentence
575
+ transcriptions to yield a multilingual speech recognition dataset over all 101 languages. The training sets contain
576
+ approximately 10 hours of supervised audio-transcription data per language. Transcriptions come in two formats: un-normalised
577
+ (`"raw_transcription"`) and normalised (`"transcription"`).
578
+ </details>
579
+
580
+ The previously mentioned blog post provides a more in-depth explanation of the main English speech recognition,
581
+ multilingual speech recognition and speech translation datasets on the Hub: [A Complete Guide To Audio Datasets](https://huggingface.co/blog/audio-datasets#a-tour-of-audio-datasets-on-the-hub)
582
+
583
+ You can also explore all speech recognition datasets on the Hub to find one suited for your language and needs: [ASR datasets on the Hub](https://huggingface.co/datasets?task_categories=task_categories:automatic-speech-recognition&sort=downloads).
584
+ --->
585
+
586
+ You can try training on these datasets individually, or mix them to form larger train sets.
587
+
588
+ When mixing datasets, you should ensure the transcription format is consistent across datasets. For example, if you mix
589
+ Common Voice 11 (cased + punctuated) with VoxPopuli (un-cased + punctuated), you will need to lower-case **all the text**
590
+ for both training and evaluation, such that the transcriptions are consistent across training samples (un-cased + punctuated).
591
+
592
+ Likewise, if mixing Common Voice 11 (cased + punctuated) with Multilingual LibriSpeech (un-cased + un-punctuated), you
593
+ should make sure to remove all casing and punctuation in **all the text** for both training and evaluation, such that
594
+ all transcriptions are un-cased and un-punctuated for all training samples.
595
+
596
+ Having a mismatch in formatting for different training samples can reduce the final performance of your fine-tuned Whisper
597
+ model.
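
As a rough illustration of the rule above, the sketch below picks the "lowest common denominator" format for a mix of datasets and applies it to a transcription. The casing and punctuation flags are copied from the table earlier in this section; the dictionary keys and the helpers `common_format` and `harmonise` are hypothetical names chosen for this example, not functions from the training scripts.

```python
import unicodedata

# (cased, punctuated) per dataset, taken from the table in this section.
DATASET_FORMATS = {
    "common_voice_11": (True, True),
    "voxpopuli": (False, True),
    "multilingual_librispeech": (False, False),
    "fleurs": (True, True),
}


def common_format(dataset_names):
    """Casing / punctuation survive only if every dataset in the mix has them."""
    keep_casing = all(DATASET_FORMATS[name][0] for name in dataset_names)
    keep_punctuation = all(DATASET_FORMATS[name][1] for name in dataset_names)
    return keep_casing, keep_punctuation


def harmonise(text, keep_casing, keep_punctuation):
    """Bring one transcription into the shared format."""
    if not keep_casing:
        text = text.lower()
    if not keep_punctuation:
        # Drops every Unicode punctuation character, including apostrophes.
        text = "".join(
            c for c in text if not unicodedata.category(c).startswith("P")
        )
    return text


keep_casing, keep_punctuation = common_format(["common_voice_11", "voxpopuli"])
print(harmonise("Hola, Mundo.", keep_casing, keep_punctuation))
```

The same function would be applied to both the training and evaluation transcriptions of every dataset in the mix.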
598
+
599
+ If you want to combine multiple datasets for training, you can refer to the code-snippet provided for interleaving
600
+ datasets with streaming mode: [interleave_streaming_datasets.ipynb](https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/interleave_streaming_datasets.ipynb).
601
+
602
+ ### Custom Data (optional)
603
+
604
+ In addition to publicly available data on the Hugging Face Hub, participants can also make use of their own audio data
605
+ for training. When using your own audio data, please make sure that you **are allowed to use the audio data**. For
606
+ instance, if the audio data is taken from media platforms, such as YouTube, please verify that the media platform and
607
+ the owner of the data have given their approval to use the audio data in the context of machine learning research. If
608
+ you are not sure whether the data you want to use has the appropriate licensing, please contact the Hugging Face team
609
+ on Discord.
610
+
611
+ <!--- TODO: VB - tutorial for adding own data via audio folder --->
612
+
613
+ ## Fine-Tune Whisper
614
+
615
+ Throughout the event, participants are encouraged to leverage the official pre-trained [Whisper checkpoints](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads&search=whisper).
616
+ The Whisper checkpoints come in five configurations of varying model sizes.
617
+ The smallest four are trained on either English-only or multilingual data.
618
+ The largest checkpoint is multilingual only. The checkpoints are summarised in the following table with links to the
619
+ models on the Hugging Face Hub:
620
+
621
+ | Size | Layers | Width | Heads | Parameters | English-only | Multilingual |
622
+ |--------|--------|-------|-------|------------|------------------------------------------------------|---------------------------------------------------|
623
+ | tiny | 4 | 384 | 6 | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |
624
+ | base | 6 | 512 | 8 | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
625
+ | small | 12 | 768 | 12 | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
626
+ | medium | 24 | 1024 | 16 | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
627
+ | large | 32 | 1280 | 20 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |
628
+
629
+ The English-only checkpoints should be used for English speech recognition. For all other languages, one should use the
630
+ multilingual checkpoints.
631
+
632
+ We recommend using the tiny model for rapid prototyping. **We advise that the small or medium checkpoints are used for
633
+ fine-tuning**. These checkpoints achieve comparable performance to the large checkpoint, but can be trained much faster
634
+ (and hence for much longer!).
635
+
636
+ A complete guide to Whisper fine-tuning can be found in the blog post: [Fine-Tune Whisper with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper).
637
+ While it is not necessary to have read this blog post before fine-tuning Whisper, it is strongly advised to gain
638
+ familiarity with the fine-tuning code.
639
+
640
+ There are three ways in which you can execute the fine-tuning code:
641
+ 1. [Python Script](#python-script)
642
+ 2. [Jupyter Notebook](#jupyter-notebook)
643
+ 3. [Google Colab](#google-colab)
644
+
645
+ 1 and 2 are applicable when running on a local GPU or cloud GPU instance (such as on Lambda). 3 applies if you have
646
+ a Google Colab Pro/Pro+ subscription and want to run training in a Google Colab. The following instructions for running
647
+ each of these methods are quite lengthy. Feel free to read through each of them to get a better idea of which one you
648
+ want to use for training. Once you've read through, we advise you pick one method and stick to it!
649
+
650
+ For the walk-through, we'll assume that we're fine-tuning the Whisper model on Spanish ("es") on the Common Voice 11
651
+ dataset. We'll point out where you'll need to change variables to run the script for your language of choice.
652
+
653
+ Before jumping into any training, make sure you've accepted the Common Voice 11 [terms of use](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
654
+ on the Hugging Face Hub.
655
+
656
+ ### Python Script
657
+ *Check out the video tutorial detailing how to fine-tune your Whisper model via the CLI* 👉️ [YouTube Video](https://www.youtube.com/playlist?list=PLo2EIpI_JMQuKpnFm1ntcLKP6gq0l0f1Q)
658
+
659
+ 1. **Create a model repository**
660
+
661
+ The steps for running training with a Python script assume that you are SSH'd into your GPU device and have set up
662
+ your environment according to the previous section [Set Up an Environment](#set-up-an-environment).
663
+
664
+ First, we need to create a model repository on the Hugging Face Hub. This repository will contain all the required files
665
+ to reproduce the training run, alongside model weights, training logs and a README.md card. You can either create a model
666
+ repository directly on the Hugging Face Hub using the link https://huggingface.co/new, or via the CLI. Here, we'll show
667
+ how to use the CLI.
668
+
669
+ Let's pick a name for our fine-tuned Whisper model: *whisper-small-es*. We can run the following command to create a
670
+ repository under this name.
671
+
672
+ ```bash
673
+ huggingface-cli repo create whisper-small-es
674
+ ```
675
+ (change "es" to your language code)
676
+
677
+ We can now see the model on the Hub, *e.g.* under https://huggingface.co/sanchit-gandhi/whisper-small-es
678
+
679
+ Let's clone the repository so that we can place our training script and model weights inside:
680
+
681
+ ```bash
682
+ git lfs install
683
+ git clone https://huggingface.co/sanchit-gandhi/whisper-small-es
684
+ ```
685
+
686
+ (be sure to change the repo address to `https://huggingface.co/<your-user-name>/<your-repo-name>`)
687
+
688
+ We can then enter the repository using the `cd` command:
689
+
690
+ ```bash
691
+ cd whisper-small-es
692
+ ```
693
+
694
+ 2. **Add training script and `run` command**
695
+
696
+ We encourage participants to add all the relevant files for training directly to the model repository. This way,
697
+ training runs are fully reproducible.
698
+
699
+ We provide a Python training script for fine-tuning Whisper with 🤗 Datasets' streaming mode: [`run_speech_recognition_seq2seq_streaming.py`](https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/run_speech_recognition_seq2seq_streaming.py).
700
+ This script can be copied to your model repository with the following command:
701
+
702
+ ```bash
703
+ cp ~/community-events/whisper-fine-tuning-event/run_speech_recognition_seq2seq_streaming.py .
704
+ ```
705
+
706
+ This will place a copy of the training script in your model repository.
707
+
708
+ We can then define the model, training and data arguments for fine-tuning:
709
+
710
+ ```bash
711
+ echo 'python run_speech_recognition_seq2seq_streaming.py \
712
+ --model_name_or_path="openai/whisper-small" \
713
+ --dataset_name="mozilla-foundation/common_voice_11_0" \
714
+ --dataset_config_name="es" \
715
+ --language="spanish" \
716
+ --train_split_name="train+validation" \
717
+ --eval_split_name="test" \
718
+ --model_index_name="Whisper Small Spanish" \
719
+ --max_steps="5000" \
720
+ --output_dir="./" \
721
+ --per_device_train_batch_size="64" \
722
+ --per_device_eval_batch_size="32" \
723
+ --logging_steps="25" \
724
+ --learning_rate="1e-5" \
725
+ --warmup_steps="500" \
726
+ --evaluation_strategy="steps" \
727
+ --eval_steps="1000" \
728
+ --save_strategy="steps" \
729
+ --save_steps="1000" \
730
+ --generation_max_length="225" \
731
+ --length_column_name="input_length" \
732
+ --max_duration_in_seconds="30" \
733
+ --text_column_name="sentence" \
734
+ --freeze_feature_encoder="False" \
735
+ --report_to="tensorboard" \
736
+ --metric_for_best_model="wer" \
737
+ --greater_is_better="False" \
738
+ --load_best_model_at_end \
739
+ --gradient_checkpointing \
740
+ --fp16 \
741
+ --overwrite_output_dir \
742
+ --do_train \
743
+ --do_eval \
744
+ --predict_with_generate \
745
+ --do_normalize_eval \
746
+ --streaming \
747
+ --use_auth_token \
748
+ --push_to_hub' >> run.sh
749
+ ```
750
+
751
+ Make sure to change the `--dataset_config_name` and `--language` to the correct values for your language! See also how
752
+ we combine the train and validation splits as `--train_split_name="train+validation"`. This is recommended for low-resource
753
+ languages (it probably isn't strictly necessary for Spanish, where the `"train"` split for Common Voice 11 contains
754
+ ample training data). We also assign a `"model_index_name"` - a pretty name that will go on the model card. If you are
755
+ training on a very small dataset (< 10 hours), it is advisable to disable streaming mode: `--streaming="False"`.
756
+
757
+ We provide the train/eval batch sizes for the "small" checkpoint fine-tuned on a 1x A100 device. Depending on your device and checkpoint,
758
+ you might need to lower these values. Refer to the subsection [Recommended Training Configurations](#recommended-training-configurations)
759
+ for suggested batch-sizes for other devices and checkpoints.
760
+
761
+ 3. **Launch training 🚀**
762
+
763
+ We recommend running training through a `tmux` session. This means that training won't be interrupted when you close
764
+ your SSH connection. To start a `tmux` session named `mysession`:
765
+
766
+ ```bash
767
+ tmux new -s mysession
768
+ ```
769
+ (if `tmux` is not installed, you can install it through: `sudo apt-get install tmux`)
770
+
771
+ Once in the `tmux` session, we can launch training:
772
+
773
+ ```bash
774
+ bash run.sh
775
+ ```
776
+
777
+ Training should take approximately 8 hours, with a final cross-entropy loss of **1e-4** and word error rate of **32.6%**.
778
+
779
+ Since we're in a `tmux` session, we're free to close our SSH window without stopping training!
780
+
781
+ If you close your SSH connection and want to rejoin the `tmux` window, you can SSH into your GPU and then connect to
782
+ your session with the following command:
783
+
784
+ ```bash
785
+ tmux a -t mysession
786
+ ```
787
+
788
+ It will be like you never left!
789
+
790
+ `tmux` guide: https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/
791
+
792
+ ### Jupyter Notebook
793
+ *We've detailed these steps in a video tutorial to help you get up to speed faster* 👉️ [YouTube Video](https://www.youtube.com/playlist?list=PLo2EIpI_JMQs9z-N4v8L_Jb4KF6kAkylX)
794
+
795
+ 1. **SSH port forwarding**
796
+
797
+ The steps for running training with a Jupyter Notebook assume that you have set up your environment according to the
798
+ previous section [Set Up an Environment](#set-up-an-environment) and are **not** SSH'd into your GPU device. If you are
799
+ SSH'd into your GPU device, you can close this SSH window and start from your local machine.
800
+
801
+ The command to SSH into our GPU looked something like this:
802
+
803
+ ```bash
804
+ ssh ubuntu@104.171.202.236
805
+ ```
806
+
807
+ When running a Jupyter Notebook, we need to "forward" a port from the remote machine to the local one. This amounts
808
+ to adding `-L 8888:localhost:8888` to the end of our SSH command. We can SSH into our remote machine using this modified
809
+ SSH command:
810
+
811
+ ```bash
812
+ ssh ubuntu@104.171.202.236 -L 8888:localhost:8888
813
+ ```
814
+
815
+ Be sure to change the `ssh ubuntu@...` part to your corresponding SSH command; it's simply the `-L 8888:localhost:8888`
816
+ part added onto the end that is new. If you want to find out more about SSH port forwarding, we recommend you read the guide:
817
+ [SSH/OpenSSH/PortForwarding](https://help.ubuntu.com/community/SSH/OpenSSH/PortForwarding).
818
+
819
+ 2. **Create a model repository (copied from previous subsection [Python Script](#python-script))**
820
+
821
+ First, we need to create a model repository on the Hugging Face Hub. This repository will contain all the required files
822
+ to reproduce the training run, alongside model weights, training logs and a README.md card.
823
+
824
+ You can either create a model repository directly on the Hugging Face Hub using the link: https://huggingface.co/new
825
+ Or, via the CLI. Here, we'll show how to use the CLI.
826
+
827
+ Let's pick a name for our fine-tuned Whisper model: *whisper-small-es*. We can run the following command to create a
828
+ repository under this name.
829
+
830
+ ```bash
831
+ huggingface-cli repo create whisper-small-es
832
+ ```
833
+ (change "es" to your language code)
834
+
835
+ We can now see the model on the Hub, *e.g.* under https://huggingface.co/sanchit-gandhi/whisper-small-es
836
+
837
+ Let's clone the repository so that we can place our training script and model weights inside:
838
+
839
+ ```bash
840
+ git lfs install
841
+ git clone https://huggingface.co/sanchit-gandhi/whisper-small-es
842
+ ```
843
+
844
+ (be sure to change the repo address to `https://huggingface.co/<your-user-name>/<your-repo-name>`)
845
+
846
+ We can then enter the repository using the `cd` command:
847
+
848
+ ```bash
849
+ cd whisper-small-es
850
+ ```
851
+
852
+ 3. **Add notebook**
853
+
854
+ We encourage participants to add the training notebook directly to the model repository. This way,
855
+ training runs are fully reproducible.
856
+
857
+ We provide a Jupyter notebook for fine-tuning Whisper with 🤗 Datasets' streaming mode: [`fine-tune-whisper-streaming.ipynb`](https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine-tune-whisper-streaming.ipynb).
858
+ This notebook can be copied to your model repository with the following command:
859
+
860
+ ```bash
861
+ cp ~/community-events/whisper-fine-tuning-event/fine-tune-whisper-streaming.ipynb .
862
+ ```
863
+
864
+ If you are fine-tuning Whisper on a very small dataset (< 10 hours), it is advised that you use the non-streaming notebook
865
+ [`fine-tune-whisper-non-streaming.ipynb`](https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine-tune-whisper-non-streaming.ipynb)
866
+ (see section [Streaming Mode](#streaming-mode)). This notebook can be copied to your model repository with the following
867
+ command:
868
+
869
+ ```bash
870
+ cp ~/community-events/whisper-fine-tuning-event/fine-tune-whisper-non-streaming.ipynb .
871
+ ```
872
+
873
+ 4. **Launch Jupyter**
874
+
875
+ First, we need to make sure `jupyterlab` is installed:
876
+
877
+ ```bash
878
+ pip install jupyterlab
879
+ ```
880
+
881
+ We can then link `jupyter lab` to our venv:
882
+ ```bash
883
+ python -m ipykernel install --user --name=<your-venv-name>
884
+ ```
885
+
886
+ We recommend running training through a `tmux` session. This means that training won't be interrupted when you close
887
+ your SSH connection. To start a `tmux` session named `mysession`:
888
+
889
+ ```bash
890
+ tmux new -s mysession
891
+ ```
892
+ (if `tmux` is not installed, you can install it through: `sudo apt-get install tmux`)
893
+
894
+ Once in the `tmux` session, we can launch `jupyter lab`:
895
+
896
+ ```bash
897
+ jupyter lab --port 8888
898
+ ```
899
+
900
+ 5. **Open Jupyter in browser**
901
+
902
+ Now, this is the hardest step of running training from a Jupyter Notebook! Open a second terminal window on your local
903
+ machine and SSH into your GPU again. This time, it doesn't matter whether we include the `-L 8888:localhost:8888` part;
904
+ the important thing is that you re-enter your GPU device in a new SSH window.
905
+
906
+ Once SSH'd into your GPU, view all running `jupyter lab` sessions:
907
+
908
+ ```bash
909
+ jupyter lab list
910
+ ```
911
+
912
+ Copy the URL for the lab corresponding to port 8888 to your clipboard. It will take the form `http://localhost:8888/?token=...`.
913
+ On your local desktop, open a web browser window (Safari, Firefox, Chrome, etc.). Paste the URL into the browser web
914
+ address bar and press Enter.
915
+
916
+ Voilà! We're now running a Jupyter Notebook on our GPU machine through the web browser on our local device!
917
+
918
+ 6. **Open fine-tuning notebook**
919
+
920
+ We can use the file explorer on the left to go to our model repository and open the Jupyter notebook `fine-tune-whisper-streaming.ipynb`.
921
+ In the top right of the notebook, you'll see a small window that says "Python 3". Clicking on this window will open a
922
+ dropdown menu, from which we can select a Python kernel. Select your venv from this dropdown menu. This will ensure that
923
+ you run the notebook in the venv we previously set up.
924
+
925
+ You can now run this notebook from start to finish and fine-tune the Whisper model as you desire 🤗 The notebook
926
+ contains pointers for where you need to change variables for your language.
927
+
928
+ Since we're operating within a `tmux` session, we're free to close our SSH connection and browser window when we desire.
929
+ Training won't be interrupted by closing this window. However, the notebook will cease to update, so you should make
930
+ sure that training is working before closing the notebook. You can monitor training progress through your model repo
931
+ on the Hugging Face Hub under the "Training Metrics" tab.
932
+
933
+ ### Google Colab
934
+ The Google Colab for fine-tuning Whisper is entirely self-contained. No need to set up an environment or spin up a GPU.
935
+ You can access it through the following link:
936
+
937
+ <a target="_blank" href="https://colab.research.google.com/github/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine_tune_whisper_streaming_colab.ipynb">
938
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
939
+ </a>
940
+
941
+ ### Recommended Training Configurations
942
+
943
+ In this section, we provide guidance for appropriate training and evaluation batch sizes depending on your GPU device.
944
+ Since the Whisper model expects log-Mel input features of a fixed dimension, the GPU memory required by the models is
945
+ the same for audio samples of any length. Thus, these recommendations should stand for all 16/40GB GPU devices. However,
946
+ if you experience out-of-memory errors, we recommend reducing the `per_device_train_batch_size` by factors of 2 and
947
+ increasing the `gradient_accumulation_steps` to compensate.
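
The trade-off described above can be sketched in a few lines. The function names below are illustrative assumptions; only `per_device_train_batch_size` and `gradient_accumulation_steps` correspond to actual training arguments. The idea is that halving the per-device batch size while doubling the accumulation steps leaves the effective batch size per optimiser step unchanged.

```python
def effective_batch_size(per_device_train_batch_size,
                         gradient_accumulation_steps,
                         num_devices=1):
    """Number of samples that contribute to each optimiser step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def halve_batch_size(per_device_train_batch_size, gradient_accumulation_steps):
    """Halve the per-device batch size and double accumulation to compensate."""
    if per_device_train_batch_size % 2:
        raise ValueError("batch size must be even to halve it exactly")
    return per_device_train_batch_size // 2, gradient_accumulation_steps * 2


# Start from a batch size of 64 with no accumulation and back off twice,
# as one might after two out-of-memory errors.
batch_size, grad_acc = 64, 1
for _ in range(2):
    batch_size, grad_acc = halve_batch_size(batch_size, grad_acc)
    print(batch_size, grad_acc, effective_batch_size(batch_size, grad_acc))
```

Each step roughly halves the activation memory needed per forward pass, at the cost of more forward/backward passes per optimiser update.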
948
+
949
+ If you want to explore methods for reducing the memory of the Whisper model, check out the section [Tips and Tricks](#tips-and-tricks).
950
+
951
+ #### V100 / 16 GB GPU
952
+
953
+ | Model | Train Batch Size | Gradient Acc Steps | Eval Batch size |
954
+ |--------|------------------|--------------------|-----------------|
955
+ | small | 16 | 2 | 8 |
956
+ | medium | 2 | 16 | 1 |
957
+
958
+ It is advised to run the "small" checkpoint if training on a V100 device. Running the medium checkpoint will take
959
+ upwards of 12 hours for 5k training steps. We reckon you're better off training the "small" checkpoint for longer!
960
+
961
+ #### A100 / 40GB GPU
962
+
963
+ | Model | Train Batch Size | Gradient Acc Steps | Eval Batch size |
964
+ |--------|------------------|--------------------|-----------------|
965
+ | small | 64 | 1 | 32 |
966
+ | medium | 32 | 1 | 16 |
967
+
968
+ ### Punctuation, Casing and Normalisation
969
+
970
+ When using the Python training script, removing casing for the training data is enabled by passing the flag `--do_lower_case`.
971
+ Removing punctuation in the training data is achieved by passing the flag `--do_remove_punctuation`. Both of these flags
972
+ default to False, and we **do not** recommend setting either of them to True. This will ensure your fine-tuned model
973
+ learns to predict casing and punctuation. Normalisation is only applied during evaluation by setting the flag
974
+ `--do_normalize_eval` (which defaults to True and which we recommend keeping). Normalisation is performed according to the
975
+ 'official' Whisper normaliser. This normaliser applies the following basic standardisation for non-English text:
976
+ 1. Remove any phrases between matching brackets ([, ]).
977
+ 2. Remove any phrases between matching parentheses ((, )).
978
+ 3. Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each character in the NFKC-normalized string starts with M, S, or P.
979
+ 4. Make the text lowercase.
980
+ 5. Replace any successive whitespace characters with a space.
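
The five steps above can be sketched with the Python standard library as follows. This is an illustrative re-implementation that follows the list, not the Whisper normaliser's actual source; the function name `basic_normalise` and the final `strip()` of leading and trailing spaces are assumptions made for this example.

```python
import re
import unicodedata


def basic_normalise(text: str) -> str:
    # 1. Remove phrases between matching brackets.
    text = re.sub(r"\[[^\]]*\]", "", text)
    # 2. Remove phrases between matching parentheses.
    text = re.sub(r"\([^)]*\)", "", text)
    # 3. Replace markers, symbols and punctuation with a space, judged on the
    #    Unicode category (M*, S*, P*) of the NFKC-normalised characters.
    text = unicodedata.normalize("NFKC", text)
    text = "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c for c in text
    )
    # 4. Make the text lowercase.
    text = text.lower()
    # 5. Collapse successive whitespace characters into a single space.
    text = re.sub(r"\s+", " ", text)
    # Assumption for this sketch: also trim leading / trailing spaces.
    return text.strip()


print(basic_normalise("¿Por qué [risas] no (pausa) vienes?"))
```

Applying a function like this to both predictions and references before computing the WER is what evaluating on normalised text amounts to.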
981
+
982
+ Similarly, in the notebooks, removing casing in the training data is enabled by setting the variable `do_lower_case = True`,
983
+ and punctuation by `do_remove_punctuation = True`. We do not recommend setting either of these to True to ensure that
984
+ your model learns to predict casing and punctuation. Thus, they are set to False by default. Normalisation is only
985
+ applied during evaluation by setting the variable `do_normalize_eval=True` (which we do recommend setting).
986
+
987
+ ## Evaluation
988
+
989
+ We'll be running a live leaderboard throughout the event to track the best performing models across all languages. The leaderboard will track your model's performance against *all* the speech recognition models available on the Hub for your chosen language and dataset.
990
+
991
+ You can find the leaderboard [here](https://huggingface.co/spaces/autoevaluate/leaderboards?dataset=common_voice_11_0&only_verified=0&task=automatic-speech-recognition&config=th&split=train%2Bvalidation&metric=wer) 📈.
992
+
993
+ Each participant should evaluate their fine-tuned Whisper checkpoint on the `"test"` split of the Common Voice 11
994
+ dataset for their respective language. For languages that are not part of the Common Voice 11 dataset, please contact
995
+ the organisers on Discord so that we can work together to find suitable evaluation data.
996
+
997
+ We recommend running evaluation during training by setting your eval dataset to the `"test"` split of Common Voice 11.
998
+ We'll also provide you with a standalone evaluation script so that you can test your model after training on Common Voice
999
+ or other datasets of your choice.
1000
+
1001
+ In addition to running evaluation while training, you can now use your Whisper checkpoints to run evaluation on *any* speech recognition dataset on the Hub. The [run_eval_whisper_streaming.py](https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/run_eval_whisper_streaming.py) script loads your Whisper checkpoints, runs batch inference on your specified dataset and returns the WER.
1002
+
1003
+ You can use the script as follows:
1004
+ ```bash
1005
+ python run_eval_whisper_streaming.py --model_id="openai/whisper-tiny" --dataset="google/fleurs" --config="ar_eg" --device=0 --language="ar"
1006
+ ```
1007
+
1008
+ The evaluation script can be customised with the following parameters:
1009
+ 1. `model_id` - Whisper model identifier e.g. `openai/whisper-tiny`
1010
+ 2. `dataset` - Dataset name to evaluate the `model_id` on. Default value: `mozilla-foundation/common_voice_11_0`
1011
+ 3. `config` - Config of the dataset. e.g. `'en'` for the English split of Common Voice
1012
+ 4. `split` - Split of the dataset. Default value: `test`
1013
+ 5. `batch_size` - Number of samples in each streamed batch during inference. Default value: `16`
1014
+ 6. `max_eval_samples` - Max number of samples to be evaluated from the dataset. Set a small number, e.g. `64`, for a quick test run. **Only use this for testing the script.**
1015
+ 7. `streaming` - Whether you'd like to download the entire dataset or stream it during the evaluation. Default value: `True`
1016
+ 8. `language` - Language you want the `model_id` to transcribe the audio in.
1017
+ 9. `device` - The device to run the pipeline on. e.g. `0` for running on GPU 0. Default value: -1 for CPU.
1018
+
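For reference, the WER that the script reports is the word-level edit distance (substitutions + deletions + insertions) between a prediction and its reference, divided by the number of reference words. The snippet below is a minimal pure-Python sketch of that definition. `word_error_rate` is an illustrative helper of our own, not a function from the evaluation script, which may compute the metric differently (for example over the whole dataset at once and after text normalisation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER is not bounded by 1.0: a prediction with many extra words can score higher than that.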
1019
+ ## Building a Demo
1020
+
1021
+ Finally, on to the fun part! Time to sit back and watch the model transcribe audio. We've created a [template Gradio demo](https://huggingface.co/spaces/whisper-event/whisper-demo)
1022
+ that you can use to showcase your fine-tuned Whisper model 📢
1023
+
1024
+ Click the link to duplicate the template demo to your account: https://huggingface.co/spaces/whisper-event/whisper-demo?duplicate=true
1025
+
1026
+ We recommend giving your space a similar name to your fine-tuned model (e.g. `whisper-demo-es`) and setting the visibility
1027
+ to "Public".
1028
+
1029
+ Once you've duplicated the Space to your account, click "Files and versions" -> "app.py" -> "edit". Change the model
1030
+ identifier to your fine-tuned model (line 9). Scroll to the bottom of the page and click "Commit changes to `main`".
1031
+ The demo will reboot, this time using your fine-tuned model. You can share this demo with your friends and family so
1032
+ that they can use the model that you've trained!
1033
+
1034
+ *Check out our video tutorial to get a better understanding 👉️ [YouTube Video](https://www.youtube.com/watch?v=VQYuvl6-9VE)*
1035
+
1036
+ ## Communication and Problems
1037
+
1038
+ If you encounter any problems or have any questions, you should use one of the following platforms
1039
+ depending on your type of problem. Hugging Face is an "open-source-first" organisation, meaning
1040
+ that we'll try to solve all problems in the most public and transparent way possible so that everybody
1041
+ in the community benefits.
1042
+
1043
+ The following list summarises the platform to use for each kind of problem:
1044
+
1045
+ - Problem/question/bug with the 🤗 Datasets library that you think is a general problem that also impacts other people, please open an [Issue on Datasets](https://github.com/huggingface/datasets/issues/new?assignees=&labels=bug&template=bug-report.md&title=) and ping @sanchit-gandhi and @vaibhavs10.
1046
+ - Problem/question/bug with the 🤗 Transformers library that you think is a general problem that also impacts other people, please open an [Issue on Transformers](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title=) and ping @sanchit-gandhi and @vaibhavs10.
1047
+ - Problem/question with a modified, customised training script that is less likely to impact other people, please post your problem/question [on the forum](https://discuss.huggingface.co/) and ping @sanchit-gandhi and @vaibhavs10.
1048
+ - Problem/question regarding access or set up of a Lambda GPU, please ask in the Discord channel **#lambdalabs-infra-support**.
1049
+ - Other questions regarding the event, rules of the event, or if you are unsure where to post your question, please ask in the Discord channel **#events**.
1050
+
1051
+ ## Talks
1052
+
1053
+ We are very excited to be hosting talks from OpenAI, Meta AI and Hugging Face to help you get a better understanding of the Whisper model, the VoxPopuli dataset and details about the fine-tuning event itself!
1054
+
1055
+ | **Speaker** | **Topic** | **Time** | **Video** |
1056
+ |------------------------------|------------------------------------------------------------|-------------------------------|---------------------------------------------------------------------------------------------------------------------------|
1057
+ | Sanchit Gandhi, Hugging Face | Introduction to Whisper Fine-Tuning Event | 15:00 UTC, 2nd December, 2022 | [![Youtube](https://www.youtube.com/s/desktop/f506bd45/img/favicon_32.png)](https://www.youtube.com/watch?v=1cVBLOMlv3w) |
1058
+ | Jong Wook Kim, OpenAI | [Whisper Model](https://cdn.openai.com/papers/whisper.pdf) | 16:30 UTC, 5th December, 2022 | [![Youtube](https://www.youtube.com/s/desktop/f506bd45/img/favicon_32.png)](https://www.youtube.com/watch?v=fZMiD8sDzzg ) |
1059
+ | Changhan Wang, MetaAI | [VoxPopuli Dataset](https://arxiv.org/abs/2101.00390) | 17:30 UTC, 5th December, 2022 | [![Youtube](https://www.youtube.com/s/desktop/f506bd45/img/favicon_32.png)](https://www.youtube.com/watch?v=fZMiD8sDzzg ) |
1060
+
1061
+ ## Tips and Tricks
1062
+
1063
+ We include three memory saving tricks that you can explore to run the fine-tuning scripts with larger batch-sizes and
1064
+ potentially larger checkpoints.
1065
+
1066
+ ### Adam 8bit
1067
+ The [Adam optimiser](https://arxiv.org/abs/1412.6980) keeps two state values (running estimates of the first and second moments of the gradient) for every model parameter, so the memory requirement of the optimiser is
1067
+ **two times** that of the model. You can switch to using an 8bit version of the Adam optimiser from [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes#bitsandbytes).
1068
+ This stores the optimiser states in 8bit precision, saving you a lot of memory and potentially allowing you to run bigger batch sizes.
1070
+ To use Adam 8bit, you first need to pip install `bitsandbytes`:
1071
+
1072
+ ```bash
1073
+ pip install bitsandbytes
1074
+ ```
1075
+
1076
+ Then, set `optim="adamw_bnb_8bit"`, either in your `run.sh` file if running from a Python script, or when you
1077
+ instantiate the `Seq2SeqTrainingArguments` from a Jupyter Notebook or Google Colab:
1078
+
1079
+ ```python
1080
+ from transformers import Seq2SeqTrainingArguments
1081
+
1082
+ training_args = Seq2SeqTrainingArguments(
1083
+ output_dir="./",
1084
+ per_device_train_batch_size=64,
1085
+ gradient_accumulation_steps=1, # increase by 2x for every 2x decrease in batch size
1086
+ learning_rate=1e-5,
1087
+ warmup_steps=500,
1088
+ max_steps=5000,
1089
+ gradient_checkpointing=True,
1090
+ fp16=True,
1091
+ evaluation_strategy="steps",
1092
+ per_device_eval_batch_size=8,
1093
+ predict_with_generate=True,
1094
+ generation_max_length=225,
1095
+ save_steps=1000,
1096
+ eval_steps=1000,
1097
+ logging_steps=25,
1098
+ report_to=["tensorboard"],
1099
+ load_best_model_at_end=True,
1100
+ metric_for_best_model="wer",
1101
+ greater_is_better=False,
1102
+ push_to_hub=True,
1103
+ optim="adamw_bnb_8bit"
1104
+ )
1105
+ ```
1106
+
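To get a feel for the saving, here is a back-of-the-envelope sketch of the optimiser memory for the `small` checkpoint. It assumes 244M parameters, two optimiser values per parameter, 4 bytes per value in 32-bit precision and 1 byte per value in 8bit precision. `optimizer_state_gb` is a hypothetical helper written for this illustration, and the figures ignore any bookkeeping overhead the 8bit optimiser adds, so treat them as rough estimates only.

```python
def optimizer_state_gb(num_params: int, states_per_param: int, bytes_per_state: int) -> float:
    """Approximate optimiser state size in GB (1 GB = 1e9 bytes)."""
    return num_params * states_per_param * bytes_per_state / 1e9


SMALL_PARAMS = 244_000_000  # assumed size of the "small" checkpoint

# Two values per parameter, stored in 32-bit (4 bytes) vs 8bit (1 byte).
adam_fp32 = optimizer_state_gb(SMALL_PARAMS, states_per_param=2, bytes_per_state=4)
adam_8bit = optimizer_state_gb(SMALL_PARAMS, states_per_param=2, bytes_per_state=1)

print(f"Adam (32-bit states): {adam_fp32:.3f} GB")
print(f"Adam (8bit states):   {adam_8bit:.3f} GB")
print(f"Estimated saving:     {adam_fp32 - adam_8bit:.3f} GB")
```

Under these assumptions the optimiser states shrink by a factor of four, which is memory you can spend on a larger batch size.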
1107
+ ### Adafactor
1108
+
1109
+ Rather than using Adam, you can use a different optimiser altogether. Adam requires two optimiser params per model
1110
+ param, but [Adafactor](https://arxiv.org/abs/1804.04235) uses only one. To enable Adafactor, set `optim="adafactor"` in the
1111
+ `Seq2SeqTrainingArguments`. You can expect to double your training batch size when using Adafactor compared to Adam.
1112
+
1113
+ A word of caution: Adafactor is untested for fine-tuning Whisper, so we are unsure how
1114
+ Adafactor performance compares to Adam! Typically, using Adafactor results in **slower convergence** than using Adam or
1115
+ Adam 8bit. For this reason, we recommend Adafactor as an **experimental feature** only.
1116
+
1117
+ ### DeepSpeed
1118
+
1119
+ DeepSpeed is a framework for training larger deep learning models with limited GPU resources by optimising GPU utilisation.
1120
+ We provide implementation details for DeepSpeed ZeRO Stage 2, which partitions the optimiser states (ZeRO stage 1) and gradients
1121
+ (ZeRO stage 2). With DeepSpeed, it is possible to train the medium Whisper checkpoint on a V100, or the large
1122
+ checkpoint on an A100. For more details, we refer you to the blog post by the original authors: [DeepSpeed ZeRO](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/).
1123
+
1124
+ Using DeepSpeed with 🤗 Transformers is straightforward. First, we need to install the packages 🤗 Accelerate and DeepSpeed:
1125
+
1126
+ ```bash
1127
+ pip install -U accelerate deepspeed
1128
+ ```
1129
+
1130
+ The DeepSpeed configuration file specifies precisely what form of optimiser/gradient offloading we are going to perform.
1131
+ To see a substantial improvement on a single GPU with DeepSpeed, your configuration file should contain at least the settings provided
1132
+ in [`ds_config.json`](https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/ds_config.json).
1133
+
1134
+ You can copy the DeepSpeed configuration file to your model repository as follows:
1135
+
1136
+ ```bash
1137
+ cp ~/community-events/whisper-fine-tuning-event/ds_config.json .
1138
+ ```
1139
+
1140
+ ### Python Script
1141
+
1142
+ Using DeepSpeed with the Python training script requires two changes to the `run.sh` file. Firstly, we launch the script using `deepspeed`
1143
+ instead of Python. Secondly, we pass the DeepSpeed config `ds_config.json` as a training argument. The remainder of the `run.sh`
1144
+ file takes the same format as using the native Trainer configuration:
1145
+
1146
+ ```bash
1147
+ deepspeed run_speech_recognition_seq2seq_streaming.py \
1148
+ --deepspeed="ds_config.json" \
1149
+ --model_name_or_path="openai/whisper-small" \
1150
+ --dataset_name="mozilla-foundation/common_voice_11_0" \
1151
+ --dataset_config_name="es" \
1152
+ --language="spanish" \
1153
+ --train_split_name="train+validation" \
1154
+ --eval_split_name="test" \
1155
+ --model_index_name="Whisper Small Spanish" \
1156
+ --max_steps="5000" \
1157
+ --output_dir="./" \
1158
+ --per_device_train_batch_size="64" \
1159
+ --per_device_eval_batch_size="32" \
1160
+ --logging_steps="25" \
1161
+ --learning_rate="1e-5" \
1162
+ --warmup_steps="500" \
1163
+ --evaluation_strategy="steps" \
1164
+ --eval_steps="1000" \
1165
+ --save_strategy="steps" \
1166
+ --save_steps="1000" \
1167
+ --generation_max_length="225" \
1168
+ --length_column_name="input_length" \
1169
+ --max_duration_in_seconds="30" \
1170
+ --text_column_name="sentence" \
1171
+ --freeze_feature_encoder="False" \
1172
+ --report_to="tensorboard" \
1173
+ --metric_for_best_model="wer" \
1174
+ --greater_is_better="False" \
1175
+ --load_best_model_at_end \
1176
+ --gradient_checkpointing \
1177
+ --fp16 \
1178
+ --overwrite_output_dir \
1179
+ --do_train \
1180
+ --do_eval \
1181
+ --predict_with_generate \
1182
+ --do_normalize_eval \
1183
+ --streaming \
1184
+ --use_auth_token \
1185
+ --push_to_hub
1186
+ ```
1187
+
1188
+ ### Jupyter Notebook
1189
+
1190
+ Using DeepSpeed with the template Jupyter Notebooks requires two changes. Firstly, we add the following code cell at the
1191
+ start of the notebook to configure the DeepSpeed environment:
1192
+
1193
+ ```python
1194
+ # DeepSpeed requires a distributed environment even when only one process is used.
1195
+ # This emulates a launcher in the notebook
1196
+ import os
1197
+
1198
+ os.environ["MASTER_ADDR"] = "localhost"
1199
+ os.environ["MASTER_PORT"] = "9994" # modify if RuntimeError: Address already in use
1200
+ os.environ["RANK"] = "0"
1201
+ os.environ["LOCAL_RANK"] = "0"
1202
+ os.environ["WORLD_SIZE"] = "1"
1203
+ ```
1204
+
1205
+ Secondly, we pass the DeepSpeed config file to the training args:
1206
+
1207
+ ```python
1208
+ training_args = Seq2SeqTrainingArguments(..., deepspeed="ds_config.json")
1209
+ ```
1210
+
1211
+ ### Recommended Batch Sizes with DeepSpeed
1212
+
1213
+ Using DeepSpeed, it is possible to fit larger batch sizes and even larger checkpoints on your device, be it a V100 or
1214
+ A100. We provide recommended batch sizes for the three checkpoint sizes of interest for 16GB GPUs and 40GB GPUs. As before,
1215
+ these batch sizes are only indicative: you should tune the batch size depending on your device, checkpoint and language.
1216
+
1217
+ #### V100 / 16 GB GPU
1218
+
1219
+ | Model | Train Batch Size | Gradient Acc Steps | Eval Batch size | Speed |
1220
+ |--------|------------------|--------------------|-----------------|---------|
1221
+ | small | 32 | 1 | 16 | 1.3s/it |
1222
+ | medium | 16 | 1 or 2 | 8 | 2.0s/it |
1223
+ | large | 8 | 2 or 4 | 4 | 3.8s/it |
1224
+
1225
+ #### A100 / 40 GB GPU
1226
+
1227
+ | Model | Train Batch Size | Gradient Acc Steps | Eval Batch size | Speed |
1228
+ |--------|------------------|--------------------|-----------------|---------|
1229
+ | small | 64 | 1 | 32 | 2.3s/it |
1230
+ | medium | 64 | 1 | 32 | 5.8s/it |
1231
+ | large | 32 | 1 or 2 | 16 | 5.9s/it |
1232
+
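The "Gradient Acc Steps" column follows from keeping the effective batch size roughly constant: effective batch size = per-device batch size × gradient accumulation steps × number of devices. The sketch below works out the accumulation steps for a given target. The target of 32 and the helper name `gradient_accumulation_steps` are assumptions made for illustration (32 matches the `small` row on a 16 GB GPU); adjust both to your own setup.

```python
import math


def gradient_accumulation_steps(
    target_effective_batch: int, per_device_batch: int, num_devices: int = 1
) -> int:
    """Smallest number of accumulation steps that reaches the target effective batch size."""
    if target_effective_batch <= 0 or per_device_batch <= 0 or num_devices <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_effective_batch / (per_device_batch * num_devices))


# Halving the per-device batch size doubles the accumulation steps.
for per_device in (32, 16, 8):
    steps = gradient_accumulation_steps(32, per_device)
    print(f"per_device_train_batch_size={per_device} -> gradient_accumulation_steps={steps}")
```

Pass the resulting value as `gradient_accumulation_steps` in the training arguments, alongside the matching `per_device_train_batch_size`.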
1233
+
1234
+ ## Scripts & Colabs
1235
+
1236
+ 1. [Whirlwind tour of Whispering with 🤗Transformers](https://colab.research.google.com/drive/1l290cRv4RdvuLNlSeo9WexByHaNWs3s3?usp=sharing)
1237
+ 2. [8bit inference for Whisper large model (6.5 gig VRAM) 🤯](https://colab.research.google.com/drive/1EMOwwfm1V1fHxH7eT1LLg7yBjhTooB6j?usp=sharing)
1238
+
1239
+ <!--- TODO: VB - Move these colabs to a GitHub repo --->
1240
+
1241
+ ## Feedback
1242
+
1243
+ We would love to get your feedback on the event! If you have a spare ten minutes, we'd appreciate you filling out the
1244
+ feedback form at: https://forms.gle/7hvrTE8NaSdQwwU68
added_tokens.json ADDED
@@ -0,0 +1,109 @@
1
+ {
2
+ "<|af|>": 50327,
3
+ "<|am|>": 50334,
4
+ "<|ar|>": 50272,
5
+ "<|as|>": 50350,
6
+ "<|az|>": 50304,
7
+ "<|ba|>": 50355,
8
+ "<|be|>": 50330,
9
+ "<|bg|>": 50292,
10
+ "<|bn|>": 50302,
11
+ "<|bo|>": 50347,
12
+ "<|br|>": 50309,
13
+ "<|bs|>": 50315,
14
+ "<|ca|>": 50270,
15
+ "<|cs|>": 50283,
16
+ "<|cy|>": 50297,
17
+ "<|da|>": 50285,
18
+ "<|de|>": 50261,
19
+ "<|el|>": 50281,
20
+ "<|endoftext|>": 50257,
21
+ "<|en|>": 50259,
22
+ "<|es|>": 50262,
23
+ "<|et|>": 50307,
24
+ "<|eu|>": 50310,
25
+ "<|fa|>": 50300,
26
+ "<|fi|>": 50277,
27
+ "<|fo|>": 50338,
28
+ "<|fr|>": 50265,
29
+ "<|gl|>": 50319,
30
+ "<|gu|>": 50333,
31
+ "<|haw|>": 50352,
32
+ "<|ha|>": 50354,
33
+ "<|hi|>": 50276,
34
+ "<|hr|>": 50291,
35
+ "<|ht|>": 50339,
36
+ "<|hu|>": 50286,
37
+ "<|hy|>": 50312,
38
+ "<|id|>": 50275,
39
+ "<|is|>": 50311,
40
+ "<|it|>": 50274,
41
+ "<|iw|>": 50279,
42
+ "<|ja|>": 50266,
43
+ "<|jw|>": 50356,
44
+ "<|ka|>": 50329,
45
+ "<|kk|>": 50316,
46
+ "<|km|>": 50323,
47
+ "<|kn|>": 50306,
48
+ "<|ko|>": 50264,
49
+ "<|la|>": 50294,
50
+ "<|lb|>": 50345,
51
+ "<|ln|>": 50353,
52
+ "<|lo|>": 50336,
53
+ "<|lt|>": 50293,
54
+ "<|lv|>": 50301,
55
+ "<|mg|>": 50349,
56
+ "<|mi|>": 50295,
57
+ "<|mk|>": 50308,
58
+ "<|ml|>": 50296,
59
+ "<|mn|>": 50314,
60
+ "<|mr|>": 50320,
61
+ "<|ms|>": 50282,
62
+ "<|mt|>": 50343,
63
+ "<|my|>": 50346,
64
+ "<|ne|>": 50313,
65
+ "<|nl|>": 50271,
66
+ "<|nn|>": 50342,
67
+ "<|nocaptions|>": 50362,
68
+ "<|notimestamps|>": 50363,
69
+ "<|no|>": 50288,
70
+ "<|oc|>": 50328,
71
+ "<|pa|>": 50321,
72
+ "<|pl|>": 50269,
73
+ "<|ps|>": 50340,
74
+ "<|pt|>": 50267,
75
+ "<|ro|>": 50284,
76
+ "<|ru|>": 50263,
77
+ "<|sa|>": 50344,
78
+ "<|sd|>": 50332,
79
+ "<|si|>": 50322,
80
+ "<|sk|>": 50298,
81
+ "<|sl|>": 50305,
82
+ "<|sn|>": 50324,
83
+ "<|so|>": 50326,
84
+ "<|sq|>": 50317,
85
+ "<|sr|>": 50303,
86
+ "<|startoflm|>": 50360,
87
+ "<|startofprev|>": 50361,
88
+ "<|startoftranscript|>": 50258,
89
+ "<|su|>": 50357,
90
+ "<|sv|>": 50273,
91
+ "<|sw|>": 50318,
92
+ "<|ta|>": 50287,
93
+ "<|te|>": 50299,
94
+ "<|tg|>": 50331,
95
+ "<|th|>": 50289,
96
+ "<|tk|>": 50341,
97
+ "<|tl|>": 50348,
98
+ "<|transcribe|>": 50359,
99
+ "<|translate|>": 50358,
100
+ "<|tr|>": 50268,
101
+ "<|tt|>": 50351,
102
+ "<|uk|>": 50280,
103
+ "<|ur|>": 50290,
104
+ "<|uz|>": 50337,
105
+ "<|vi|>": 50278,
106
+ "<|yi|>": 50335,
107
+ "<|yo|>": 50325,
108
+ "<|zh|>": 50260
109
+ }
config.json ADDED
@@ -0,0 +1,41 @@
1
+ {
2
+ "_name_or_path": "openai/whisper-tiny",
3
+ "activation_dropout": 0.0,
4
+ "activation_function": "gelu",
5
+ "architectures": [
6
+ "WhisperForConditionalGeneration"
7
+ ],
8
+ "attention_dropout": 0.0,
9
+ "begin_suppress_tokens": [
10
+ 220,
11
+ 50257
12
+ ],
13
+ "bos_token_id": 50257,
14
+ "d_model": 384,
15
+ "decoder_attention_heads": 6,
16
+ "decoder_ffn_dim": 1536,
17
+ "decoder_layerdrop": 0.0,
18
+ "decoder_layers": 4,
19
+ "decoder_start_token_id": 50258,
20
+ "dropout": 0.0,
21
+ "encoder_attention_heads": 6,
22
+ "encoder_ffn_dim": 1536,
23
+ "encoder_layerdrop": 0.0,
24
+ "encoder_layers": 4,
25
+ "eos_token_id": 50257,
26
+ "forced_decoder_ids": null,
27
+ "init_std": 0.02,
28
+ "is_encoder_decoder": true,
29
+ "max_length": 448,
30
+ "max_source_positions": 1500,
31
+ "max_target_positions": 448,
32
+ "model_type": "whisper",
33
+ "num_hidden_layers": 4,
34
+ "num_mel_bins": 80,
35
+ "pad_token_id": 50257,
36
+ "scale_embedding": false,
37
+ "torch_dtype": "float32",
38
+ "transformers_version": "4.26.0.dev0",
39
+ "use_cache": true,
40
+ "vocab_size": 51865
41
+ }
ds_config.json ADDED
@@ -0,0 +1,50 @@
1
+ {
2
+ "fp16": {
3
+ "enabled": "auto",
4
+ "loss_scale": 0,
5
+ "loss_scale_window": 1000,
6
+ "initial_scale_power": 16,
7
+ "hysteresis": 2,
8
+ "min_loss_scale": 1
9
+ },
10
+
11
+ "optimizer": {
12
+ "type": "AdamW",
13
+ "params": {
14
+ "lr": "auto",
15
+ "betas": "auto",
16
+ "eps": "auto",
17
+ "weight_decay": "auto"
18
+ }
19
+ },
20
+
21
+ "scheduler": {
22
+ "type": "WarmupDecayLR",
23
+ "params": {
24
+ "last_batch_iteration": -1,
25
+ "total_num_steps": "auto",
26
+ "warmup_min_lr": "auto",
27
+ "warmup_max_lr": "auto",
28
+ "warmup_num_steps": "auto"
29
+ }
30
+ },
31
+
32
+ "zero_optimization": {
33
+ "stage": 2,
34
+ "offload_optimizer": {
35
+ "device": "cpu",
36
+ "pin_memory": true
37
+ },
38
+ "allgather_partitions": true,
39
+ "allgather_bucket_size": 2e8,
40
+ "overlap_comm": true,
41
+ "reduce_scatter": true,
42
+ "reduce_bucket_size": 2e8,
43
+ "contiguous_gradients": true
44
+ },
45
+
46
+ "gradient_accumulation_steps": "auto",
47
+ "gradient_clipping": "auto",
48
+ "train_batch_size": "auto",
49
+ "train_micro_batch_size_per_gpu": "auto"
50
+ }
fine-tune-whisper-non-streaming.ipynb ADDED
@@ -0,0 +1,1207 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "75b58048-7d14-4fc6-8085-1fc08c81b4a6",
6
+ "metadata": {
7
+ "id": "75b58048-7d14-4fc6-8085-1fc08c81b4a6"
8
+ },
9
+ "source": [
10
+ "# Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers"
11
+ ]
12
+ },
13
+ {
14
+ "cell_type": "markdown",
15
+ "id": "fbfa8ad5-4cdc-4512-9058-836cbbf65e1a",
16
+ "metadata": {
17
+ "id": "fbfa8ad5-4cdc-4512-9058-836cbbf65e1a"
18
+ },
19
+ "source": [
20
+ "In this Colab, we present a step-by-step guide on how to fine-tune Whisper \n",
21
+ "for any multilingual ASR dataset using Hugging Face 🤗 Transformers. This is a \n",
22
+ "more \"hands-on\" version of the accompanying [blog post](https://huggingface.co/blog/fine-tune-whisper). \n",
23
+ "For a more in-depth explanation of Whisper, the Common Voice dataset and the theory behind fine-tuning, the reader is advised to refer to the blog post."
24
+ ]
25
+ },
26
+ {
27
+ "cell_type": "markdown",
28
+ "id": "afe0d503-ae4e-4aa7-9af4-dbcba52db41e",
29
+ "metadata": {
30
+ "id": "afe0d503-ae4e-4aa7-9af4-dbcba52db41e"
31
+ },
32
+ "source": [
33
+ "## Introduction"
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "markdown",
38
+ "id": "9ae91ed4-9c3e-4ade-938e-f4c2dcfbfdc0",
39
+ "metadata": {
40
+ "id": "9ae91ed4-9c3e-4ade-938e-f4c2dcfbfdc0"
41
+ },
42
+ "source": [
43
+ "Whisper is a pre-trained model for automatic speech recognition (ASR) \n",
44
+ "published in [September 2022](https://openai.com/blog/whisper/) by the authors \n",
45
+ "Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as \n",
46
+ "[Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained \n",
47
+ "on un-labelled audio data, Whisper is pre-trained on a vast quantity of \n",
48
+ "**labelled** audio-transcription data, 680,000 hours to be precise. \n",
49
+ "This is an order of magnitude more data than the un-labelled audio data used \n",
50
+ "to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this \n",
51
+ "pre-training data is multilingual ASR data. This results in checkpoints \n",
52
+ "that can be applied to over 96 languages, many of which are considered \n",
53
+ "_low-resource_.\n",
54
+ "\n",
55
+ "When scaled to 680,000 hours of labelled pre-training data, Whisper models \n",
56
+ "demonstrate a strong ability to generalise to many datasets and domains.\n",
57
+ "The pre-trained checkpoints achieve competitive results to state-of-the-art \n",
58
+ "ASR systems, with near 3% word error rate (WER) on the test-clean subset of \n",
59
+ "LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_c.f._ \n",
60
+ "Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).\n",
61
+ "The extensive multilingual ASR knowledge acquired by Whisper during pre-training \n",
62
+ "can be leveraged for other low-resource languages; through fine-tuning, the \n",
63
+ "pre-trained checkpoints can be adapted for specific datasets and languages \n",
64
+ "to further improve upon these results. We'll show just how Whisper can be fine-tuned \n",
65
+ "for low-resource languages in this Colab."
66
+ ]
67
+ },
68
+ {
69
+ "cell_type": "markdown",
70
+ "id": "e59b91d6-be24-4b5e-bb38-4977ea143a72",
71
+ "metadata": {
72
+ "id": "e59b91d6-be24-4b5e-bb38-4977ea143a72"
73
+ },
74
+ "source": [
75
+ "<figure>\n",
76
+ "<img src=\"https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/whisper_architecture.svg\" alt=\"Trulli\" style=\"width:100%\">\n",
77
+ "<figcaption align = \"center\"><b>Figure 1:</b> Whisper model. The architecture \n",
78
+ "follows the standard Transformer-based encoder-decoder model. A \n",
79
+ "log-Mel spectrogram is input to the encoder. The last encoder \n",
80
+ "hidden states are input to the decoder via cross-attention mechanisms. The \n",
81
+ "decoder autoregressively predicts text tokens, jointly conditional on the \n",
82
+ "encoder hidden states and previously predicted tokens. Figure source: \n",
83
+ "<a href=\"https://openai.com/blog/whisper/\">OpenAI Whisper Blog</a>.</figcaption>\n",
84
+ "</figure>"
85
+ ]
86
+ },
87
+ {
88
+ "cell_type": "markdown",
89
+ "id": "21b6316e-8a55-4549-a154-66d3da2ab74a",
90
+ "metadata": {
91
+ "id": "21b6316e-8a55-4549-a154-66d3da2ab74a"
92
+ },
93
+ "source": [
94
+ "The Whisper checkpoints come in five configurations of varying model sizes.\n",
95
+ "The smallest four are trained on either English-only or multilingual data.\n",
96
+ "The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints \n",
97
+ "are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The \n",
98
+ "checkpoints are summarised in the following table with links to the models on the Hub:\n",
99
+ "\n",
100
+ "| Size | Layers | Width | Heads | Parameters | English-only | Multilingual |\n",
101
+ "|--------|--------|-------|-------|------------|------------------------------------------------------|---------------------------------------------------|\n",
102
+ "| tiny | 4 | 384 | 6 | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |\n",
103
+ "| base | 6 | 512 | 8 | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |\n",
104
+ "| small | 12 | 768 | 12 | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |\n",
105
+ "| medium | 24 | 1024 | 16 | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |\n",
106
+ "| large | 32 | 1280 | 20 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |\n",
107
+ "\n",
108
+ "For demonstration purposes, we'll fine-tune the multilingual version of the \n",
109
+ "[`\"small\"`](https://huggingface.co/openai/whisper-small) checkpoint with 244M params (~= 1GB). \n",
110
+ "As for our data, we'll train and evaluate our system on a low-resource language \n",
111
+ "taken from the [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)\n",
112
+ "dataset. We'll show that with as little as 8 hours of fine-tuning data, we can achieve \n",
113
+ "strong performance in this language."
114
+ ]
115
+ },
116
+ {
117
+ "cell_type": "markdown",
118
+ "id": "3a680dfc-cbba-4f6c-8a1f-e1a5ff3f123a",
119
+ "metadata": {
120
+ "id": "3a680dfc-cbba-4f6c-8a1f-e1a5ff3f123a"
121
+ },
122
+ "source": [
123
+ "------------------------------------------------------------------------\n",
124
+ "\n",
125
+ "\\\\({}^1\\\\) The name Whisper follows from the acronym “WSPSR”, which stands for “Web-scale Supervised Pre-training for Speech Recognition”."
126
+ ]
127
+ },
128
+ {
129
+ "cell_type": "markdown",
130
+ "id": "b219c9dd-39b6-4a95-b2a1-3f547a1e7bc0",
131
+ "metadata": {
132
+ "id": "b219c9dd-39b6-4a95-b2a1-3f547a1e7bc0"
133
+ },
134
+ "source": [
135
+ "## Load Dataset"
136
+ ]
137
+ },
138
+ {
139
+ "cell_type": "markdown",
140
+ "id": "674429c5-0ab4-4adf-975b-621bb69eca38",
141
+ "metadata": {
142
+ "id": "674429c5-0ab4-4adf-975b-621bb69eca38"
143
+ },
144
+ "source": [
145
+ "Using 🤗 Datasets, downloading and preparing data is extremely simple. \n",
146
+ "We can download and prepare the Common Voice splits in just one line of code. \n",
147
+ "\n",
148
+ "First, ensure you have accepted the terms of use on the Hugging Face Hub: [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally.\n",
149
+ "\n",
150
+ "Since Hindi is very low-resource, we'll combine the `train` and `validation` \n",
151
+ "splits to give approximately 8 hours of training data. We'll use the 4 hours \n",
152
+ "of `test` data as our held-out test set:"
153
+ ]
154
+ },
155
+ {
156
+ "cell_type": "code",
157
+ "execution_count": null,
158
+ "id": "a2787582-554f-44ce-9f38-4180a5ed6b44",
159
+ "metadata": {
160
+ "id": "a2787582-554f-44ce-9f38-4180a5ed6b44"
161
+ },
162
+ "outputs": [],
163
+ "source": [
164
+ "from datasets import load_dataset, DatasetDict\n",
165
+ "\n",
166
+ "common_voice = DatasetDict()\n",
167
+ "\n",
168
+ "common_voice[\"train\"] = load_dataset(\"mozilla-foundation/common_voice_11_0\", \"hi\", split=\"train+validation\", use_auth_token=True)\n",
169
+ "common_voice[\"test\"] = load_dataset(\"mozilla-foundation/common_voice_11_0\", \"hi\", split=\"test\", use_auth_token=True)\n",
170
+ "\n",
171
+ "print(common_voice)"
172
+ ]
173
+ },
174
+ {
175
+ "cell_type": "markdown",
176
+ "id": "d5c7c3d6-7197-41e7-a088-49b753c1681f",
177
+ "metadata": {
178
+ "id": "d5c7c3d6-7197-41e7-a088-49b753c1681f"
179
+ },
180
+ "source": [
181
+ "Most ASR datasets only provide input audio samples (`audio`) and the \n",
182
+ "corresponding transcribed text (`sentence`). Common Voice contains additional \n",
183
+ "metadata information, such as `accent` and `locale`, which we can disregard for ASR.\n",
184
+ "Keeping the notebook as general as possible, we only consider the input audio and\n",
185
+ "transcribed text for fine-tuning, discarding the additional metadata information:"
186
+ ]
187
+ },
188
+ {
189
+ "cell_type": "code",
190
+ "execution_count": null,
191
+ "id": "20ba635d-518c-47ac-97ee-3cad25f1e0ce",
192
+ "metadata": {
193
+ "id": "20ba635d-518c-47ac-97ee-3cad25f1e0ce"
194
+ },
195
+ "outputs": [],
196
+ "source": [
197
+ "common_voice = common_voice.remove_columns([\"accent\", \"age\", \"client_id\", \"down_votes\", \"gender\", \"locale\", \"path\", \"segment\", \"up_votes\"])\n",
198
+ "\n",
199
+ "print(common_voice)"
200
+ ]
201
+ },
202
+ {
203
+ "cell_type": "markdown",
204
+ "id": "2d63b2d2-f68a-4d74-b7f1-5127f6d16605",
205
+ "metadata": {
206
+ "id": "2d63b2d2-f68a-4d74-b7f1-5127f6d16605"
207
+ },
208
+ "source": [
209
+ "## Prepare Feature Extractor, Tokenizer and Data"
210
+ ]
211
+ },
212
+ {
213
+ "cell_type": "markdown",
214
+ "id": "601c3099-1026-439e-93e2-5635b3ba5a73",
215
+ "metadata": {
216
+ "id": "601c3099-1026-439e-93e2-5635b3ba5a73"
217
+ },
218
+ "source": [
219
+ "The ASR pipeline can be de-composed into three stages: \n",
220
+ "1) A feature extractor which pre-processes the raw audio-inputs\n",
221
+ "2) The model which performs the sequence-to-sequence mapping \n",
222
+ "3) A tokenizer which post-processes the model outputs to text format\n",
223
+ "\n",
224
+ "In 🤗 Transformers, the Whisper model has an associated feature extractor and tokenizer, \n",
225
+ "called [WhisperFeatureExtractor](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperFeatureExtractor)\n",
226
+ "and [WhisperTokenizer](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperTokenizer) \n",
227
+ "respectively.\n",
228
+ "\n",
229
+ "We'll go through details for setting up the feature extractor and tokenizer one-by-one!"
230
+ ]
231
+ },
232
+ {
233
+ "cell_type": "markdown",
234
+ "id": "560332eb-3558-41a1-b500-e83a9f695f84",
235
+ "metadata": {
236
+ "id": "560332eb-3558-41a1-b500-e83a9f695f84"
237
+ },
238
+ "source": [
239
+ "### Load WhisperFeatureExtractor"
240
+ ]
241
+ },
242
+ {
243
+ "cell_type": "markdown",
244
+ "id": "32ec8068-0bd7-412d-b662-0edb9d1e7365",
245
+ "metadata": {
246
+ "id": "32ec8068-0bd7-412d-b662-0edb9d1e7365"
247
+ },
248
+ "source": [
249
+ "The Whisper feature extractor performs two operations:\n",
250
+ "1. Pads / truncates the audio inputs to 30s: any audio inputs shorter than 30s are padded to 30s with silence (zeros), and those longer than 30s are truncated to 30s\n",
251
+ "2. Converts the audio inputs to _log-Mel spectrogram_ input features, a visual representation of the audio and the form of the input expected by the Whisper model"
252
+ ]
253
+ },
254
+ {
255
+ "cell_type": "markdown",
256
+ "id": "589d9ec1-d12b-4b64-93f7-04c63997da19",
257
+ "metadata": {
258
+ "id": "589d9ec1-d12b-4b64-93f7-04c63997da19"
259
+ },
260
+ "source": [
261
+ "<figure>\n",
262
+ "<img src=\"https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/spectrogram.jpg\" alt=\"Conversion of a sampled audio array to a log-Mel spectrogram\" style=\"width:100%\">\n",
263
+ "<figcaption align = \"center\"><b>Figure 2:</b> Conversion of sampled audio array to log-Mel spectrogram.\n",
264
+ "Left: sampled 1-dimensional audio signal. Right: corresponding log-Mel spectrogram. Figure source:\n",
265
+ "<a href=\"https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html\">Google SpecAugment Blog</a>.\n",
266
+ "</figcaption>\n</figure>"
267
+ ]
268
+ },
269
+ {
270
+ "cell_type": "markdown",
271
+ "id": "b2ef54d5-b946-4c1d-9fdc-adc5d01b46aa",
272
+ "metadata": {
273
+ "id": "b2ef54d5-b946-4c1d-9fdc-adc5d01b46aa"
274
+ },
275
+ "source": [
276
+ "We'll load the feature extractor from the pre-trained checkpoint with the default values:"
277
+ ]
278
+ },
279
+ {
280
+ "cell_type": "code",
281
+ "execution_count": null,
282
+ "id": "bc77d7bb-f9e2-47f5-b663-30f7a4321ce5",
283
+ "metadata": {
284
+ "id": "bc77d7bb-f9e2-47f5-b663-30f7a4321ce5"
285
+ },
286
+ "outputs": [],
287
+ "source": [
288
+ "from transformers import WhisperFeatureExtractor\n",
289
+ "\n",
290
+ "feature_extractor = WhisperFeatureExtractor.from_pretrained(\"openai/whisper-small\")"
291
+ ]
292
+ },
293
+ {
294
+ "cell_type": "markdown",
295
+ "id": "93748af7-b917-4ecf-a0c8-7d89077ff9cb",
296
+ "metadata": {
297
+ "id": "93748af7-b917-4ecf-a0c8-7d89077ff9cb"
298
+ },
299
+ "source": [
300
+ "### Load WhisperTokenizer"
301
+ ]
302
+ },
303
+ {
304
+ "cell_type": "markdown",
305
+ "id": "2bc82609-a9fb-447a-a2af-99597c864029",
306
+ "metadata": {
307
+ "id": "2bc82609-a9fb-447a-a2af-99597c864029"
308
+ },
309
+ "source": [
310
+ "The Whisper model outputs a sequence of _token ids_. The tokenizer maps each of these token ids to its corresponding text string. For Hindi, we can load the pre-trained tokenizer and use it for fine-tuning without any further modifications. We simply have to \n",
311
+ "specify the target language and the task. These arguments inform the \n",
312
+ "tokenizer to prefix the language and task tokens to the start of encoded \n",
313
+ "label sequences:"
314
+ ]
315
+ },
316
+ {
317
+ "cell_type": "code",
318
+ "execution_count": null,
319
+ "id": "c7b07f9b-ae0e-4f89-98f0-0c50d432eab6",
320
+ "metadata": {
321
+ "id": "c7b07f9b-ae0e-4f89-98f0-0c50d432eab6",
322
+ "outputId": "5c004b44-86e7-4e00-88be-39e0af5eed69"
323
+ },
324
+ "outputs": [
325
+ {
326
+ "data": {
327
+ "application/vnd.jupyter.widget-view+json": {
328
+ "model_id": "90d056e20b3e4f14ae0199a1a4ab1bb0",
329
+ "version_major": 2,
330
+ "version_minor": 0
331
+ },
332
+ "text/plain": [
333
+ "Downloading: 0%| | 0.00/829 [00:00<?, ?B/s]"
334
+ ]
335
+ },
336
+ "metadata": {},
337
+ "output_type": "display_data"
338
+ },
339
+ {
340
+ "data": {
341
+ "application/vnd.jupyter.widget-view+json": {
342
+ "model_id": "d82a88daec0e4f14add691b7b903064c",
343
+ "version_major": 2,
344
+ "version_minor": 0
345
+ },
346
+ "text/plain": [
347
+ "Downloading: 0%| | 0.00/1.04M [00:00<?, ?B/s]"
348
+ ]
349
+ },
350
+ "metadata": {},
351
+ "output_type": "display_data"
352
+ },
353
+ {
354
+ "data": {
355
+ "application/vnd.jupyter.widget-view+json": {
356
+ "model_id": "350acdb0f40e454099fa901e66de55f0",
357
+ "version_major": 2,
358
+ "version_minor": 0
359
+ },
360
+ "text/plain": [
361
+ "Downloading: 0%| | 0.00/494k [00:00<?, ?B/s]"
362
+ ]
363
+ },
364
+ "metadata": {},
365
+ "output_type": "display_data"
366
+ },
367
+ {
368
+ "data": {
369
+ "application/vnd.jupyter.widget-view+json": {
370
+ "model_id": "2e6a82a462cc411d90fa1bea4ee60790",
371
+ "version_major": 2,
372
+ "version_minor": 0
373
+ },
374
+ "text/plain": [
375
+ "Downloading: 0%| | 0.00/52.7k [00:00<?, ?B/s]"
376
+ ]
377
+ },
378
+ "metadata": {},
379
+ "output_type": "display_data"
380
+ },
381
+ {
382
+ "data": {
383
+ "application/vnd.jupyter.widget-view+json": {
384
+ "model_id": "c74bfee0198b4817832ea86e8e88d96c",
385
+ "version_major": 2,
386
+ "version_minor": 0
387
+ },
388
+ "text/plain": [
389
+ "Downloading: 0%| | 0.00/2.11k [00:00<?, ?B/s]"
390
+ ]
391
+ },
392
+ "metadata": {},
393
+ "output_type": "display_data"
394
+ },
395
+ {
396
+ "data": {
397
+ "application/vnd.jupyter.widget-view+json": {
398
+ "model_id": "04fb2d81eff646068e10475a08ae42f4",
399
+ "version_major": 2,
400
+ "version_minor": 0
401
+ },
402
+ "text/plain": [
403
+ "Downloading: 0%| | 0.00/2.06k [00:00<?, ?B/s]"
404
+ ]
405
+ },
406
+ "metadata": {},
407
+ "output_type": "display_data"
408
+ }
409
+ ],
410
+ "source": [
411
+ "from transformers import WhisperTokenizer\n",
412
+ "\n",
413
+ "tokenizer = WhisperTokenizer.from_pretrained(\"openai/whisper-small\", language=\"Hindi\", task=\"transcribe\")"
414
+ ]
415
+ },
416
+ {
417
+ "cell_type": "markdown",
418
+ "id": "d2ef23f3-f4a8-483a-a2dc-080a7496cb1b",
419
+ "metadata": {
420
+ "id": "d2ef23f3-f4a8-483a-a2dc-080a7496cb1b"
421
+ },
422
+ "source": [
423
+ "### Combine To Create A WhisperProcessor"
424
+ ]
425
+ },
426
+ {
427
+ "cell_type": "markdown",
428
+ "id": "5ff67654-5a29-4bb8-a69d-0228946c6f8d",
429
+ "metadata": {
430
+ "id": "5ff67654-5a29-4bb8-a69d-0228946c6f8d"
431
+ },
432
+ "source": [
433
+ "To simplify using the feature extractor and tokenizer, we can _wrap_ \n",
434
+ "both into a single `WhisperProcessor` class. This processor object \n",
435
+ "wraps the `WhisperFeatureExtractor` and `WhisperTokenizer`, \n",
436
+ "and can be used on the audio inputs and model predictions as required. \n",
437
+ "In doing so, we only need to keep track of two objects during training: \n",
438
+ "the `processor` and the `model`:"
439
+ ]
440
+ },
441
+ {
442
+ "cell_type": "code",
443
+ "execution_count": null,
444
+ "id": "77d9f0c5-8607-4642-a8ac-c3ab2e223ea6",
445
+ "metadata": {
446
+ "id": "77d9f0c5-8607-4642-a8ac-c3ab2e223ea6"
447
+ },
448
+ "outputs": [],
449
+ "source": [
450
+ "from transformers import WhisperProcessor\n",
451
+ "\n",
452
+ "processor = WhisperProcessor.from_pretrained(\"openai/whisper-small\", language=\"Hindi\", task=\"transcribe\")"
453
+ ]
454
+ },
455
+ {
456
+ "cell_type": "markdown",
457
+ "id": "381acd09-0b0f-4d04-9eb3-f028ac0e5f2c",
458
+ "metadata": {
459
+ "id": "381acd09-0b0f-4d04-9eb3-f028ac0e5f2c"
460
+ },
461
+ "source": [
462
+ "### Prepare Data"
463
+ ]
464
+ },
465
+ {
466
+ "cell_type": "markdown",
467
+ "id": "9649bf01-2e8a-45e5-8fca-441c13637b8f",
468
+ "metadata": {
469
+ "id": "9649bf01-2e8a-45e5-8fca-441c13637b8f"
470
+ },
471
+ "source": [
472
+ "Let's print the first example of the Common Voice dataset to see \n",
473
+ "what form the data is in:"
474
+ ]
475
+ },
476
+ {
477
+ "cell_type": "code",
478
+ "execution_count": null,
479
+ "id": "6e6b0ec5-0c94-4e2c-ae24-c791be1b2255",
480
+ "metadata": {
481
+ "id": "6e6b0ec5-0c94-4e2c-ae24-c791be1b2255"
482
+ },
483
+ "outputs": [],
484
+ "source": [
485
+ "print(common_voice[\"train\"][0])"
486
+ ]
487
+ },
488
+ {
489
+ "cell_type": "markdown",
490
+ "id": "5a679f05-063d-41b3-9b58-4fc9c6ccf4fd",
491
+ "metadata": {
492
+ "id": "5a679f05-063d-41b3-9b58-4fc9c6ccf4fd"
493
+ },
494
+ "source": [
495
+ "Since \n",
496
+ "our input audio is sampled at 48kHz, we need to _downsample_ it to \n",
497
+ "16kHz prior to passing it to the Whisper feature extractor, 16kHz being the sampling rate expected by the Whisper model. \n",
498
+ "\n",
499
+ "We'll set the audio inputs to the correct sampling rate using 🤗 Datasets' \n",
500
+ "[`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cast_column#datasets.DatasetDict.cast_column)\n",
501
+ "method. This operation does not change the audio in-place, \n",
502
+ "but rather signals to `datasets` to resample audio samples _on the fly_ the \n",
503
+ "first time that they are loaded:"
504
+ ]
505
+ },
506
+ {
507
+ "cell_type": "code",
508
+ "execution_count": null,
509
+ "id": "f12e2e57-156f-417b-8cfb-69221cc198e8",
510
+ "metadata": {
511
+ "id": "f12e2e57-156f-417b-8cfb-69221cc198e8"
512
+ },
513
+ "outputs": [],
514
+ "source": [
515
+ "from datasets import Audio\n",
516
+ "\n",
517
+ "common_voice = common_voice.cast_column(\"audio\", Audio(sampling_rate=16000))"
518
+ ]
519
+ },
520
+ {
521
+ "cell_type": "markdown",
522
+ "id": "00382a3e-abec-4cdd-a54c-d1aaa3ea4707",
523
+ "metadata": {
524
+ "id": "00382a3e-abec-4cdd-a54c-d1aaa3ea4707"
525
+ },
526
+ "source": [
527
+ "Re-loading the first audio sample in the Common Voice dataset will resample \n",
528
+ "it to the desired sampling rate:"
529
+ ]
530
+ },
531
+ {
532
+ "cell_type": "code",
533
+ "execution_count": null,
534
+ "id": "87122d71-289a-466a-afcf-fa354b18946b",
535
+ "metadata": {
536
+ "id": "87122d71-289a-466a-afcf-fa354b18946b"
537
+ },
538
+ "outputs": [],
539
+ "source": [
540
+ "print(common_voice[\"train\"][0])"
541
+ ]
542
+ },
543
+ {
544
+ "cell_type": "markdown",
545
+ "id": "3df7378a-a4c0-45d7-8d07-defbd1062ab6",
546
+ "metadata": {},
547
+ "source": [
548
+ "We'll define our pre-processing strategy. We advise that you **do not** lower-case the transcriptions or remove punctuation unless mixing different datasets. This will enable you to fine-tune Whisper models that can predict punctuation and casing. Later, you will see how we can evaluate the predictions without punctuation or casing, so that the models benefit from the WER improvement obtained by normalising the transcriptions while still predicting fully formatted transcriptions."
549
+ ]
550
+ },
551
+ {
552
+ "cell_type": "code",
553
+ "execution_count": null,
554
+ "id": "d041650e-1c48-4439-87b3-5b6f4a514107",
555
+ "metadata": {},
556
+ "outputs": [],
557
+ "source": [
558
+ "from transformers.models.whisper.english_normalizer import BasicTextNormalizer\n",
559
+ "\n",
560
+ "do_lower_case = False\n",
561
+ "do_remove_punctuation = False\n",
562
+ "\n",
563
+ "normalizer = BasicTextNormalizer()"
564
+ ]
565
+ },
566
+ {
567
+ "cell_type": "markdown",
568
+ "id": "89e12c2e-2f14-479b-987b-f0c75c881095",
569
+ "metadata": {},
570
+ "source": [
571
+ "Now we can write a function to prepare our data ready for the model:\n",
572
+ "1. We load and resample the audio data by calling `batch[\"audio\"]`. As explained above, 🤗 Datasets performs any necessary resampling operations on the fly.\n",
573
+ "2. We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.\n",
574
+ "3. We perform any optional pre-processing (lower-case or remove punctuation).\n",
575
+ "4. We encode the transcriptions to label ids through the use of the tokenizer."
576
+ ]
577
+ },
578
+ {
579
+ "cell_type": "code",
580
+ "execution_count": null,
581
+ "id": "c085911c-a10a-41ef-8874-306e0503e9bb",
582
+ "metadata": {},
583
+ "outputs": [],
584
+ "source": [
585
+ "def prepare_dataset(batch):\n",
585
+ "    # load and (possibly) resample audio data to 16kHz\n",
586
+ "    audio = batch[\"audio\"]\n",
587
+ "\n",
588
+ "    # compute log-Mel input features from input audio array\n",
589
+ "    batch[\"input_features\"] = processor.feature_extractor(audio[\"array\"], sampling_rate=audio[\"sampling_rate\"]).input_features[0]\n",
590
+ "    # compute input length of audio sample in seconds\n",
591
+ "    batch[\"input_length\"] = len(audio[\"array\"]) / audio[\"sampling_rate\"]\n",
592
+ "\n",
593
+ "    # optional pre-processing steps\n",
594
+ "    transcription = batch[\"sentence\"]\n",
595
+ "    if do_lower_case:\n",
596
+ "        transcription = transcription.lower()\n",
597
+ "    if do_remove_punctuation:\n",
598
+ "        transcription = normalizer(transcription).strip()\n",
599
+ "\n",
600
+ "    # encode target text to label ids\n",
601
+ "    batch[\"labels\"] = processor.tokenizer(transcription).input_ids\n",
602
+ "    return batch"
604
+ ]
605
+ },
606
+ {
607
+ "cell_type": "markdown",
608
+ "id": "8c960965-9fb6-466f-9dbd-c9d43e71d9d0",
609
+ "metadata": {
610
+ "id": "70b319fb-2439-4ef6-a70d-a47bf41c4a13"
611
+ },
612
+ "source": [
613
+ "We can apply the data preparation function to all of our training examples using 🤗 Datasets' `.map` method. The argument `num_proc` specifies how many CPU cores to use. Setting `num_proc` > 1 will enable multiprocessing. If the `.map` method hangs with multiprocessing, set `num_proc=1` and process the dataset sequentially."
614
+ ]
615
+ },
616
+ {
617
+ "cell_type": "code",
618
+ "execution_count": null,
619
+ "id": "7b73ab39-ffaf-4b9e-86e5-782963c6134b",
620
+ "metadata": {
621
+ "id": "7b73ab39-ffaf-4b9e-86e5-782963c6134b"
622
+ },
623
+ "outputs": [],
624
+ "source": [
625
+ "common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names[\"train\"], num_proc=2)"
626
+ ]
627
+ },
628
+ {
629
+ "cell_type": "markdown",
630
+ "id": "54ce0fdb-7218-4a4d-b175-383980fec0df",
631
+ "metadata": {},
632
+ "source": [
633
+ "Finally, we filter out any training data with audio samples longer than 30s. These samples would otherwise be truncated by the Whisper feature extractor, which could affect the stability of training. We define a function that returns `True` for samples that are shorter than 30s, and `False` for those that are 30s or longer:"
634
+ ]
635
+ },
636
+ {
637
+ "cell_type": "code",
638
+ "execution_count": null,
639
+ "id": "01cb25ef-4bb0-4325-9461-f59198acadf6",
640
+ "metadata": {},
641
+ "outputs": [],
642
+ "source": [
643
+ "max_input_length = 30.0\n",
644
+ "\n",
645
+ "def is_audio_in_length_range(length):\n",
646
+ " return length < max_input_length"
647
+ ]
648
+ },
649
+ {
650
+ "cell_type": "markdown",
651
+ "id": "30e676a8-7ca8-4850-8c5d-5b2b00d13fba",
652
+ "metadata": {},
653
+ "source": [
654
+ "We apply our filter function to all samples of our training dataset through 🤗 Datasets' `.filter` method:"
655
+ ]
656
+ },
657
+ {
658
+ "cell_type": "code",
659
+ "execution_count": null,
660
+ "id": "333f7f6e-6053-4d3b-8924-c733c79b82ac",
661
+ "metadata": {},
662
+ "outputs": [],
663
+ "source": [
664
+ "common_voice[\"train\"] = common_voice[\"train\"].filter(\n",
665
+ " is_audio_in_length_range,\n",
666
+ " input_columns=[\"input_length\"],\n",
667
+ ")"
668
+ ]
669
+ },
670
+ {
671
+ "cell_type": "markdown",
672
+ "id": "263a5a58-0239-4a25-b0df-c625fc9c5810",
673
+ "metadata": {
674
+ "id": "263a5a58-0239-4a25-b0df-c625fc9c5810"
675
+ },
676
+ "source": [
677
+ "## Training and Evaluation"
678
+ ]
679
+ },
680
+ {
681
+ "cell_type": "markdown",
682
+ "id": "a693e768-c5a6-453f-89a1-b601dcf7daf7",
683
+ "metadata": {
684
+ "id": "a693e768-c5a6-453f-89a1-b601dcf7daf7"
685
+ },
686
+ "source": [
687
+ "Now that we've prepared our data, we're ready to dive into the training pipeline. \n",
688
+ "The [🤗 Trainer](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer)\n",
689
+ "will do much of the heavy lifting for us. All we have to do is:\n",
690
+ "\n",
691
+ "- Define a data collator: the data collator takes our pre-processed data and prepares PyTorch tensors ready for the model.\n",
692
+ "\n",
693
+ "- Evaluation metrics: during evaluation, we want to evaluate the model using the [word error rate (WER)](https://huggingface.co/metrics/wer) metric. We need to define a `compute_metrics` function that handles this computation.\n",
694
+ "\n",
695
+ "- Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.\n",
696
+ "\n",
697
+ "- Define the training configuration: this will be used by the 🤗 Trainer to define the training schedule.\n",
698
+ "\n",
699
+ "Once we've fine-tuned the model, we will evaluate it on the test data to verify that we have correctly trained it \n",
700
+ "to transcribe speech in Hindi."
701
+ ]
702
+ },
703
+ {
704
+ "cell_type": "markdown",
705
+ "id": "8d230e6d-624c-400a-bbf5-fa660881df25",
706
+ "metadata": {
707
+ "id": "8d230e6d-624c-400a-bbf5-fa660881df25"
708
+ },
709
+ "source": [
710
+ "### Define a Data Collator"
711
+ ]
712
+ },
713
+ {
714
+ "cell_type": "markdown",
715
+ "id": "04def221-0637-4a69-b242-d3f0c1d0ee78",
716
+ "metadata": {
717
+ "id": "04def221-0637-4a69-b242-d3f0c1d0ee78"
718
+ },
719
+ "source": [
720
+ "The data collator for a sequence-to-sequence speech model is unique in the sense that it \n",
721
+ "treats the `input_features` and `labels` independently: the `input_features` must be \n",
722
+ "handled by the feature extractor and the `labels` by the tokenizer.\n",
723
+ "\n",
724
+ "The `input_features` are already padded to 30s and converted to a log-Mel spectrogram \n",
725
+ "of fixed dimension by action of the feature extractor, so all we have to do is convert the `input_features`\n",
726
+ "to batched PyTorch tensors. We do this using the feature extractor's `.pad` method with `return_tensors=\"pt\"`.\n",
727
+ "\n",
728
+ "The `labels` on the other hand are un-padded. We first pad the sequences\n",
729
+ "to the maximum length in the batch using the tokenizer's `.pad` method. The padding tokens \n",
730
+ "are then replaced by `-100` so that these tokens are **not** taken into account when \n",
731
+ "computing the loss. We then cut the BOS token from the start of the label sequence as we \n",
732
+ "append it later during training.\n",
733
+ "\n",
734
+ "We can leverage the `WhisperProcessor` we defined earlier to perform both the \n",
735
+ "feature extractor and the tokenizer operations:"
736
+ ]
737
+ },
738
+ {
739
+ "cell_type": "code",
740
+ "execution_count": null,
741
+ "id": "8326221e-ec13-4731-bb4e-51e5fc1486c5",
742
+ "metadata": {
743
+ "id": "8326221e-ec13-4731-bb4e-51e5fc1486c5"
744
+ },
745
+ "outputs": [],
746
+ "source": [
747
+ "import torch\n",
748
+ "\n",
749
+ "from dataclasses import dataclass\n",
750
+ "from typing import Any, Dict, List, Union\n",
751
+ "\n",
752
+ "@dataclass\n",
752
+ "class DataCollatorSpeechSeq2SeqWithPadding:\n",
753
+ "    processor: Any\n",
754
+ "\n",
755
+ "    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:\n",
756
+ "        # split inputs and labels since they have to be of different lengths and need different padding methods\n",
757
+ "        # first treat the audio inputs by simply returning torch tensors\n",
758
+ "        input_features = [{\"input_features\": feature[\"input_features\"]} for feature in features]\n",
759
+ "        batch = self.processor.feature_extractor.pad(input_features, return_tensors=\"pt\")\n",
760
+ "\n",
761
+ "        # get the tokenized label sequences\n",
762
+ "        label_features = [{\"input_ids\": feature[\"labels\"]} for feature in features]\n",
763
+ "        # pad the labels to max length\n",
764
+ "        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors=\"pt\")\n",
765
+ "\n",
766
+ "        # replace padding with -100 so that it is ignored by the loss\n",
767
+ "        labels = labels_batch[\"input_ids\"].masked_fill(labels_batch.attention_mask.ne(1), -100)\n",
768
+ "\n",
769
+ "        # if the bos token was prepended in the previous tokenization step,\n",
770
+ "        # cut it here as it's prepended again later anyway\n",
771
+ "        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():\n",
772
+ "            labels = labels[:, 1:]\n",
773
+ "\n",
774
+ "        batch[\"labels\"] = labels\n",
775
+ "\n",
776
+ "        return batch"
778
+ ]
779
+ },
780
+ {
781
+ "cell_type": "markdown",
782
+ "id": "3cae7dbf-8a50-456e-a3a8-7fd005390f86",
783
+ "metadata": {
784
+ "id": "3cae7dbf-8a50-456e-a3a8-7fd005390f86"
785
+ },
786
+ "source": [
787
+ "Let's initialise the data collator we've just defined:"
788
+ ]
789
+ },
790
+ {
791
+ "cell_type": "code",
792
+ "execution_count": null,
793
+ "id": "fc834702-c0d3-4a96-b101-7b87be32bf42",
794
+ "metadata": {
795
+ "id": "fc834702-c0d3-4a96-b101-7b87be32bf42"
796
+ },
797
+ "outputs": [],
798
+ "source": [
799
+ "data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)"
800
+ ]
801
+ },
802
+ {
803
+ "cell_type": "markdown",
804
+ "id": "d62bb2ab-750a-45e7-82e9-61d6f4805698",
805
+ "metadata": {
806
+ "id": "d62bb2ab-750a-45e7-82e9-61d6f4805698"
807
+ },
808
+ "source": [
809
+ "### Evaluation Metrics"
810
+ ]
811
+ },
812
+ {
813
+ "cell_type": "markdown",
814
+ "id": "66fee1a7-a44c-461e-b047-c3917221572e",
815
+ "metadata": {
816
+ "id": "66fee1a7-a44c-461e-b047-c3917221572e"
817
+ },
818
+ "source": [
819
+ "We'll use the word error rate (WER) metric, the 'de-facto' metric for assessing \n",
820
+ "ASR systems. For more information, refer to the WER [docs](https://huggingface.co/metrics/wer). We'll load the WER metric from 🤗 Evaluate:"
821
+ ]
822
+ },
823
+ {
824
+ "cell_type": "code",
825
+ "execution_count": null,
826
+ "id": "b22b4011-f31f-4b57-b684-c52332f92890",
827
+ "metadata": {
828
+ "id": "b22b4011-f31f-4b57-b684-c52332f92890"
829
+ },
830
+ "outputs": [],
831
+ "source": [
832
+ "import evaluate\n",
833
+ "\n",
834
+ "metric = evaluate.load(\"wer\")"
835
+ ]
836
+ },
837
+ {
838
+ "cell_type": "markdown",
839
+ "id": "4f32cab6-31f0-4cb9-af4c-40ba0f5fc508",
840
+ "metadata": {
841
+ "id": "4f32cab6-31f0-4cb9-af4c-40ba0f5fc508"
842
+ },
843
+ "source": [
844
+ "We then simply have to define a function that takes our model \n",
845
+ "predictions and returns the WER metric. This function, called\n",
846
+ "`compute_metrics`, first replaces `-100` with the `pad_token_id`\n",
847
+ "in the `label_ids` (undoing the step we applied in the \n",
848
+ "data collator to ignore padded tokens correctly in the loss).\n",
849
+ "It then decodes the predicted and label ids to strings. Finally,\n",
850
+ "it computes the WER between the predictions and reference labels. \n",
851
+ "Here, we have the option of evaluating with the 'normalised' transcriptions \n",
852
+ "and predictions. We recommend you set this to `True` to benefit from the WER \n",
853
+ "improvement obtained by normalising the transcriptions."
854
+ ]
855
+ },
856
+ {
857
+ "cell_type": "code",
858
+ "execution_count": null,
859
+ "id": "23959a70-22d0-4ffe-9fa1-72b61e75bb52",
860
+ "metadata": {
861
+ "id": "23959a70-22d0-4ffe-9fa1-72b61e75bb52"
862
+ },
863
+ "outputs": [],
864
+ "source": [
865
+ "# evaluate with the 'normalised' WER\n",
866
+ "do_normalize_eval = True\n",
867
+ "\n",
868
+ "def compute_metrics(pred):\n",
868
+ "    pred_ids = pred.predictions\n",
869
+ "    label_ids = pred.label_ids\n",
870
+ "\n",
871
+ "    # replace -100 with the pad_token_id\n",
872
+ "    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id\n",
873
+ "\n",
874
+ "    # we do not want to group tokens when computing the metrics\n",
875
+ "    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)\n",
876
+ "    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)\n",
877
+ "\n",
878
+ "    if do_normalize_eval:\n",
879
+ "        pred_str = [normalizer(pred) for pred in pred_str]\n",
880
+ "        label_str = [normalizer(label) for label in label_str]\n",
881
+ "\n",
882
+ "    wer = 100 * metric.compute(predictions=pred_str, references=label_str)\n",
883
+ "\n",
884
+ "    return {\"wer\": wer}"
886
+ ]
887
+ },
888
+ {
889
+ "cell_type": "markdown",
890
+ "id": "daf2a825-6d9f-4a23-b145-c37c0039075b",
891
+ "metadata": {
892
+ "id": "daf2a825-6d9f-4a23-b145-c37c0039075b"
893
+ },
894
+ "source": [
895
+ "### Load a Pre-Trained Checkpoint"
896
+ ]
897
+ },
898
+ {
899
+ "cell_type": "markdown",
900
+ "id": "437a97fa-4864-476b-8abc-f28b8166cfa5",
901
+ "metadata": {
902
+ "id": "437a97fa-4864-476b-8abc-f28b8166cfa5"
903
+ },
904
+ "source": [
905
+ "Now let's load the pre-trained Whisper `small` checkpoint. Again, this \n",
906
+ "is trivial through use of 🤗 Transformers!"
907
+ ]
908
+ },
909
+ {
910
+ "cell_type": "code",
911
+ "execution_count": null,
912
+ "id": "5a10cc4b-07ec-4ebd-ac1d-7c601023594f",
913
+ "metadata": {
914
+ "id": "5a10cc4b-07ec-4ebd-ac1d-7c601023594f"
915
+ },
916
+ "outputs": [],
917
+ "source": [
918
+ "from transformers import WhisperForConditionalGeneration\n",
919
+ "\n",
920
+ "model = WhisperForConditionalGeneration.from_pretrained(\"openai/whisper-small\")"
921
+ ]
922
+ },
923
+ {
924
+ "cell_type": "markdown",
925
+ "id": "a15ead5f-2277-4a39-937b-585c2497b2df",
926
+ "metadata": {
927
+ "id": "a15ead5f-2277-4a39-937b-585c2497b2df"
928
+ },
929
+ "source": [
930
+ "Override generation arguments - no tokens are forced as decoder outputs (see [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)), no tokens are suppressed during generation (see [`suppress_tokens`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.suppress_tokens)). Set `use_cache` to False since we're using gradient checkpointing, and the two are incompatible:"
931
+ ]
932
+ },
933
+ {
934
+ "cell_type": "code",
935
+ "execution_count": null,
936
+ "id": "62038ba3-88ed-4fce-84db-338f50dcd04f",
937
+ "metadata": {
938
+ "id": "62038ba3-88ed-4fce-84db-338f50dcd04f"
939
+ },
940
+ "outputs": [],
941
+ "source": [
942
+ "model.config.forced_decoder_ids = None\n",
943
+ "model.config.suppress_tokens = []\n",
944
+ "model.config.use_cache = False"
945
+ ]
946
+ },
947
+ {
948
+ "cell_type": "markdown",
949
+ "id": "2178dea4-80ca-47b6-b6ea-ba1915c90c06",
950
+ "metadata": {
951
+ "id": "2178dea4-80ca-47b6-b6ea-ba1915c90c06"
952
+ },
953
+ "source": [
954
+ "### Define the Training Configuration"
955
+ ]
956
+ },
957
+ {
958
+ "cell_type": "markdown",
959
+ "id": "c21af1e9-0188-4134-ac82-defc7bdcc436",
960
+ "metadata": {
961
+ "id": "c21af1e9-0188-4134-ac82-defc7bdcc436"
962
+ },
963
+ "source": [
964
+ "In the final step, we define all the parameters related to training. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments)."
965
+ ]
966
+ },
967
+ {
968
+ "cell_type": "code",
969
+ "execution_count": null,
970
+ "id": "0ae3e9af-97b7-4aa0-ae85-20b23b5bcb3a",
971
+ "metadata": {
972
+ "id": "0ae3e9af-97b7-4aa0-ae85-20b23b5bcb3a"
973
+ },
974
+ "outputs": [],
975
+ "source": [
976
+ "from transformers import Seq2SeqTrainingArguments\n",
977
+ "\n",
978
+ "training_args = Seq2SeqTrainingArguments(\n",
979
+ " output_dir=\"./\",\n",
980
+ " per_device_train_batch_size=64,\n",
981
+ " gradient_accumulation_steps=1, # increase by 2x for every 2x decrease in batch size\n",
982
+ " learning_rate=1e-5,\n",
983
+ " warmup_steps=500,\n",
984
+ " max_steps=5000,\n",
985
+ " gradient_checkpointing=True,\n",
986
+ " fp16=True,\n",
987
+ " evaluation_strategy=\"steps\",\n",
988
+ " per_device_eval_batch_size=8,\n",
989
+ " predict_with_generate=True,\n",
990
+ " generation_max_length=225,\n",
991
+ " save_steps=1000,\n",
992
+ " eval_steps=1000,\n",
993
+ " logging_steps=25,\n",
994
+ " report_to=[\"tensorboard\"],\n",
995
+ " load_best_model_at_end=True,\n",
996
+ " metric_for_best_model=\"wer\",\n",
997
+ " greater_is_better=False,\n",
998
+ " push_to_hub=True,\n",
999
+ ")"
1000
+ ]
1001
+ },
1002
+ {
1003
+ "cell_type": "markdown",
1004
+ "id": "b3a944d8-3112-4552-82a0-be25988b3857",
1005
+ "metadata": {
1006
+ "id": "b3a944d8-3112-4552-82a0-be25988b3857"
1007
+ },
1008
+ "source": [
1009
+ "**Note**: if one does not want to upload the model checkpoints to the Hub, \n",
1010
+ "set `push_to_hub=False`."
1011
+ ]
1012
+ },
1013
+ {
1014
+ "cell_type": "markdown",
1015
+ "id": "bac29114-d226-4f54-97cf-8718c9f94e1e",
1016
+ "metadata": {
1017
+ "id": "bac29114-d226-4f54-97cf-8718c9f94e1e"
1018
+ },
1019
+ "source": [
1020
+ "We can forward the training arguments to the 🤗 Trainer along with our model,\n",
1021
+ "dataset, data collator and `compute_metrics` function:"
1022
+ ]
1023
+ },
1024
+ {
1025
+ "cell_type": "code",
1026
+ "execution_count": null,
1027
+ "id": "d546d7fe-0543-479a-b708-2ebabec19493",
1028
+ "metadata": {
1029
+ "id": "d546d7fe-0543-479a-b708-2ebabec19493"
1030
+ },
1031
+ "outputs": [],
1032
+ "source": [
1033
+ "from transformers import Seq2SeqTrainer\n",
1034
+ "\n",
1035
+ "trainer = Seq2SeqTrainer(\n",
1036
+ " args=training_args,\n",
1037
+ " model=model,\n",
1038
+ " train_dataset=common_voice[\"train\"],\n",
1039
+ " eval_dataset=common_voice[\"test\"],\n",
1040
+ " data_collator=data_collator,\n",
1041
+ " compute_metrics=compute_metrics,\n",
1042
+ " tokenizer=processor.feature_extractor,\n",
1043
+ ")"
1044
+ ]
1045
+ },
1046
+ {
1047
+ "cell_type": "markdown",
1048
+ "id": "uOrRhDGtN5S4",
1049
+ "metadata": {
1050
+ "id": "uOrRhDGtN5S4"
1051
+ },
1052
+ "source": [
1053
+ "We'll save the processor object once before starting training. Since the processor is not trainable, it won't change over the course of training:"
1054
+ ]
1055
+ },
1056
+ {
1057
+ "cell_type": "code",
1058
+ "execution_count": null,
1059
+ "id": "-2zQwMfEOBJq",
1060
+ "metadata": {
1061
+ "id": "-2zQwMfEOBJq"
1062
+ },
1063
+ "outputs": [],
1064
+ "source": [
1065
+ "processor.save_pretrained(training_args.output_dir)"
1066
+ ]
1067
+ },
1068
+ {
1069
+ "cell_type": "markdown",
1070
+ "id": "7f404cf9-4345-468c-8196-4bd101d9bd51",
1071
+ "metadata": {
1072
+ "id": "7f404cf9-4345-468c-8196-4bd101d9bd51"
1073
+ },
1074
+ "source": [
1075
+ "### Training"
1076
+ ]
1077
+ },
1078
+ {
1079
+ "cell_type": "markdown",
1080
+ "id": "5e8b8d56-5a70-4f68-bd2e-f0752d0bd112",
1081
+ "metadata": {
1082
+ "id": "5e8b8d56-5a70-4f68-bd2e-f0752d0bd112"
1083
+ },
1084
+ "source": [
1085
+ "Training will take approximately 5-10 hours depending on your GPU. The peak GPU memory for the given training configuration is approximately 36GB. \n",
1086
+ "Depending on your GPU, it is possible that you will encounter a CUDA `\"out-of-memory\"` error when you launch training. \n",
1087
+ "In this case, you can reduce the `per_device_train_batch_size` incrementally by factors of 2 \n",
1088
+ "and employ [`gradient_accumulation_steps`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments.gradient_accumulation_steps)\n",
1089
+ "to compensate.\n",
1090
+ "\n",
1091
+ "To launch training, simply execute:"
1092
+ ]
1093
+ },
1094
+ {
1095
+ "cell_type": "code",
1096
+ "execution_count": null,
1097
+ "id": "ee8b7b8e-1c9a-4d77-9137-1778a629e6de",
1098
+ "metadata": {
1099
+ "id": "ee8b7b8e-1c9a-4d77-9137-1778a629e6de"
1100
+ },
1101
+ "outputs": [],
1102
+ "source": [
1103
+ "trainer.train()"
1104
+ ]
1105
+ },
1106
+ {
1107
+ "cell_type": "markdown",
1108
+ "id": "810ced54-7187-4a06-b2fe-ba6dcca94dc3",
1109
+ "metadata": {
1110
+ "id": "810ced54-7187-4a06-b2fe-ba6dcca94dc3"
1111
+ },
1112
+ "source": [
1113
+ "We can label our checkpoint with the `whisper-event` tag on push by setting the appropriate key-word arguments (kwargs):"
1114
+ ]
1115
+ },
1116
+ {
1117
+ "cell_type": "code",
1118
+ "execution_count": null,
1119
+ "id": "c704f91e-241b-48c9-b8e0-f0da396a9663",
1120
+ "metadata": {
1121
+ "id": "c704f91e-241b-48c9-b8e0-f0da396a9663"
1122
+ },
1123
+ "outputs": [],
1124
+ "source": [
1125
+ "kwargs = {\n",
1126
+ " \"dataset_tags\": \"mozilla-foundation/common_voice_11_0\",\n",
1127
+ " \"dataset\": \"Common Voice 11.0\", # a 'pretty' name for the training dataset\n",
1128
+ " \"language\": \"hi\",\n",
1129
+ " \"model_name\": \"Whisper Small Hi - Sanchit Gandhi\", # a 'pretty' name for your model\n",
1130
+ " \"finetuned_from\": \"openai/whisper-small\",\n",
1131
+ " \"tasks\": \"automatic-speech-recognition\",\n",
1132
+ " \"tags\": \"whisper-event\",\n",
1133
+ "}"
1134
+ ]
1135
+ },
1136
+ {
1137
+ "cell_type": "markdown",
1138
+ "id": "090d676a-f944-4297-a938-a40eda0b2b68",
1139
+ "metadata": {
1140
+ "id": "090d676a-f944-4297-a938-a40eda0b2b68"
1141
+ },
1142
+ "source": [
1143
+ "The training results can now be uploaded to the Hub. To do so, execute the `push_to_hub` command and save the preprocessor object we created:"
1144
+ ]
1145
+ },
1146
+ {
1147
+ "cell_type": "code",
1148
+ "execution_count": null,
1149
+ "id": "d7030622-caf7-4039-939b-6195cdaa2585",
1150
+ "metadata": {
1151
+ "id": "d7030622-caf7-4039-939b-6195cdaa2585"
1152
+ },
1153
+ "outputs": [],
1154
+ "source": [
1155
+ "trainer.push_to_hub(**kwargs)"
1156
+ ]
1157
+ },
1158
+ {
1159
+ "cell_type": "markdown",
1160
+ "id": "ca743fbd-602c-48d4-ba8d-a2fe60af64ba",
1161
+ "metadata": {
1162
+ "id": "ca743fbd-602c-48d4-ba8d-a2fe60af64ba"
1163
+ },
1164
+ "source": [
1165
+ "## Closing Remarks"
1166
+ ]
1167
+ },
1168
+ {
1169
+ "cell_type": "markdown",
1170
+ "id": "7f737783-2870-4e35-aa11-86a42d7d997a",
1171
+ "metadata": {
1172
+ "id": "7f737783-2870-4e35-aa11-86a42d7d997a"
1173
+ },
1174
+ "source": [
1175
+ "In this blog, we covered a step-by-step guide on fine-tuning Whisper for multilingual ASR \n",
1176
+ "using 🤗 Datasets, Transformers and the Hugging Face Hub. For more details on the Whisper model, the Common Voice dataset and the theory behind fine-tuning, refer to the accompanying [blog post](https://huggingface.co/blog/fine-tune-whisper). If you're interested in fine-tuning other \n",
1177
+ "Transformers models, both for English and multilingual ASR, be sure to check out the \n",
1178
+ "example scripts at [examples/pytorch/speech-recognition](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition)."
1179
+ ]
1180
+ }
1181
+ ],
1182
+ "metadata": {
1183
+ "colab": {
1184
+ "include_colab_link": true,
1185
+ "provenance": []
1186
+ },
1187
+ "kernelspec": {
1188
+ "display_name": "Python 3 (ipykernel)",
1189
+ "language": "python",
1190
+ "name": "python3"
1191
+ },
1192
+ "language_info": {
1193
+ "codemirror_mode": {
1194
+ "name": "ipython",
1195
+ "version": 3
1196
+ },
1197
+ "file_extension": ".py",
1198
+ "mimetype": "text/x-python",
1199
+ "name": "python",
1200
+ "nbconvert_exporter": "python",
1201
+ "pygments_lexer": "ipython3",
1202
+ "version": "3.8.9"
1203
+ }
1204
+ },
1205
+ "nbformat": 4,
1206
+ "nbformat_minor": 5
1207
+ }
fine-tune-whisper-streaming.ipynb ADDED
@@ -0,0 +1,883 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "75b58048-7d14-4fc6-8085-1fc08c81b4a6",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Fine-Tune Whisper With 🤗 Transformers and Streaming Mode"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "fbfa8ad5-4cdc-4512-9058-836cbbf65e1a",
14
+ "metadata": {},
15
+ "source": [
16
+ "In this Colab, we present a step-by-step guide on fine-tuning Whisper with Hugging Face 🤗 Transformers on 400 hours of speech data! Using streaming mode, we'll show how you can train a speech recognition model on any dataset, irrespective of size. With streaming mode, storage requirements are no longer a consideration: you can train a model on whatever dataset you want, even if its download size exceeds your device's disk space. How can this be possible? It simply seems too good to be true! Well, rest assured it's not 😉 Carry on reading to find out more."
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "afe0d503-ae4e-4aa7-9af4-dbcba52db41e",
22
+ "metadata": {},
23
+ "source": [
24
+ "## Introduction"
25
+ ]
26
+ },
27
+ {
28
+ "cell_type": "markdown",
29
+ "id": "9ae91ed4-9c3e-4ade-938e-f4c2dcfbfdc0",
30
+ "metadata": {},
31
+ "source": [
32
+ "Speech recognition datasets are large. A typical speech dataset consists of approximately 100 hours of audio-transcription data, requiring upwards of 130GB of storage space for download and preparation. For most ASR researchers, this is already at the upper limit of what is feasible for disk space. So what happens when we want to train on a larger dataset? The full [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) dataset consists of 960 hours of audio data. Kensho's [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech) contains 5,000 hours of audio data. ML Commons [People's Speech](https://huggingface.co/datasets/MLCommons/peoples_speech) contains **30,000+** hours of audio data! Do we need to bite the bullet and buy additional storage? Or is there a way we can train on all of these datasets with no disk drive requirements?\n",
33
+ "\n",
34
+ "When training machine learning systems, we rarely use the entire dataset at once. We typically _batch_ our data into smaller subsets of data, and pass these incrementally through our training pipeline. This is because we train our system on an accelerator device, such as a GPU or TPU, which has a memory limit typically around 16GB. We have to fit our model, optimiser and training data all on the same accelerator device, so we usually have to divide the dataset up into smaller batches and move them from the CPU to the GPU when required.\n",
35
+ "\n",
36
+ "Consequently, we don't require the entire dataset to be downloaded at once; we simply need the batch of data that we pass to our model in any one go. We can leverage this principle of partial dataset loading when preparing our dataset: rather than downloading the entire dataset at the start, we can load each piece of data as and when we need it. For each batch, we load the relevant data from a remote server and pass it through the training pipeline. For the next batch, we load the next items and again pass them through the training pipeline. At no point do we have to save data to our disk drive; we simply load it into memory and use it in our pipeline. In doing so, we only ever need as much memory as each individual batch requires.\n",
37
+ "\n",
38
+ "This is analogous to downloading a TV show versus streaming it 📺 When we download a TV show, we download the entire video offline and save it to our disk. Compare this to when we stream a TV show. Here, we don't save any part of the video to disk, but iterate over the video file and load each part in real-time as required. It's this same principle that we can apply to our ML training pipeline! We want to iterate over the dataset and load each sample of data as required.\n",
39
+ "\n",
40
+ "While the principle of partial dataset loading sounds ideal, it also seems **pretty** difficult to do. Luckily for us, 🤗 Datasets allows us to do this with minimal code changes! We'll make use of the principle of [_streaming_](https://huggingface.co/docs/datasets/stream), depicted graphically in Figure 1. Streaming does exactly this: the data is loaded progressively as we iterate over the dataset, meaning it is only loaded as and when we need it. If you're familiar with 🤗 Transformers and Datasets, the content of this notebook will be very familiar, with some small extensions to support streaming mode."
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "markdown",
45
+ "id": "1c87f76e-47be-4a5d-bc52-7b1c2e9d4f5a",
46
+ "metadata": {},
47
+ "source": [
48
+ "<figure>\n",
49
+ "<img src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/streaming.gif\" alt=\"Trulli\" style=\"width:100%\">\n",
50
+ "<figcaption align = \"center\"><b>Figure 1:</b> Streaming mode. The dataset is divided into smaller subsets, with subsets loaded progressively as we iterate over the dataset. </figcaption>\n",
51
+ "</figure>"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "markdown",
56
+ "id": "21b6316e-8a55-4549-a154-66d3da2ab74a",
57
+ "metadata": {},
58
+ "source": [
59
+ "This notebook provides a guide to fine-tuning on the task of _speech recognition_, which involves learning a\n",
60
+ "mapping from speech to text. Speech recognition is divided into two categories: English-only or multilingual (all other languages). \n",
61
+ "This notebook applies to both categories, with pointers for changing between languages and datasets.\n",
62
+ "\n",
63
+ "As for our model, we'll fine-tune the Whisper model released in [September 2022](https://openai.com/blog/whisper/) by the authors \n",
64
+ "Alec Radford et al. from OpenAI. Whisper is an encoder-decoder model pre-trained on 680k hours of labelled audio-transcription data. \n",
65
+ "It achieves strong performance on many speech recognition and speech translation datasets without fine-tuning. With fine-tuning, \n",
66
+ "we aim to improve upon these results further, with many SoTA results up for grabs! For a full explanation on the Whisper model, the \n",
67
+ "reader is advised to read the blog post [Fine-Tune Whisper with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper#introduction).\n",
68
+ "\n",
69
+ "The Whisper checkpoints come in five configurations of varying model sizes.\n",
70
+ "The smallest four are trained on either English-only or multilingual data.\n",
71
+ "The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints \n",
72
+ "are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The \n",
73
+ "checkpoints are summarised in the following table with links to the models on the Hub:\n",
74
+ "\n",
75
+ "| Size | Layers | Width | Heads | Parameters | English-only | Multilingual |\n",
76
+ "|--------|--------|-------|-------|------------|------------------------------------------------------|---------------------------------------------------|\n",
77
+ "| tiny | 4 | 384 | 6 | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |\n",
78
+ "| base | 6 | 512 | 8 | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |\n",
79
+ "| small | 12 | 768 | 12 | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |\n",
80
+ "| medium | 24 | 1024 | 16 | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |\n",
81
+ "| large | 32 | 1280 | 20 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |\n",
82
+ "\n",
83
+ "When fine-tuning on an English dataset for speech recognition, it is recommended to select one of the English-only checkpoints. For any other language, it is recommended to select a multilingual checkpoint.\n",
84
+ "\n",
85
+ "For demonstration purposes, we'll fine-tune the multilingual version of the \n",
86
+ "[`\"small\"`](https://huggingface.co/openai/whisper-small) checkpoint with 244M params (~= 1GB). \n",
87
+ "As for our data, we'll train and evaluate our system on 400 hours of multilingual speech recognition data\n",
88
+ "taken from the [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)\n",
89
+ "dataset. We'll show how we can train a model on 400 hours of training data using the default disk space \n",
90
+ "that comes with a standard GPU device or Google Colab."
91
+ ]
92
+ },
93
+ {
94
+ "cell_type": "markdown",
95
+ "id": "b219c9dd-39b6-4a95-b2a1-3f547a1e7bc0",
96
+ "metadata": {},
97
+ "source": [
98
+ "## Load Dataset with Streaming"
99
+ ]
100
+ },
101
+ {
102
+ "cell_type": "markdown",
103
+ "id": "b17a4763-4381-4157-ae38-b04a8b5f1c43",
104
+ "metadata": {},
105
+ "source": [
106
+ "This is where the magic happens! We'll first write a wrapper function around 🤗 Datasets' `load_dataset` function. This wrapper loads the required splits in streaming mode by forcing `streaming=True` in the `load_dataset` call. Multiple splits can be combined (interleaved) by concatenating them with the \"+\" symbol when specifying the split name, e.g. `split=\"train+validation\"` will return a single split with the training and validation splits interleaved together. The wrapper has the same arguments and keyword arguments as 🤗 Datasets' `load_dataset`, so we can use it in exactly the same way!"
107
+ ]
108
+ },
109
+ {
110
+ "cell_type": "code",
111
+ "execution_count": null,
112
+ "id": "065a8cf7-e54f-4ac3-900e-609c80714fca",
113
+ "metadata": {},
114
+ "outputs": [],
115
+ "source": [
116
+ "from datasets import interleave_datasets, load_dataset\n",
117
+ "\n",
118
+ "def load_streaming_dataset(dataset_name, dataset_config_name, split, **kwargs):\n",
119
+ " if \"+\" in split:\n",
120
+ " # load multiple splits separated by the `+` symbol *with* streaming mode\n",
121
+ " dataset_splits = [load_dataset(dataset_name, dataset_config_name, split=split_name, streaming=True, **kwargs) for split_name in split.split(\"+\")]\n",
122
+ " # interleave multiple splits to form one dataset\n",
123
+ " interleaved_dataset = interleave_datasets(dataset_splits)\n",
124
+ " return interleaved_dataset\n",
125
+ " else:\n",
126
+ " # load a single split *with* streaming mode\n",
127
+ " dataset = load_dataset(dataset_name, dataset_config_name, split=split, streaming=True, **kwargs)\n",
128
+ " return dataset"
129
+ ]
130
+ },
131
+ {
132
+ "cell_type": "markdown",
133
+ "id": "674429c5-0ab4-4adf-975b-621bb69eca38",
134
+ "metadata": {},
135
+ "source": [
136
+ "We'll train our system on the Spanish split of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). We can see how much training data we have by viewing the [language page](https://commonvoice.mozilla.org/en/datasets) on the Common Voice website. The Spanish split has over 400 hours of labelled training data - that's enormous! More than we could ever fit on a Google Colab or a standard workstation. But with streaming mode, we'll only download data as and when we need it, making training on this dataset possible!\n",
137
+ "\n",
138
+ "Since Spanish is relatively high-resource, we'll only use the `train` split for training and the `test` split for evaluation. If you're training on a low-resource language, such as the Hindi split of Common Voice 11, it's worth combining the `train` and `validation` splits to give a larger training set. You can achieve this by setting: `split=\"train+validation\"` for the training split.\n",
139
+ "\n",
140
+ "If you're using a gated dataset, like Common Voice 11, ensure you have accepted the terms of use on the Hugging Face Hub: [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). Once you have accepted the terms, you will have full access to the dataset and be able to load the data locally."
141
+ ]
142
+ },
143
+ {
144
+ "cell_type": "code",
145
+ "execution_count": null,
146
+ "id": "a2787582-554f-44ce-9f38-4180a5ed6b44",
147
+ "metadata": {},
148
+ "outputs": [],
149
+ "source": [
150
+ "from datasets import IterableDatasetDict\n",
151
+ "\n",
152
+ "raw_datasets = IterableDatasetDict()\n",
153
+ "\n",
154
+ "raw_datasets[\"train\"] = load_streaming_dataset(\"mozilla-foundation/common_voice_11_0\", \"es\", split=\"train\", use_auth_token=True) # set split=\"train+validation\" for low-resource\n",
155
+ "raw_datasets[\"test\"] = load_streaming_dataset(\"mozilla-foundation/common_voice_11_0\", \"es\", split=\"test\", use_auth_token=True)"
156
+ ]
157
+ },
158
+ {
159
+ "cell_type": "markdown",
160
+ "id": "2d63b2d2-f68a-4d74-b7f1-5127f6d16605",
161
+ "metadata": {},
162
+ "source": [
163
+ "## Prepare Processor and Pre-Process Data"
164
+ ]
165
+ },
166
+ {
167
+ "cell_type": "markdown",
168
+ "id": "601c3099-1026-439e-93e2-5635b3ba5a73",
169
+ "metadata": {},
170
+ "source": [
171
+ "The ASR pipeline can be de-composed into three stages: \n",
172
+ "1) A feature extractor which pre-processes the raw audio-inputs\n",
173
+ "2) The model which performs the sequence-to-sequence mapping \n",
174
+ "3) A tokenizer which post-processes the model outputs to text format\n",
175
+ "\n",
176
+ "In 🤗 Transformers, the Whisper model has an associated feature extractor and tokenizer, \n",
177
+ "called [WhisperFeatureExtractor](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperFeatureExtractor)\n",
178
+ "and [WhisperTokenizer](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperTokenizer) \n",
179
+ "respectively. To make our lives simple, these two objects are wrapped under a single class, called the [WhisperProcessor](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor). We can call the WhisperProcessor to perform \n",
180
+ "both the audio pre-processing and the text token post-processing. In doing so, we only need to keep track of two objects during training: \n",
181
+ "the `processor` and the `model`.\n",
182
+ "\n",
183
+ "If using a multilingual checkpoint, you should set the `\"language\"` to your target text language. You should also set the task to `\"transcribe\"` for speech recognition and `\"translate\"` for speech translation. These arguments modify the behaviour of the tokenizer - they should be set correctly to ensure the target labels are encoded properly. These arguments should be omitted for English-only fine-tuning."
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "code",
188
+ "execution_count": null,
189
+ "id": "77d9f0c5-8607-4642-a8ac-c3ab2e223ea6",
190
+ "metadata": {},
191
+ "outputs": [],
192
+ "source": [
193
+ "from transformers import WhisperProcessor\n",
194
+ "\n",
195
+ "processor = WhisperProcessor.from_pretrained(\"openai/whisper-small\", language=\"Spanish\", task=\"transcribe\")"
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "markdown",
200
+ "id": "381acd09-0b0f-4d04-9eb3-f028ac0e5f2c",
201
+ "metadata": {},
202
+ "source": [
203
+ "### Pre-Process Data"
204
+ ]
205
+ },
206
+ {
207
+ "cell_type": "markdown",
208
+ "id": "bf10cd3e-924e-44fc-8790-46e413de7b3d",
209
+ "metadata": {},
210
+ "source": [
211
+ "Let's have a look at the dataset features. Pay particular attention to the `\"audio\"` column - this details the sampling rate of our audio inputs:"
212
+ ]
213
+ },
214
+ {
215
+ "cell_type": "code",
216
+ "execution_count": null,
217
+ "id": "ab5a13b4-9bd4-4aa0-aef2-b3de9b762988",
218
+ "metadata": {},
219
+ "outputs": [],
220
+ "source": [
221
+ "raw_datasets[\"train\"].features"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "5a679f05-063d-41b3-9b58-4fc9c6ccf4fd",
227
+ "metadata": {},
228
+ "source": [
229
+ "Since our input audio is sampled at 48kHz, we need to _downsample_ it to\n",
230
+ "16kHz prior to passing it to the Whisper feature extractor, 16kHz being the sampling rate expected by the Whisper model. \n",
231
+ "\n",
232
+ "We'll set the audio inputs to the correct sampling rate using 🤗 Datasets' \n",
233
+ "[`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cast_column#datasets.DatasetDict.cast_column)\n",
234
+ "method. This operation does not change the audio in-place, \n",
235
+ "but rather signals to `datasets` to resample audio samples _on the fly_ the \n",
236
+ "first time that they are loaded:"
237
+ ]
238
+ },
239
+ {
240
+ "cell_type": "code",
241
+ "execution_count": null,
242
+ "id": "3ab6a724-3d1e-478b-a9e9-d2f85feb6c39",
243
+ "metadata": {},
244
+ "outputs": [],
245
+ "source": [
246
+ "from datasets import Audio\n",
247
+ "\n",
248
+ "raw_datasets = raw_datasets.cast_column(\"audio\", Audio(sampling_rate=16000))"
249
+ ]
250
+ },
251
+ {
252
+ "cell_type": "markdown",
253
+ "id": "161322c2-94f3-4d26-9e1d-d9d5202ca3cf",
254
+ "metadata": {},
255
+ "source": [
256
+ "We'll define our pre-processing strategy. We advise that you **do not** lower-case the transcriptions or remove punctuation unless mixing different datasets. This will enable you to fine-tune Whisper models that can predict punctuation and casing. Later, you will see how we can evaluate the predictions without punctuation or casing, so that the models benefit from the WER improvement obtained by normalising the transcriptions while still predicting fully formatted transcriptions."
257
+ ]
258
+ },
259
+ {
260
+ "cell_type": "code",
261
+ "execution_count": null,
262
+ "id": "d041650e-1c48-4439-87b3-5b6f4a514107",
263
+ "metadata": {},
264
+ "outputs": [],
265
+ "source": [
266
+ "from transformers.models.whisper.english_normalizer import BasicTextNormalizer\n",
267
+ "\n",
268
+ "do_lower_case = False\n",
269
+ "do_remove_punctuation = False\n",
270
+ "\n",
271
+ "normalizer = BasicTextNormalizer()"
272
+ ]
273
+ },
274
+ {
275
+ "cell_type": "markdown",
276
+ "id": "bfaa935b-a11d-497c-88c1-0c4d1bb3247b",
277
+ "metadata": {},
278
+ "source": [
279
+ "Now we can write a function to prepare our data ready for the model:\n",
280
+ "1. We load and resample the audio data by calling `batch[\"audio\"]`. As explained above, 🤗 Datasets performs any necessary resampling operations on the fly.\n",
281
+ "2. We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.\n",
282
+ "3. We perform any optional pre-processing (lower-case or remove punctuation).\n",
283
+ "4. We encode the transcriptions to label ids through the use of the tokenizer."
284
+ ]
285
+ },
286
+ {
287
+ "cell_type": "code",
288
+ "execution_count": null,
289
+ "id": "c085911c-a10a-41ef-8874-306e0503e9bb",
290
+ "metadata": {},
291
+ "outputs": [],
292
+ "source": [
293
+ "def prepare_dataset(batch):\n",
294
+ " # load and (possibly) resample audio data to 16kHz\n",
295
+ " audio = batch[\"audio\"]\n",
296
+ "\n",
297
+ " # compute log-Mel input features from input audio array \n",
298
+ " batch[\"input_features\"] = processor.feature_extractor(audio[\"array\"], sampling_rate=audio[\"sampling_rate\"]).input_features[0]\n",
299
+ " # compute input length of audio sample in seconds\n",
300
+ " batch[\"input_length\"] = len(audio[\"array\"]) / audio[\"sampling_rate\"]\n",
301
+ " \n",
302
+ " # optional pre-processing steps\n",
303
+ " transcription = batch[\"sentence\"]\n",
304
+ " if do_lower_case:\n",
305
+ " transcription = transcription.lower()\n",
306
+ " if do_remove_punctuation:\n",
307
+ " transcription = normalizer(transcription).strip()\n",
308
+ " \n",
309
+ " # encode target text to label ids\n",
310
+ " batch[\"labels\"] = processor.tokenizer(transcription).input_ids\n",
311
+ " return batch"
312
+ ]
313
+ },
314
+ {
315
+ "cell_type": "markdown",
316
+ "id": "70b319fb-2439-4ef6-a70d-a47bf41c4a13",
317
+ "metadata": {},
318
+ "source": [
319
+ "We can apply the data preparation function to all of our training examples using 🤗 Datasets' `.map` method. We'll remove all of the columns from the raw training data, leaving just the `input_features`, `input_length` and `labels` columns defined in the `prepare_dataset` function:"
320
+ ]
321
+ },
322
+ {
323
+ "cell_type": "code",
324
+ "execution_count": null,
325
+ "id": "a37a7cdb-9013-427f-8de9-6a8d0e9dc684",
326
+ "metadata": {},
327
+ "outputs": [],
328
+ "source": [
329
+ "vectorized_datasets = raw_datasets.map(prepare_dataset, remove_columns=list(next(iter(raw_datasets.values())).features)).with_format(\"torch\")"
330
+ ]
331
+ },
332
+ {
333
+ "cell_type": "markdown",
334
+ "id": "3d59b37e-4950-47ec-9e3e-2cf2ec7fc750",
335
+ "metadata": {},
336
+ "source": [
337
+ "We can now define how we shuffle the data in the train split. The size of the subset we load is set by the variable `buffer_size`. You can increase or decrease this depending on your memory constraints. In this example, the `buffer_size` is set to 500, meaning 500 samples are loaded before shuffling across the subset. The larger we set this value, the closer we get to true offline shuffling. The `seed` is set for reproducibility:"
338
+ ]
339
+ },
340
+ {
341
+ "cell_type": "code",
342
+ "execution_count": null,
343
+ "id": "1b145699-acfc-4b1d-93a2-a2ad3d62674c",
344
+ "metadata": {},
345
+ "outputs": [],
346
+ "source": [
347
+ "vectorized_datasets[\"train\"] = vectorized_datasets[\"train\"].shuffle(\n",
348
+ " buffer_size=500,\n",
349
+ " seed=0,\n",
350
+ ")"
351
+ ]
352
+ },
353
+ {
354
+ "cell_type": "markdown",
355
+ "id": "666b9ef0-7909-4e1e-a419-87604d233e29",
356
+ "metadata": {},
357
+ "source": [
358
+ "Finally, we filter out any training data with audio samples longer than 30s. These samples would otherwise be truncated by the Whisper feature extractor, which could affect the stability of training. We define a function that returns `True` for samples that are shorter than 30s, and `False` for those that are longer:"
359
+ ]
360
+ },
361
+ {
362
+ "cell_type": "code",
363
+ "execution_count": null,
364
+ "id": "01cb25ef-4bb0-4325-9461-f59198acadf6",
365
+ "metadata": {},
366
+ "outputs": [],
367
+ "source": [
368
+ "max_input_length = 30.0\n",
369
+ "\n",
370
+ "def is_audio_in_length_range(length):\n",
371
+ " return length < max_input_length"
372
+ ]
373
+ },
374
+ {
375
+ "cell_type": "markdown",
376
+ "id": "28e37ac3-b1c5-465b-8586-7cfd8d76b0f1",
377
+ "metadata": {},
378
+ "source": [
379
+ "We apply our filter function to all samples of our training dataset through 🤗 Datasets' `.filter` method:"
380
+ ]
381
+ },
382
+ {
383
+ "cell_type": "code",
384
+ "execution_count": null,
385
+ "id": "333f7f6e-6053-4d3b-8924-c733c79b82ac",
386
+ "metadata": {},
387
+ "outputs": [],
388
+ "source": [
389
+ "vectorized_datasets[\"train\"] = vectorized_datasets[\"train\"].filter(\n",
390
+ " is_audio_in_length_range,\n",
391
+ " input_columns=[\"input_length\"],\n",
392
+ ")"
393
+ ]
394
+ },
395
+ {
396
+ "cell_type": "markdown",
397
+ "id": "263a5a58-0239-4a25-b0df-c625fc9c5810",
398
+ "metadata": {},
399
+ "source": [
400
+ "## Training and Evaluation"
401
+ ]
402
+ },
403
+ {
404
+ "cell_type": "markdown",
405
+ "id": "a693e768-c5a6-453f-89a1-b601dcf7daf7",
406
+ "metadata": {},
407
+ "source": [
408
+ "Now that we've prepared our data, we're ready to dive into the training pipeline. \n",
409
+ "The [🤗 Trainer](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer)\n",
410
+ "will do much of the heavy lifting for us. All we have to do is:\n",
411
+ "\n",
412
+ "- Define a data collator: the data collator takes our pre-processed data and prepares PyTorch tensors ready for the model.\n",
413
+ "\n",
414
+ "- Evaluation metrics: during evaluation, we want to evaluate the model using the [word error rate (WER)](https://huggingface.co/metrics/wer) metric. We need to define a `compute_metrics` function that handles this computation.\n",
415
+ "\n",
416
+ "- Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.\n",
417
+ "\n",
418
+ "- Define the training configuration: this will be used by the 🤗 Trainer to define the training schedule."
419
+ ]
420
+ },
421
+ {
422
+ "cell_type": "markdown",
423
+ "id": "8d230e6d-624c-400a-bbf5-fa660881df25",
424
+ "metadata": {},
425
+ "source": [
426
+ "### Define a Data Collator"
427
+ ]
428
+ },
429
+ {
430
+ "cell_type": "markdown",
431
+ "id": "04def221-0637-4a69-b242-d3f0c1d0ee78",
432
+ "metadata": {},
433
+ "source": [
434
+ "The data collator for a sequence-to-sequence speech model is unique in the sense that it \n",
435
+ "treats the `input_features` and `labels` independently: the `input_features` must be \n",
436
+ "handled by the feature extractor and the `labels` by the tokenizer.\n",
437
+ "\n",
438
+ "The `input_features` are already padded to 30s and converted to a log-Mel spectrogram \n",
439
+ "of fixed dimension by action of the feature extractor, so all we have to do is convert the `input_features`\n",
440
+ "to batched PyTorch tensors. We do this using the feature extractor's `.pad` method with `return_tensors=\"pt\"`.\n",
441
+ "\n",
442
+ "The `labels` on the other hand are un-padded. We first pad the sequences\n",
443
+ "to the maximum length in the batch using the tokenizer's `.pad` method. The padding tokens \n",
444
+ "are then replaced by `-100` so that these tokens are **not** taken into account when \n",
445
+ "computing the loss. We then cut the BOS token from the start of the label sequence as we \n",
446
+ "append it later during training.\n",
447
+ "\n",
448
+ "We can leverage the `WhisperProcessor` we defined earlier to perform both the \n",
449
+ "feature extractor and the tokenizer operations:"
450
+ ]
451
+ },
452
+ {
453
+ "cell_type": "code",
454
+ "execution_count": null,
455
+ "id": "8326221e-ec13-4731-bb4e-51e5fc1486c5",
456
+ "metadata": {},
457
+ "outputs": [],
458
+ "source": [
459
+ "import torch\n",
460
+ "\n",
461
+ "from dataclasses import dataclass\n",
462
+ "from typing import Any, Dict, List, Union\n",
463
+ "\n",
464
+ "@dataclass\n",
465
+ "class DataCollatorSpeechSeq2SeqWithPadding:\n",
466
+ " processor: Any\n",
467
+ "\n",
468
+ " def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:\n",
469
+ " # split inputs and labels since they have to be of different lengths and need different padding methods\n",
470
+ " # first treat the audio inputs by simply returning torch tensors\n",
471
+ " input_features = [{\"input_features\": feature[\"input_features\"]} for feature in features]\n",
472
+ " batch = self.processor.feature_extractor.pad(input_features, return_tensors=\"pt\")\n",
473
+ "\n",
474
+ " # get the tokenized label sequences\n",
475
+ " label_features = [{\"input_ids\": feature[\"labels\"]} for feature in features]\n",
476
+ " # pad the labels to max length\n",
477
+ " labels_batch = self.processor.tokenizer.pad(label_features, return_tensors=\"pt\")\n",
478
+ "\n",
479
+ " # replace padding with -100 to ignore loss correctly\n",
480
+ " labels = labels_batch[\"input_ids\"].masked_fill(labels_batch.attention_mask.ne(1), -100)\n",
481
+ "\n",
482
+ " # if bos token is appended in previous tokenization step,\n",
483
+ " # cut bos token here as it's appended later anyway\n",
484
+ " if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():\n",
485
+ " labels = labels[:, 1:]\n",
486
+ "\n",
487
+ " batch[\"labels\"] = labels\n",
488
+ "\n",
489
+ " return batch"
490
+ ]
491
+ },
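The label handling in the collator above can be illustrated without a model or tokenizer. The following is a minimal sketch that uses plain Python lists in place of PyTorch tensors. The function name `pad_and_mask_labels` and the token ids used with it are made up for illustration; they are not part of the notebook or of 🤗 Transformers.

```python
from typing import List

# value ignored by the PyTorch cross-entropy loss, as used in the collator above
IGNORE_INDEX = -100


def pad_and_mask_labels(label_seqs: List[List[int]], bos_token_id: int) -> List[List[int]]:
    """Pad label sequences to the batch maximum, mask padding, strip a shared BOS.

    Hypothetical helper: plain lists stand in for the tensors returned by
    `tokenizer.pad(..., return_tensors="pt")`.
    """
    max_len = max(len(seq) for seq in label_seqs)

    # right-pad every sequence; padded positions are written directly as -100
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]

    # drop the first token only if *every* sequence starts with BOS
    if all(row and row[0] == bos_token_id for row in padded):
        padded = [row[1:] for row in padded]

    return padded


# hypothetical token ids: 1 plays the role of BOS
batch = pad_and_mask_labels([[1, 5, 6], [1, 7]], bos_token_id=1)
print(batch)
```

The BOS token is removed only when all sequences in the batch start with it, which mirrors the `.all()` check in the collator.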
492
+ {
493
+ "cell_type": "markdown",
494
+ "id": "3cae7dbf-8a50-456e-a3a8-7fd005390f86",
495
+ "metadata": {},
496
+ "source": [
497
+ "Let's initialise the data collator we've just defined:"
498
+ ]
499
+ },
500
+ {
501
+ "cell_type": "code",
502
+ "execution_count": null,
503
+ "id": "fc834702-c0d3-4a96-b101-7b87be32bf42",
504
+ "metadata": {},
505
+ "outputs": [],
506
+ "source": [
507
+ "data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)"
508
+ ]
509
+ },
510
+ {
511
+ "cell_type": "markdown",
512
+ "id": "d62bb2ab-750a-45e7-82e9-61d6f4805698",
513
+ "metadata": {},
514
+ "source": [
515
+ "### Evaluation Metrics"
516
+ ]
517
+ },
518
+ {
519
+ "cell_type": "markdown",
520
+ "id": "66fee1a7-a44c-461e-b047-c3917221572e",
521
+ "metadata": {},
522
+ "source": [
523
+ "We'll use the word error rate (WER) metric, the 'de-facto' metric for assessing \n",
524
+ "ASR systems. For more information, refer to the WER [docs](https://huggingface.co/metrics/wer). We'll load the WER metric from 🤗 Evaluate:"
525
+ ]
526
+ },
527
+ {
528
+ "cell_type": "code",
529
+ "execution_count": null,
530
+ "id": "b22b4011-f31f-4b57-b684-c52332f92890",
531
+ "metadata": {},
532
+ "outputs": [],
533
+ "source": [
534
+ "import evaluate\n",
535
+ "\n",
536
+ "metric = evaluate.load(\"wer\")"
537
+ ]
538
+ },
539
+ {
540
+ "cell_type": "markdown",
541
+ "id": "509f96d7-3f11-4f37-add9-f74a0c44f3fc",
542
+ "metadata": {},
543
+ "source": [
544
+ "We then simply have to define a function that takes our model \n",
545
+ "predictions and returns the WER metric. This function, called\n",
546
+ "`compute_metrics`, first replaces `-100` with the `pad_token_id`\n",
547
+ "in the `label_ids` (undoing the step we applied in the \n",
548
+ "data collator to ignore padded tokens correctly in the loss).\n",
549
+ "It then decodes the predicted and label ids to strings. Finally,\n",
550
+ "it computes the WER between the predictions and reference labels. \n",
551
+ "Here, we have the option of evaluating with the 'normalised' transcriptions \n",
552
+ "and predictions. We recommend you set `do_normalize_eval` to `True` to benefit from the WER \n",
553
+ "improvement obtained by normalising the transcriptions."
554
+ ]
555
+ },
556
+ {
557
+ "cell_type": "code",
558
+ "execution_count": null,
559
+ "id": "a11d1bfc-9e28-460f-a287-72d8f7bc1acb",
560
+ "metadata": {},
561
+ "outputs": [],
562
+ "source": [
563
+ "# evaluate with the 'normalised' WER\n",
564
+ "do_normalize_eval = True\n",
565
+ "\n",
566
+ "def compute_metrics(pred):\n",
567
+ " pred_ids = pred.predictions\n",
568
+ " label_ids = pred.label_ids\n",
569
+ "\n",
570
+ " # replace -100 with the pad_token_id\n",
571
+ " label_ids[label_ids == -100] = processor.tokenizer.pad_token_id\n",
572
+ "\n",
573
+ " # we do not want to group tokens when computing the metrics\n",
574
+ " pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)\n",
575
+ " label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)\n",
576
+ "\n",
577
+ " if do_normalize_eval:\n",
578
+ " pred_str = [normalizer(pred) for pred in pred_str]\n",
579
+ " label_str = [normalizer(label) for label in label_str]\n",
580
+ " # filtering step to only evaluate the samples that correspond to non-zero references:\n",
581
+ " pred_str = [pred_str[i] for i in range(len(pred_str)) if len(label_str[i]) > 0]\n",
582
+ " label_str = [label_str[i] for i in range(len(label_str)) if len(label_str[i]) > 0]\n",
583
+ "\n",
584
+ " wer = 100 * metric.compute(predictions=pred_str, references=label_str)\n",
585
+ "\n",
586
+ " return {\"wer\": wer}"
587
+ ]
588
+ },
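To make the WER number returned by `compute_metrics` more concrete, here is a minimal pure-Python sketch of a corpus-level word error rate: the word-level edit distance (substitutions, insertions, deletions) summed over all samples, divided by the total number of reference words. This is an illustrative approximation, not the implementation used by 🤗 Evaluate, and the names `word_errors` and `corpus_wer` are hypothetical.

```python
from typing import List


def word_errors(ref_words: List[str], hyp_words: List[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    # prev[j] = edits needed to turn the first i-1 reference words into the first j hypothesis words
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    curr[j - 1] + 1,  # insertion of a hypothesis word
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], predictions: List[str]) -> float:
    """Total word errors divided by total reference words (assumes non-empty references)."""
    total_errors = 0
    total_ref_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        total_errors += word_errors(ref_words, prediction.split())
        total_ref_words += len(ref_words)
    return total_errors / total_ref_words


# one substituted word out of three reference words
print(100 * corpus_wer(["the cat sat"], ["the dog sat"]))
```

Because the denominator is the number of reference words, empty references have to be filtered out beforehand, which is what the filtering step in `compute_metrics` does.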
589
+ {
590
+ "cell_type": "markdown",
591
+ "id": "daf2a825-6d9f-4a23-b145-c37c0039075b",
592
+ "metadata": {},
593
+ "source": [
594
+ "### Load a Pre-Trained Checkpoint"
595
+ ]
596
+ },
597
+ {
598
+ "cell_type": "markdown",
599
+ "id": "437a97fa-4864-476b-8abc-f28b8166cfa5",
600
+ "metadata": {},
601
+ "source": [
602
+ "Now let's load the pre-trained Whisper `small` checkpoint. Again, this \n",
603
+ "is trivial through use of 🤗 Transformers!"
604
+ ]
605
+ },
606
+ {
607
+ "cell_type": "code",
608
+ "execution_count": null,
609
+ "id": "5a10cc4b-07ec-4ebd-ac1d-7c601023594f",
610
+ "metadata": {},
611
+ "outputs": [],
612
+ "source": [
613
+ "from transformers import WhisperForConditionalGeneration\n",
614
+ "\n",
615
+ "model = WhisperForConditionalGeneration.from_pretrained(\"openai/whisper-small\")"
616
+ ]
617
+ },
618
+ {
619
+ "cell_type": "markdown",
620
+ "id": "a15ead5f-2277-4a39-937b-585c2497b2df",
621
+ "metadata": {},
622
+ "source": [
623
+ "Override generation arguments - no tokens are forced as decoder outputs (see [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)), no tokens are suppressed during generation (see [`suppress_tokens`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.suppress_tokens)). Set `use_cache` to False since we're using gradient checkpointing, and the two are incompatible:"
624
+ ]
625
+ },
626
+ {
627
+ "cell_type": "code",
628
+ "execution_count": null,
629
+ "id": "62038ba3-88ed-4fce-84db-338f50dcd04f",
630
+ "metadata": {},
631
+ "outputs": [],
632
+ "source": [
633
+ "model.config.forced_decoder_ids = None\n",
634
+ "model.config.suppress_tokens = []\n",
635
+ "model.config.use_cache = False"
636
+ ]
637
+ },
638
+ {
639
+ "cell_type": "markdown",
640
+ "id": "2178dea4-80ca-47b6-b6ea-ba1915c90c06",
641
+ "metadata": {},
642
+ "source": [
643
+ "### Define the Training Configuration"
644
+ ]
645
+ },
646
+ {
647
+ "cell_type": "markdown",
648
+ "id": "c21af1e9-0188-4134-ac82-defc7bdcc436",
649
+ "metadata": {},
650
+ "source": [
651
+ "In the final step, we define all the parameters related to training. Here, you can set the `max_steps` to train for longer. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments)."
652
+ ]
653
+ },
654
+ {
655
+ "cell_type": "code",
656
+ "execution_count": null,
657
+ "id": "0ae3e9af-97b7-4aa0-ae85-20b23b5bcb3a",
658
+ "metadata": {},
659
+ "outputs": [],
660
+ "source": [
661
+ "from transformers import Seq2SeqTrainingArguments\n",
662
+ "\n",
663
+ "training_args = Seq2SeqTrainingArguments(\n",
664
+ " output_dir=\"./\",\n",
665
+ " per_device_train_batch_size=64,\n",
666
+ " gradient_accumulation_steps=1, # increase by 2x for every 2x decrease in batch size\n",
667
+ " learning_rate=1e-5,\n",
668
+ " warmup_steps=500,\n",
669
+ " max_steps=5000,\n",
670
+ " gradient_checkpointing=True,\n",
671
+ " fp16=True,\n",
672
+ " evaluation_strategy=\"steps\",\n",
673
+ " per_device_eval_batch_size=8,\n",
674
+ " predict_with_generate=True,\n",
675
+ " generation_max_length=225,\n",
676
+ " save_steps=1000,\n",
677
+ " eval_steps=1000,\n",
678
+ " logging_steps=25,\n",
679
+ " report_to=[\"tensorboard\"],\n",
680
+ " load_best_model_at_end=True,\n",
681
+ " metric_for_best_model=\"wer\",\n",
682
+ " greater_is_better=False,\n",
683
+ " push_to_hub=True,\n",
684
+ ")"
685
+ ]
686
+ },
687
+ {
688
+ "cell_type": "markdown",
689
+ "id": "b3a944d8-3112-4552-82a0-be25988b3857",
690
+ "metadata": {},
691
+ "source": [
692
+ "**Note**: if one does not want to upload the model checkpoints to the Hub, \n",
693
+ "set `push_to_hub=False`."
694
+ ]
695
+ },
696
+ {
697
+ "cell_type": "markdown",
698
+ "id": "393c883e-3e50-492c-bd58-f51dbf15ee56",
699
+ "metadata": {},
700
+ "source": [
701
+ "We then define a custom [Callback](https://huggingface.co/docs/transformers/main_classes/callback) that the 🤗 Trainer calls at the beginning of each epoch. The Callback reinitialises and reshuffles the streaming dataset, which gives a different shuffling across our subsets for every epoch."
702
+ ]
703
+ },
704
+ {
705
+ "cell_type": "code",
706
+ "execution_count": null,
707
+ "id": "3ac16b62-b3c0-4c68-8f3d-9ecf471534b2",
708
+ "metadata": {},
709
+ "outputs": [],
710
+ "source": [
711
+ "from transformers import TrainerCallback\n",
712
+ "from transformers.trainer_pt_utils import IterableDatasetShard\n",
713
+ "from torch.utils.data import IterableDataset\n",
714
+ "\n",
715
+ "# trainer callback to reinitialise and reshuffle the streamable datasets at the beginning of each epoch\n",
716
+ "class ShuffleCallback(TrainerCallback):\n",
717
+ " def on_epoch_begin(self, args, state, control, train_dataloader, **kwargs):\n",
718
+ " if isinstance(train_dataloader.dataset, IterableDatasetShard):\n",
719
+ " pass # set_epoch() is handled by the Trainer\n",
720
+ " elif isinstance(train_dataloader.dataset, IterableDataset):\n",
721
+ " train_dataloader.dataset.set_epoch(train_dataloader.dataset._epoch + 1)"
722
+ ]
723
+ },
724
+ {
725
+ "cell_type": "markdown",
726
+ "id": "bac29114-d226-4f54-97cf-8718c9f94e1e",
727
+ "metadata": {},
728
+ "source": [
729
+ "We can forward the training arguments to the 🤗 Trainer along with our model,\n",
730
+ "dataset, data collator, `compute_metrics` function and custom callback:"
731
+ ]
732
+ },
733
+ {
734
+ "cell_type": "code",
735
+ "execution_count": null,
736
+ "id": "d546d7fe-0543-479a-b708-2ebabec19493",
737
+ "metadata": {},
738
+ "outputs": [],
739
+ "source": [
740
+ "from transformers import Seq2SeqTrainer\n",
741
+ "\n",
742
+ "trainer = Seq2SeqTrainer(\n",
743
+ " args=training_args,\n",
744
+ " model=model,\n",
745
+ " train_dataset=vectorized_datasets[\"train\"],\n",
746
+ " eval_dataset=vectorized_datasets[\"test\"],\n",
747
+ " data_collator=data_collator,\n",
748
+ " compute_metrics=compute_metrics,\n",
749
+ " tokenizer=processor,\n",
750
+ " callbacks=[ShuffleCallback()],\n",
751
+ ")"
752
+ ]
753
+ },
754
+ {
755
+ "cell_type": "markdown",
756
+ "id": "67ab88c3-7091-4e51-8ad5-f5cacbe18449",
757
+ "metadata": {},
758
+ "source": [
759
+ "We'll save the model and processor to the output directory before training:"
760
+ ]
761
+ },
762
+ {
763
+ "cell_type": "code",
764
+ "execution_count": null,
765
+ "id": "a1ccb9ed-cbc8-4419-91c0-651e9424b672",
766
+ "metadata": {},
767
+ "outputs": [],
768
+ "source": [
769
+ "model.save_pretrained(training_args.output_dir)\n",
770
+ "processor.save_pretrained(training_args.output_dir)"
771
+ ]
772
+ },
773
+ {
774
+ "cell_type": "markdown",
775
+ "id": "7f404cf9-4345-468c-8196-4bd101d9bd51",
776
+ "metadata": {},
777
+ "source": [
778
+ "### Training"
779
+ ]
780
+ },
781
+ {
782
+ "cell_type": "markdown",
783
+ "id": "5e8b8d56-5a70-4f68-bd2e-f0752d0bd112",
784
+ "metadata": {},
785
+ "source": [
786
+ "Training will take approximately 5-10 hours depending on your GPU. The peak GPU memory for the given training configuration is approximately 36GB. \n",
787
+ "Depending on your GPU, it is possible that you will encounter a CUDA `\"out-of-memory\"` error when you launch training. \n",
788
+ "In this case, you can reduce the `per_device_train_batch_size` incrementally by factors of 2 \n",
789
+ "and employ [`gradient_accumulation_steps`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments.gradient_accumulation_steps)\n",
790
+ "to compensate.\n",
791
+ "\n",
792
+ "To launch training, simply execute:"
793
+ ]
794
+ },
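The trade-off between `per_device_train_batch_size` and `gradient_accumulation_steps` described above is simple arithmetic. The sketch below assumes a single device by default; the helper names `halve_batch_size` and `effective_batch_size` are hypothetical and only illustrate how the two values can be adjusted together so that the effective batch size stays the same.

```python
from typing import Tuple


def halve_batch_size(per_device_train_batch_size: int, gradient_accumulation_steps: int) -> Tuple[int, int]:
    """Halve the per-device batch size and double the accumulation steps to compensate."""
    if per_device_train_batch_size % 2 != 0:
        raise ValueError("per_device_train_batch_size must be even to be halved")
    return per_device_train_batch_size // 2, gradient_accumulation_steps * 2


def effective_batch_size(per_device_train_batch_size: int, gradient_accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to each optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# start from the configuration used in this notebook: batch size 64, no accumulation
batch_size, accumulation_steps = 64, 1
for _ in range(2):
    batch_size, accumulation_steps = halve_batch_size(batch_size, accumulation_steps)
    print(batch_size, accumulation_steps, effective_batch_size(batch_size, accumulation_steps))
```

The resulting pair can then be passed to `Seq2SeqTrainingArguments` in place of the original values.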
795
+ {
796
+ "cell_type": "code",
797
+ "execution_count": null,
798
+ "id": "ee8b7b8e-1c9a-4d77-9137-1778a629e6de",
799
+ "metadata": {},
800
+ "outputs": [],
801
+ "source": [
802
+ "trainer.train()"
803
+ ]
804
+ },
805
+ {
806
+ "cell_type": "markdown",
807
+ "id": "747c6a6e",
808
+ "metadata": {
809
+ "pycharm": {
810
+ "name": "#%% md\n"
811
+ }
812
+ },
813
+ "source": [
814
+ "(note that training may take some time to commence as we load the first training data samples with streaming mode)"
815
+ ]
816
+ },
817
+ {
818
+ "cell_type": "markdown",
819
+ "id": "810ced54-7187-4a06-b2fe-ba6dcca94dc3",
820
+ "metadata": {},
821
+ "source": [
822
+ "We can label our checkpoint with the `whisper-event` tag on push by setting the appropriate key-word arguments (kwargs):"
823
+ ]
824
+ },
825
+ {
826
+ "cell_type": "code",
827
+ "execution_count": null,
828
+ "id": "6dd0e310-9b07-4133-ac14-2ed2d7524e22",
829
+ "metadata": {},
830
+ "outputs": [],
831
+ "source": [
832
+ "kwargs = {\n",
833
+ " \"dataset_tags\": \"mozilla-foundation/common_voice_11_0\",\n",
834
+ " \"dataset\": \"Common Voice 11.0\", # a 'pretty' name for the training dataset\n",
835
+ " \"language\": \"es\",\n",
836
+ " \"model_name\": \"Whisper Small Es - Sanchit Gandhi\", # a 'pretty' name for your model\n",
837
+ " \"finetuned_from\": \"openai/whisper-small\",\n",
838
+ " \"tasks\": \"automatic-speech-recognition\",\n",
839
+ " \"tags\": \"whisper-event\",\n",
840
+ "}"
841
+ ]
842
+ },
843
+ {
844
+ "cell_type": "markdown",
845
+ "id": "090d676a-f944-4297-a938-a40eda0b2b68",
846
+ "metadata": {},
847
+ "source": [
848
+ "The training results can now be uploaded to the Hub. To do so, execute the `push_to_hub` command:"
849
+ ]
850
+ },
851
+ {
852
+ "cell_type": "code",
853
+ "execution_count": null,
854
+ "id": "95737cda-c5dd-4887-a4d0-dfcb0d61d977",
855
+ "metadata": {},
856
+ "outputs": [],
857
+ "source": [
858
+ "trainer.push_to_hub(**kwargs)"
859
+ ]
860
+ }
861
+ ],
862
+ "metadata": {
863
+ "kernelspec": {
864
+ "display_name": "Python 3 (ipykernel)",
865
+ "language": "python",
866
+ "name": "python3"
867
+ },
868
+ "language_info": {
869
+ "codemirror_mode": {
870
+ "name": "ipython",
871
+ "version": 3
872
+ },
873
+ "file_extension": ".py",
874
+ "mimetype": "text/x-python",
875
+ "name": "python",
876
+ "nbconvert_exporter": "python",
877
+ "pygments_lexer": "ipython3",
878
+ "version": "3.8.9"
879
+ }
880
+ },
881
+ "nbformat": 4,
882
+ "nbformat_minor": 5
883
+ }
interleave_streaming_datasets.ipynb ADDED
@@ -0,0 +1,358 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "id": "6a5c0357",
7
+ "metadata": {
8
+ "collapsed": false,
9
+ "jupyter": {
10
+ "outputs_hidden": false
11
+ },
12
+ "pycharm": {
13
+ "name": "#%%\n"
14
+ }
15
+ },
16
+ "outputs": [],
17
+ "source": [
18
+ "# Ensure datasets is installed from main. Uncomment the following line if you face issues running this script:\n",
19
+ "# !pip install git+https://github.com/huggingface/datasets"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "code",
24
+ "execution_count": 2,
25
+ "id": "794aaced",
26
+ "metadata": {
27
+ "collapsed": false,
28
+ "jupyter": {
29
+ "outputs_hidden": false
30
+ },
31
+ "pycharm": {
32
+ "name": "#%%\n"
33
+ }
34
+ },
35
+ "outputs": [],
36
+ "source": [
37
+ "from datasets import Audio, interleave_datasets, IterableDataset, load_dataset\n",
38
+ "from typing import List, Optional"
39
+ ]
40
+ },
41
+ {
42
+ "cell_type": "markdown",
43
+ "id": "f210ca9a-486b-46a2-a675-2526a9bd83f5",
44
+ "metadata": {},
45
+ "source": [
46
+ "### Define the dataset attributes"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "markdown",
51
+ "id": "fc07293f-3ba4-4e89-a4ca-8e39409a8373",
52
+ "metadata": {},
53
+ "source": [
54
+ "In this example, we'll show how to combine the Common Voice 11, VoxPopuli, Multilingual LibriSpeech and FLEURS datasets for Spanish, giving a training corpus equal to the sum of the individual datasets. This is particularly beneficial in low-resource settings, where any one of the datasets alone might have insufficient data to train a model.\n",
55
+ "\n",
56
+ "We need to specify the dataset names on the Hub, the corresponding configs and finally the text column names for the transcriptions:"
57
+ ]
58
+ },
59
+ {
60
+ "cell_type": "code",
61
+ "execution_count": 3,
62
+ "id": "c53344f3-c315-430a-a2f3-57aea6bb0e17",
63
+ "metadata": {},
64
+ "outputs": [],
65
+ "source": [
66
+ "dataset_names = [\"mozilla-foundation/common_voice_11_0\", \"facebook/voxpopuli\", \"facebook/multilingual_librispeech\", \"google/fleurs\"]\n",
67
+ "dataset_config_names = [\"es\", \"es\", \"spanish\", \"es_419\"]\n",
68
+ "text_column_names = [\"sentence\", \"normalized_text\", \"text\", \"transcription\"]"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "markdown",
73
+ "id": "215541f6-ee1c-4104-b43c-fa3f7fce0494",
74
+ "metadata": {},
75
+ "source": [
76
+ "### Define the merging function"
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "markdown",
81
+ "id": "b722a48b-c576-4a63-b2a2-3c264890a75f",
82
+ "metadata": {},
83
+ "source": [
84
+ "We define a function, `load_multiple_streaming_datasets`, that takes as arguments lists of dataset names, configs, splits (optional) and text column names (optional). It resamples the audio in each dataset to a specified sampling rate and interleaves the datasets, giving one merged dataset. This is all \n",
85
+ "done in _streaming mode_: as we iterate over the merged dataset we load samples one-by-one on the fly. No data is\n",
86
+ "saved to disk.\n",
87
+ "\n",
88
+ "We can also specify our strategy for interleaving datasets. The default strategy, `all_exhausted`, is an oversampling \n",
89
+ "strategy. In this case, the dataset construction is stopped as soon as every sample in every dataset \n",
90
+ "has been added at least once. In practice, this means that if a dataset is exhausted, it will return to the \n",
91
+ "beginning of this dataset until the stop criterion has been reached. You can specify `stopping_strategy=\"first_exhausted\"` \n",
92
+ "for a subsampling strategy, i.e. the dataset construction is stopped as soon as one of the datasets runs out of samples. "
93
+ ]
94
+ },
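The difference between the two stopping strategies is easiest to see on small in-memory lists. The sketch below is a simplified model of round-robin interleaving without sampling probabilities. It is not the 🤗 Datasets implementation, which works lazily on streams, and the function name `interleave_lists` is hypothetical.

```python
from typing import List, Sequence


def interleave_lists(sources: Sequence[Sequence], stopping_strategy: str = "all_exhausted") -> List:
    """Round-robin interleave of in-memory sequences (simplified illustration)."""
    lengths = [len(source) for source in sources]
    if not lengths or min(lengths) == 0:
        raise ValueError("every source must contain at least one sample")

    if stopping_strategy == "first_exhausted":
        # subsampling: stop once the shortest source has been fully consumed
        rounds = min(lengths)
    elif stopping_strategy == "all_exhausted":
        # oversampling: continue until the longest source has been fully consumed,
        # restarting shorter sources from their beginning when they run out
        rounds = max(lengths)
    else:
        raise ValueError(f"unknown stopping strategy: {stopping_strategy}")

    merged = []
    for round_index in range(rounds):
        for source, length in zip(sources, lengths):
            merged.append(source[round_index % length])
    return merged


short, long = [0, 1, 2], [10, 11, 12, 13]
print(interleave_lists([short, long], "first_exhausted"))
print(interleave_lists([short, long], "all_exhausted"))
```

With `all_exhausted`, samples from the shorter source are repeated, so smaller datasets are over-represented relative to their size; with `first_exhausted`, samples from the longer sources are dropped.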
95
+ {
96
+ "cell_type": "code",
97
+ "execution_count": 4,
98
+ "id": "61eb4cb1-ee27-4270-a474-1bb33e1df65f",
99
+ "metadata": {},
100
+ "outputs": [],
101
+ "source": [
102
+ "def load_multiple_streaming_datasets(\n",
103
+ " dataset_names: List,\n",
104
+ " dataset_config_names: List,\n",
105
+ " splits: Optional[List] = None,\n",
106
+ " text_column_names: Optional[List] = None,\n",
107
+ " sampling_rate: Optional[int] = 16000,\n",
108
+ " stopping_strategy: Optional[str] = \"all_exhausted\",\n",
109
+ " **kwargs\n",
110
+ ") -> IterableDataset:\n",
111
+ "\n",
112
+ " if len(dataset_names) != len(dataset_config_names):\n",
113
+ " raise ValueError(\n",
114
+ " f\"Ensure one config is passed for each dataset, got {len(dataset_names)} datasets and\"\n",
115
+ " f\" {len(dataset_config_names)} configs.\"\n",
116
+ " )\n",
117
+ "\n",
118
+ " if splits is not None and len(splits) != len(dataset_names):\n",
119
+ " raise ValueError(\n",
120
+ " f\"Ensure one split is passed for each dataset, got {len(dataset_names)} datasets and {len(splits)} splits.\"\n",
121
+ " )\n",
122
+ "\n",
123
+ " if text_column_names is not None and len(text_column_names) != len(dataset_names):\n",
124
+ " raise ValueError(\n",
125
+ " f\"Ensure one text column name is passed for each dataset, got {len(dataset_names)} datasets and\"\n",
126
+ " f\" {len(text_column_names)} text column names.\"\n",
127
+ " )\n",
128
+ "\n",
129
+ " splits = splits if splits is not None else [\"train\" for i in range(len(dataset_names))]\n",
130
+ " text_column_names = (\n",
131
+ " text_column_names if text_column_names is not None else [\"text\" for i in range(len(dataset_names))]\n",
132
+ " )\n",
133
+ "\n",
134
+ " all_datasets = []\n",
135
+ " # iterate over the datasets we want to interleave\n",
136
+ " for i, dataset_name in enumerate(dataset_names):\n",
137
+ " dataset = load_dataset(dataset_name, dataset_config_names[i], split=splits[i], streaming=True, **kwargs)\n",
138
+ " # resample to specified sampling rate\n",
139
+ " dataset = dataset.cast_column(\"audio\", Audio(sampling_rate))\n",
140
+ " # normalise columns to [\"audio\", \"sentence\"]\n",
141
+ " if text_column_names[i] != \"sentence\":\n",
142
+ " dataset = dataset.rename_column(text_column_names[i], \"sentence\")\n",
143
+ " dataset = dataset.remove_columns(set(dataset.features.keys()) - set([\"audio\", \"sentence\"]))\n",
144
+ " all_datasets.append(dataset)\n",
145
+ "\n",
146
+ " interleaved_dataset = interleave_datasets(all_datasets, stopping_strategy=stopping_strategy)\n",
147
+ " return interleaved_dataset"
148
+ ]
149
+ },
150
+ {
151
+ "cell_type": "markdown",
152
+ "id": "29bc228b-ce9b-4cee-9092-1223ddfa51ad",
153
+ "metadata": {},
154
+ "source": [
155
+ "Let's apply this function to load and merge our four datasets:"
156
+ ]
157
+ },
158
+ {
159
+ "cell_type": "code",
160
+ "execution_count": 5,
161
+ "id": "8ae90f83-4ecd-46a3-98be-bd75706e0d88",
162
+ "metadata": {},
163
+ "outputs": [],
164
+ "source": [
165
+ "ds = load_multiple_streaming_datasets(dataset_names, dataset_config_names=dataset_config_names, text_column_names=text_column_names, use_auth_token=True)"
166
+ ]
167
+ },
168
+ {
169
+ "cell_type": "markdown",
170
+ "id": "6056a693-1fb0-45f4-ad43-be5f1812c1a5",
171
+ "metadata": {},
172
+ "source": [
173
+ "### Iterate over the dataset"
174
+ ]
175
+ },
176
+ {
177
+ "cell_type": "markdown",
178
+ "id": "7ffe011f-f905-4027-ab67-5c9c3b2b5ac0",
179
+ "metadata": {},
180
+ "source": [
181
+ "We iterate over the dataset, loading and merging samples on the fly. Let's print the transcriptions for the first 10 samples of our merged dataset:"
182
+ ]
183
+ },
184
+ {
185
+ "cell_type": "code",
186
+ "execution_count": 6,
187
+ "id": "75b3355a-3c06-4d23-af43-2b93b1ad70b2",
188
+ "metadata": {},
189
+ "outputs": [
190
+ {
191
+ "name": "stderr",
192
+ "output_type": "stream",
193
+ "text": [
194
+ "Reading metadata...: 230467it [00:41, 5545.80it/s]\n"
195
+ ]
196
+ },
197
+ {
198
+ "name": "stdout",
199
+ "output_type": "stream",
200
+ "text": [
201
+ "0 ¿ Qué tal a tres de cinco ?\n",
202
+ "1 y desde luego esa razón no puede tener que ver con la explicación surrealista que hemos escuchado más de una vez de que se trata de una conspiración izquierdista.\n",
203
+ "2 para exclamar con voz de acción de gracias y para contar todas tus maravillas jehová la habitación de tu casa he amado y el lugar del tabernáculo de tu gloria no juntes con los pecadores mi alma ni con los hombres de sangres mi vida\n",
204
+ "3 el uso de internet y de la red informática mundial permite que los estudiantes tengan acceso a la información en todo momento\n",
205
+ "4 vamos , quiero decir , que no soy de citas especiales .\n",
206
+ "5 si bien esta lista no es perfecta sí que resulta necesario que las entidades financieras refuercen sus controles.\n",
207
+ "6 oye oh jehová mi voz con que á ti clamo y ten misericordia de mí respóndeme mi corazón ha dicho de ti buscad mi rostro tu rostro buscaré oh jehová\n",
208
+ "7 los deportes de nieve en descenso como el esquí y la tablanieve son disciplinas populares que consisten en deslizarse con esquís o una tabla fijada a los pies sobre un terreno nevado\n",
209
+ "8 fray Lope , en aquel momento , colmaba otro vaso igual :\n",
210
+ "9 señora presidenta la competitividad es importante pero no puede ser el único criterio.\n"
211
+ ]
212
+ }
213
+ ],
214
+ "source": [
215
+ "for i, sample in enumerate(ds):\n",
216
+ " print(i, sample[\"sentence\"])\n",
217
+ " if i == 9:\n",
218
+ " break"
219
+ ]
220
+ },
221
+ {
222
+ "cell_type": "markdown",
223
+ "id": "42d5ad08-b20e-4cba-a1a9-909fdbf030d4",
224
+ "metadata": {},
225
+ "source": [
226
+ "We can see that the transcriptions take several different formats. Those from Common Voice 11 are cased and punctuated. Those from VoxPopuli are punctuated only. Those from Multilingual LibriSpeech and FLEURS are neither cased nor punctuated. We need to normalise the transcriptions to a uniform format before training our model. \n",
227
+ "\n",
228
+ "The following code cell is lifted from the Whisper training notebook: https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine-tune-whisper-streaming.ipynb"
229
+ ]
230
+ },
231
+ {
232
+ "cell_type": "code",
233
+ "execution_count": 7,
234
+ "id": "ed20e9cd-31c2-44cb-872b-333378a92fd1",
235
+ "metadata": {},
236
+ "outputs": [
237
+ {
238
+ "name": "stderr",
239
+ "output_type": "stream",
240
+ "text": [
241
+ "/Users/sanchitgandhi/venv/lib/python3.8/site-packages/jax/_src/lib/__init__.py:33: UserWarning: JAX on Mac ARM machines is experimental and minimally tested. Please see https://github.com/google/jax/issues/5501 in the event of problems.\n",
242
+ " warnings.warn(\"JAX on Mac ARM machines is experimental and minimally tested. \"\n"
243
+ ]
244
+ }
245
+ ],
246
+ "source": [
247
+ "from transformers.models.whisper.english_normalizer import BasicTextNormalizer\n",
248
+ "\n",
249
+ "do_lower_case = True\n",
250
+ "do_remove_punctuation = True\n",
251
+ "\n",
252
+ "normalizer = BasicTextNormalizer()"
253
+ ]
254
+ },
255
+ {
256
+ "cell_type": "markdown",
257
+ "id": "01d13029-c24f-4a51-aff2-9251a2ceb4ce",
258
+ "metadata": {},
259
+ "source": [
260
+ "Now we define a function to normalise our transcriptions:"
261
+ ]
262
+ },
263
+ {
264
+ "cell_type": "code",
265
+ "execution_count": 8,
266
+ "id": "26e42417-4bd2-46f8-914e-3a6f9f3471ac",
267
+ "metadata": {},
268
+ "outputs": [],
269
+ "source": [
270
+ "def normalize_transcriptions(batch):\n",
271
+ " # optional pre-processing steps\n",
272
+ " transcription = batch[\"sentence\"]\n",
273
+ " if do_lower_case:\n",
274
+ " transcription = transcription.lower()\n",
275
+ " if do_remove_punctuation:\n",
276
+ " transcription = normalizer(transcription).strip()\n",
277
+ " batch[\"sentence\"] = transcription\n",
278
+ " return batch"
279
+ ]
280
+ },
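For readers who want to see the effect of the two flags without installing 🤗 Transformers, the following is a rough stand-in for the normalisation step using only the standard library. It is an assumption-laden simplification and does not reproduce `BasicTextNormalizer` exactly: it lowercases the text, replaces every Unicode punctuation character with a space, and collapses whitespace. The function name `simple_normalize` is hypothetical.

```python
import unicodedata


def simple_normalize(text: str, do_lower_case: bool = True, do_remove_punctuation: bool = True) -> str:
    """Simplified transcription normaliser (not equivalent to BasicTextNormalizer)."""
    if do_lower_case:
        text = text.lower()
    if do_remove_punctuation:
        # Unicode categories starting with "P" cover punctuation such as ¿ ? , . :
        text = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    # collapse runs of whitespace and strip both ends (applied in every case)
    return " ".join(text.split())


print(simple_normalize("¿ Qué tal a tres de cinco ?"))
print(simple_normalize("fray Lope , en aquel momento , colmaba otro vaso igual :"))
```

Accented characters such as `é` are letters rather than punctuation, so they are kept.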
281
+ {
282
+ "cell_type": "markdown",
283
+ "id": "3b1c67fe-be4b-4ee5-9a1f-0d444f2b5c62",
284
+ "metadata": {},
285
+ "source": [
286
+ "Let's apply the data pre-processing steps to our dataset and view the first 10 samples again:"
287
+ ]
288
+ },
289
+ {
290
+ "cell_type": "code",
291
+ "execution_count": 9,
292
+ "id": "0babac71-9157-4d0f-a8a8-184547bdf501",
293
+ "metadata": {},
294
+ "outputs": [
295
+ {
296
+ "name": "stderr",
297
+ "output_type": "stream",
298
+ "text": [
299
+ "Reading metadata...: 230467it [00:32, 6984.59it/s] \n"
300
+ ]
301
+ },
302
+ {
303
+ "name": "stdout",
304
+ "output_type": "stream",
305
+ "text": [
306
+ "0 qué tal a tres de cinco \n",
307
+ "1 y desde luego esa razón no puede tener que ver con la explicación surrealista que hemos escuchado más de una vez de que se trata de una conspiración izquierdista \n",
308
+ "2 para exclamar con voz de acción de gracias y para contar todas tus maravillas jehová la habitación de tu casa he amado y el lugar del tabernáculo de tu gloria no juntes con los pecadores mi alma ni con los hombres de sangres mi vida\n",
309
+ "3 el uso de internet y de la red informática mundial permite que los estudiantes tengan acceso a la información en todo momento\n",
310
+ "4 vamos quiero decir que no soy de citas especiales \n",
311
+ "5 si bien esta lista no es perfecta sí que resulta necesario que las entidades financieras refuercen sus controles \n",
312
+ "6 oye oh jehová mi voz con que á ti clamo y ten misericordia de mí respóndeme mi corazón ha dicho de ti buscad mi rostro tu rostro buscaré oh jehová\n",
313
+ "7 los deportes de nieve en descenso como el esquí y la tablanieve son disciplinas populares que consisten en deslizarse con esquís o una tabla fijada a los pies sobre un terreno nevado\n",
314
+ "8 fray lope en aquel momento colmaba otro vaso igual \n",
315
+ "9 señora presidenta la competitividad es importante pero no puede ser el único criterio \n"
316
+ ]
317
+ }
318
+ ],
319
+ "source": [
320
+ "ds = ds.map(normalize_transcriptions)\n",
321
+ "\n",
322
+ "for i, sample in enumerate(ds):\n",
323
+ " print(i, sample[\"sentence\"])\n",
324
+ " if i == 9:\n",
325
+ " break"
326
+ ]
327
+ },
328
+ {
329
+ "cell_type": "markdown",
330
+ "id": "d135627a-a7aa-458c-94b8-57ddeae74a72",
331
+ "metadata": {},
332
+ "source": [
333
+ "This time the transcriptions are in a consistent format. We can use this data to fine-tune our Whisper model. Note that since we've removed punctuation and casing, the Whisper model won't learn to predict these features."
334
+ ]
335
+ }
336
+ ],
337
+ "metadata": {
338
+ "kernelspec": {
339
+ "display_name": "Python 3 (ipykernel)",
340
+ "language": "python",
341
+ "name": "python3"
342
+ },
343
+ "language_info": {
344
+ "codemirror_mode": {
345
+ "name": "ipython",
346
+ "version": 3
347
+ },
348
+ "file_extension": ".py",
349
+ "mimetype": "text/x-python",
350
+ "name": "python",
351
+ "nbconvert_exporter": "python",
352
+ "pygments_lexer": "ipython3",
353
+ "version": "3.8.9"
354
+ }
355
+ },
356
+ "nbformat": 4,
357
+ "nbformat_minor": 5
358
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
normalizer.json ADDED
@@ -0,0 +1,1742 @@
1
+ {
2
+ "accessorise": "accessorize",
3
+ "accessorised": "accessorized",
4
+ "accessorises": "accessorizes",
5
+ "accessorising": "accessorizing",
6
+ "acclimatisation": "acclimatization",
7
+ "acclimatise": "acclimatize",
8
+ "acclimatised": "acclimatized",
9
+ "acclimatises": "acclimatizes",
10
+ "acclimatising": "acclimatizing",
11
+ "accoutrements": "accouterments",
12
+ "aeon": "eon",
13
+ "aeons": "eons",
14
+ "aerogramme": "aerogram",
15
+ "aerogrammes": "aerograms",
16
+ "aeroplane": "airplane",
17
+ "aeroplanes": "airplanes",
18
+ "aesthete": "esthete",
19
+ "aesthetes": "esthetes",
20
+ "aesthetic": "esthetic",
21
+ "aesthetically": "esthetically",
22
+ "aesthetics": "esthetics",
23
+ "aetiology": "etiology",
24
+ "ageing": "aging",
25
+ "aggrandisement": "aggrandizement",
26
+ "agonise": "agonize",
27
+ "agonised": "agonized",
28
+ "agonises": "agonizes",
29
+ "agonising": "agonizing",
30
+ "agonisingly": "agonizingly",
31
+ "almanack": "almanac",
32
+ "almanacks": "almanacs",
33
+ "aluminium": "aluminum",
34
+ "amortisable": "amortizable",
35
+ "amortisation": "amortization",
36
+ "amortisations": "amortizations",
37
+ "amortise": "amortize",
38
+ "amortised": "amortized",
39
+ "amortises": "amortizes",
40
+ "amortising": "amortizing",
41
+ "amphitheatre": "amphitheater",
42
+ "amphitheatres": "amphitheaters",
43
+ "anaemia": "anemia",
44
+ "anaemic": "anemic",
45
+ "anaesthesia": "anesthesia",
46
+ "anaesthetic": "anesthetic",
47
+ "anaesthetics": "anesthetics",
48
+ "anaesthetise": "anesthetize",
49
+ "anaesthetised": "anesthetized",
50
+ "anaesthetises": "anesthetizes",
51
+ "anaesthetising": "anesthetizing",
52
+ "anaesthetist": "anesthetist",
53
+ "anaesthetists": "anesthetists",
54
+ "anaesthetize": "anesthetize",
55
+ "anaesthetized": "anesthetized",
56
+ "anaesthetizes": "anesthetizes",
57
+ "anaesthetizing": "anesthetizing",
58
+ "analogue": "analog",
59
+ "analogues": "analogs",
60
+ "analyse": "analyze",
61
+ "analysed": "analyzed",
62
+ "analyses": "analyzes",
63
+ "analysing": "analyzing",
64
+ "anglicise": "anglicize",
65
+ "anglicised": "anglicized",
66
+ "anglicises": "anglicizes",
67
+ "anglicising": "anglicizing",
68
+ "annualised": "annualized",
69
+ "antagonise": "antagonize",
70
+ "antagonised": "antagonized",
71
+ "antagonises": "antagonizes",
72
+ "antagonising": "antagonizing",
73
+ "apologise": "apologize",
74
+ "apologised": "apologized",
75
+ "apologises": "apologizes",
76
+ "apologising": "apologizing",
77
+ "appal": "appall",
78
+ "appals": "appalls",
79
+ "appetiser": "appetizer",
80
+ "appetisers": "appetizers",
81
+ "appetising": "appetizing",
82
+ "appetisingly": "appetizingly",
83
+ "arbour": "arbor",
84
+ "arbours": "arbors",
85
+ "archaeologically": "archeologically",
86
+ "archaeologist": "archeologist",
87
+ "archaeologists": "archeologists",
88
+ "archaeology": "archeology</span>",
89
+ "archeological": "archaeological",
90
+ "ardour": "ardor",
91
+ "armour": "armor",
92
+ "armoured": "armored",
93
+ "armourer": "armorer",
94
+ "armourers": "armorers",
95
+ "armouries": "armories",
96
+ "armoury": "armory",
97
+ "artefact": "artifact",
98
+ "artefacts": "artifacts",
99
+ "authorise": "authorize",
100
+ "authorised": "authorized",
101
+ "authorises": "authorizes",
102
+ "authorising": "authorizing",
103
+ "axe": "ax",
104
+ "backpedalled": "backpedaled",
105
+ "backpedalling": "backpedaling",
106
+ "bannister": "banister",
107
+ "bannisters": "banisters",
108
+ "baptise": "baptize",
109
+ "baptised": "baptized",
110
+ "baptises": "baptizes",
111
+ "baptising": "baptizing",
112
+ "bastardise": "bastardize",
113
+ "bastardised": "bastardized",
114
+ "bastardises": "bastardizes",
115
+ "bastardising": "bastardizing",
116
+ "battleax": "battleaxe",
117
+ "baulk": "balk",
118
+ "baulked": "balked",
119
+ "baulking": "balking",
120
+ "baulks": "balks",
121
+ "bedevilled": "bedeviled",
122
+ "bedevilling": "bedeviling",
123
+ "behaviour": "behavior",
124
+ "behavioural": "behavioral",
125
+ "behaviourism": "behaviorism",
126
+ "behaviourist": "behaviorist",
127
+ "behaviourists": "behaviorists",
128
+ "behaviours": "behaviors",
129
+ "behove": "behoove",
130
+ "behoved": "behooved",
131
+ "behoves": "behooves",
132
+ "bejewelled": "bejeweled",
133
+ "belabour": "belabor",
134
+ "belaboured": "belabored",
135
+ "belabouring": "belaboring",
136
+ "belabours": "belabors",
137
+ "bevelled": "beveled",
138
+ "bevvies": "bevies",
139
+ "bevvy": "bevy",
140
+ "biassed": "biased",
141
+ "biassing": "biasing",
142
+ "bingeing": "binging",
143
+ "bougainvillaea": "bougainvillea",
144
+ "bougainvillaeas": "bougainvilleas",
145
+ "bowdlerise": "bowdlerize",
146
+ "bowdlerised": "bowdlerized",
147
+ "bowdlerises": "bowdlerizes",
148
+ "bowdlerising": "bowdlerizing",
149
+ "breathalyse": "breathalyze",
150
+ "breathalysed": "breathalyzed",
151
+ "breathalyser": "breathalyzer",
152
+ "breathalysers": "breathalyzers",
153
+ "breathalyses": "breathalyzes",
154
+ "breathalysing": "breathalyzing",
155
+ "brutalise": "brutalize",
156
+ "brutalised": "brutalized",
157
+ "brutalises": "brutalizes",
158
+ "brutalising": "brutalizing",
159
+ "busses": "buses",
160
+ "bussing": "busing",
161
+ "caesarean": "cesarean",
162
+ "caesareans": "cesareans",
163
+ "calibre": "caliber",
164
+ "calibres": "calibers",
165
+ "calliper": "caliper",
166
+ "callipers": "calipers",
167
+ "callisthenics": "calisthenics",
168
+ "canalise": "canalize",
169
+ "canalised": "canalized",
170
+ "canalises": "canalizes",
171
+ "canalising": "canalizing",
172
+ "cancelation": "cancellation",
173
+ "cancelations": "cancellations",
174
+ "cancelled": "canceled",
175
+ "cancelling": "canceling",
176
+ "candour": "candor",
177
+ "cannibalise": "cannibalize",
178
+ "cannibalised": "cannibalized",
179
+ "cannibalises": "cannibalizes",
180
+ "cannibalising": "cannibalizing",
181
+ "canonise": "canonize",
182
+ "canonised": "canonized",
183
+ "canonises": "canonizes",
184
+ "canonising": "canonizing",
185
+ "capitalise": "capitalize",
186
+ "capitalised": "capitalized",
187
+ "capitalises": "capitalizes",
188
+ "capitalising": "capitalizing",
189
+ "caramelise": "caramelize",
190
+ "caramelised": "caramelized",
191
+ "caramelises": "caramelizes",
192
+ "caramelising": "caramelizing",
193
+ "carbonise": "carbonize",
194
+ "carbonised": "carbonized",
195
+ "carbonises": "carbonizes",
196
+ "carbonising": "carbonizing",
197
+ "carolled": "caroled",
198
+ "carolling": "caroling",
199
+ "catalogue": "catalog",
200
+ "catalogued": "cataloged",
201
+ "catalogues": "catalogs",
202
+ "cataloguing": "cataloging",
203
+ "catalyse": "catalyze",
204
+ "catalysed": "catalyzed",
205
+ "catalyses": "catalyzes",
206
+ "catalysing": "catalyzing",
207
+ "categorise": "categorize",
208
+ "categorised": "categorized",
209
+ "categorises": "categorizes",
210
+ "categorising": "categorizing",
211
+ "cauterise": "cauterize",
212
+ "cauterised": "cauterized",
213
+ "cauterises": "cauterizes",
214
+ "cauterising": "cauterizing",
215
+ "cavilled": "caviled",
216
+ "cavilling": "caviling",
217
+ "centigramme": "centigram",
218
+ "centigrammes": "centigrams",
219
+ "centilitre": "centiliter",
220
+ "centilitres": "centiliters",
221
+ "centimetre": "centimeter",
222
+ "centimetres": "centimeters",
223
+ "centralise": "centralize",
224
+ "centralised": "centralized",
225
+ "centralises": "centralizes",
226
+ "centralising": "centralizing",
227
+ "centre": "center",
228
+ "centred": "centered",
229
+ "centrefold": "centerfold",
230
+ "centrefolds": "centerfolds",
231
+ "centrepiece": "centerpiece",
232
+ "centrepieces": "centerpieces",
233
+ "centres": "centers",
234
+ "channelled": "channeled",
235
+ "channelling": "channeling",
236
+ "characterise": "characterize",
237
+ "characterised": "characterized",
238
+ "characterises": "characterizes",
239
+ "characterising": "characterizing",
240
+ "cheque": "check",
241
+ "chequebook": "checkbook",
242
+ "chequebooks": "checkbooks",
243
+ "chequered": "checkered",
244
+ "cheques": "checks",
245
+ "chilli": "chili",
246
+ "chimaera": "chimera",
247
+ "chimaeras": "chimeras",
248
+ "chiselled": "chiseled",
249
+ "chiselling": "chiseling",
250
+ "circularise": "circularize",
251
+ "circularised": "circularized",
252
+ "circularises": "circularizes",
253
+ "circularising": "circularizing",
254
+ "civilise": "civilize",
255
+ "civilised": "civilized",
256
+ "civilises": "civilizes",
257
+ "civilising": "civilizing",
258
+ "clamour": "clamor",
259
+ "clamoured": "clamored",
260
+ "clamouring": "clamoring",
261
+ "clamours": "clamors",
262
+ "clangour": "clangor",
263
+ "clarinettist": "clarinetist",
264
+ "clarinettists": "clarinetists",
265
+ "collectivise": "collectivize",
266
+ "collectivised": "collectivized",
267
+ "collectivises": "collectivizes",
268
+ "collectivising": "collectivizing",
269
+ "colonisation": "colonization",
270
+ "colonise": "colonize",
271
+ "colonised": "colonized",
272
+ "coloniser": "colonizer",
273
+ "colonisers": "colonizers",
274
+ "colonises": "colonizes",
275
+ "colonising": "colonizing",
276
+ "colour": "color",
277
+ "colourant": "colorant",
278
+ "colourants": "colorants",
279
+ "coloured": "colored",
280
+ "coloureds": "coloreds",
281
+ "colourful": "colorful",
282
+ "colourfully": "colorfully",
283
+ "colouring": "coloring",
284
+ "colourize": "colorize",
285
+ "colourized": "colorized",
286
+ "colourizes": "colorizes",
287
+ "colourizing": "colorizing",
288
+ "colourless": "colorless",
289
+ "colours": "colors",
290
+ "commercialise": "commercialize",
291
+ "commercialised": "commercialized",
292
+ "commercialises": "commercializes",
293
+ "commercialising": "commercializing",
294
+ "compartmentalise": "compartmentalize",
295
+ "compartmentalised": "compartmentalized",
296
+ "compartmentalises": "compartmentalizes",
297
+ "compartmentalising": "compartmentalizing",
298
+ "computerise": "computerize",
299
+ "computerised": "computerized",
300
+ "computerises": "computerizes",
301
+ "computerising": "computerizing",
302
+ "conceptualise": "conceptualize",
303
+ "conceptualised": "conceptualized",
304
+ "conceptualises": "conceptualizes",
305
+ "conceptualising": "conceptualizing",
306
+ "connexion": "connection",
307
+ "connexions": "connections",
308
+ "contextualise": "contextualize",
309
+ "contextualised": "contextualized",
310
+ "contextualises": "contextualizes",
311
+ "contextualising": "contextualizing",
312
+ "cosier": "cozier",
313
+ "cosies": "cozies",
314
+ "cosiest": "coziest",
315
+ "cosily": "cozily",
316
+ "cosiness": "coziness",
317
+ "cosy": "cozy",
318
+ "councillor": "councilor",
319
+ "councillors": "councilors",
320
+ "counselled": "counseled",
321
+ "counselling": "counseling",
322
+ "counsellor": "counselor",
323
+ "counsellors": "counselors",
324
+ "crenelated": "crenellated",
325
+ "criminalise": "criminalize",
326
+ "criminalised": "criminalized",
327
+ "criminalises": "criminalizes",
328
+ "criminalising": "criminalizing",
329
+ "criticise": "criticize",
330
+ "criticised": "criticized",
331
+ "criticises": "criticizes",
332
+ "criticising": "criticizing",
333
+ "crueller": "crueler",
334
+ "cruellest": "cruelest",
335
+ "crystallisation": "crystallization",
336
+ "crystallise": "crystallize",
337
+ "crystallised": "crystallized",
338
+ "crystallises": "crystallizes",
339
+ "crystallising": "crystallizing",
340
+ "cudgelled": "cudgeled",
341
+ "cudgelling": "cudgeling",
342
+ "customise": "customize",
343
+ "customised": "customized",
344
+ "customises": "customizes",
345
+ "customising": "customizing",
346
+ "cypher": "cipher",
347
+ "cyphers": "ciphers",
348
+ "decentralisation": "decentralization",
349
+ "decentralise": "decentralize",
350
+ "decentralised": "decentralized",
351
+ "decentralises": "decentralizes",
352
+ "decentralising": "decentralizing",
353
+ "decriminalisation": "decriminalization",
354
+ "decriminalise": "decriminalize",
355
+ "decriminalised": "decriminalized",
356
+ "decriminalises": "decriminalizes",
357
+ "decriminalising": "decriminalizing",
358
+ "defence": "defense",
359
+ "defenceless": "defenseless",
360
+ "defences": "defenses",
361
+ "dehumanisation": "dehumanization",
362
+ "dehumanise": "dehumanize",
363
+ "dehumanised": "dehumanized",
364
+ "dehumanises": "dehumanizes",
365
+ "dehumanising": "dehumanizing",
366
+ "demeanour": "demeanor",
367
+ "demilitarisation": "demilitarization",
368
+ "demilitarise": "demilitarize",
369
+ "demilitarised": "demilitarized",
370
+ "demilitarises": "demilitarizes",
371
+ "demilitarising": "demilitarizing",
372
+ "demobilisation": "demobilization",
373
+ "demobilise": "demobilize",
374
+ "demobilised": "demobilized",
375
+ "demobilises": "demobilizes",
376
+ "demobilising": "demobilizing",
377
+ "democratisation": "democratization",
378
+ "democratise": "democratize",
379
+ "democratised": "democratized",
380
+ "democratises": "democratizes",
381
+ "democratising": "democratizing",
382
+ "demonise": "demonize",
383
+ "demonised": "demonized",
384
+ "demonises": "demonizes",
385
+ "demonising": "demonizing",
386
+ "demoralisation": "demoralization",
387
+ "demoralise": "demoralize",
388
+ "demoralised": "demoralized",
389
+ "demoralises": "demoralizes",
390
+ "demoralising": "demoralizing",
391
+ "denationalisation": "denationalization",
392
+ "denationalise": "denationalize",
393
+ "denationalised": "denationalized",
394
+ "denationalises": "denationalizes",
395
+ "denationalising": "denationalizing",
396
+ "deodorise": "deodorize",
397
+ "deodorised": "deodorized",
398
+ "deodorises": "deodorizes",
399
+ "deodorising": "deodorizing",
400
+ "depersonalise": "depersonalize",
401
+ "depersonalised": "depersonalized",
402
+ "depersonalises": "depersonalizes",
403
+ "depersonalising": "depersonalizing",
404
+ "deputise": "deputize",
405
+ "deputised": "deputized",
406
+ "deputises": "deputizes",
407
+ "deputising": "deputizing",
408
+ "desensitisation": "desensitization",
409
+ "desensitise": "desensitize",
410
+ "desensitised": "desensitized",
411
+ "desensitises": "desensitizes",
412
+ "desensitising": "desensitizing",
413
+ "destabilisation": "destabilization",
414
+ "destabilise": "destabilize",
415
+ "destabilised": "destabilized",
416
+ "destabilises": "destabilizes",
417
+ "destabilising": "destabilizing",
418
+ "dialled": "dialed",
419
+ "dialling": "dialing",
420
+ "dialogue": "dialog",
421
+ "dialogues": "dialogs",
422
+ "diarrhoea": "diarrhea",
423
+ "digitise": "digitize",
424
+ "digitised": "digitized",
425
+ "digitises": "digitizes",
426
+ "digitising": "digitizing",
427
+ "disc": "disk",
428
+ "discolour": "discolor",
429
+ "discoloured": "discolored",
430
+ "discolouring": "discoloring",
431
+ "discolours": "discolors",
432
+ "discs": "disks",
433
+ "disembowelled": "disemboweled",
434
+ "disembowelling": "disemboweling",
435
+ "disfavour": "disfavor",
436
+ "dishevelled": "disheveled",
437
+ "dishonour": "dishonor",
438
+ "dishonourable": "dishonorable",
439
+ "dishonourably": "dishonorably",
440
+ "dishonoured": "dishonored",
441
+ "dishonouring": "dishonoring",
442
+ "dishonours": "dishonors",
443
+ "disorganisation": "disorganization",
444
+ "disorganised": "disorganized",
445
+ "distil": "distill",
446
+ "distils": "distills",
447
+ "dramatisation": "dramatization",
448
+ "dramatisations": "dramatizations",
449
+ "dramatise": "dramatize",
450
+ "dramatised": "dramatized",
451
+ "dramatises": "dramatizes",
452
+ "dramatising": "dramatizing",
453
+ "draught": "draft",
454
+ "draughtboard": "draftboard",
455
+ "draughtboards": "draftboards",
456
+ "draughtier": "draftier",
457
+ "draughtiest": "draftiest",
458
+ "draughts": "drafts",
459
+ "draughtsman": "draftsman",
460
+ "draughtsmanship": "draftsmanship",
461
+ "draughtsmen": "draftsmen",
462
+ "draughtswoman": "draftswoman",
463
+ "draughtswomen": "draftswomen",
464
+ "draughty": "drafty",
465
+ "drivelled": "driveled",
466
+ "drivelling": "driveling",
467
+ "duelled": "dueled",
468
+ "duelling": "dueling",
469
+ "economise": "economize",
470
+ "economised": "economized",
471
+ "economises": "economizes",
472
+ "economising": "economizing",
473
+ "editorialise": "editorialize",
474
+ "editorialised": "editorialized",
475
+ "editorialises": "editorializes",
476
+ "editorialising": "editorializing",
477
+ "edoema": "edema",
478
+ "empathise": "empathize",
479
+ "empathised": "empathized",
480
+ "empathises": "empathizes",
481
+ "empathising": "empathizing",
482
+ "emphasise": "emphasize",
483
+ "emphasised": "emphasized",
484
+ "emphasises": "emphasizes",
485
+ "emphasising": "emphasizing",
486
+ "enamelled": "enameled",
487
+ "enamelling": "enameling",
488
+ "enamoured": "enamored",
489
+ "encyclopaedia": "encyclopedia",
490
+ "encyclopaedias": "encyclopedias",
491
+ "encyclopaedic": "encyclopedic",
492
+ "endeavour": "endeavor",
493
+ "endeavoured": "endeavored",
494
+ "endeavouring": "endeavoring",
495
+ "endeavours": "endeavors",
496
+ "energise": "energize",
497
+ "energised": "energized",
498
+ "energises": "energizes",
499
+ "energising": "energizing",
500
+ "enrol": "enroll",
501
+ "enrols": "enrolls",
502
+ "enthral": "enthrall",
503
+ "enthrals": "enthralls",
504
+ "epaulette": "epaulet",
505
+ "epaulettes": "epaulets",
506
+ "epicentre": "epicenter",
507
+ "epicentres": "epicenters",
508
+ "epilogue": "epilog",
509
+ "epilogues": "epilogs",
510
+ "epitomise": "epitomize",
511
+ "epitomised": "epitomized",
512
+ "epitomises": "epitomizes",
513
+ "epitomising": "epitomizing",
514
+ "equalisation": "equalization",
515
+ "equalise": "equalize",
516
+ "equalised": "equalized",
517
+ "equaliser": "equalizer",
518
+ "equalisers": "equalizers",
519
+ "equalises": "equalizes",
520
+ "equalising": "equalizing",
521
+ "eulogise": "eulogize",
522
+ "eulogised": "eulogized",
523
+ "eulogises": "eulogizes",
524
+ "eulogising": "eulogizing",
525
+ "evangelise": "evangelize",
526
+ "evangelised": "evangelized",
527
+ "evangelises": "evangelizes",
528
+ "evangelising": "evangelizing",
529
+ "exorcise": "exorcize",
530
+ "exorcised": "exorcized",
531
+ "exorcises": "exorcizes",
532
+ "exorcising": "exorcizing",
533
+ "extemporisation": "extemporization",
534
+ "extemporise": "extemporize",
535
+ "extemporised": "extemporized",
536
+ "extemporises": "extemporizes",
537
+ "extemporising": "extemporizing",
538
+ "externalisation": "externalization",
539
+ "externalisations": "externalizations",
540
+ "externalise": "externalize",
541
+ "externalised": "externalized",
542
+ "externalises": "externalizes",
543
+ "externalising": "externalizing",
544
+ "factorise": "factorize",
545
+ "factorised": "factorized",
546
+ "factorises": "factorizes",
547
+ "factorising": "factorizing",
548
+ "faecal": "fecal",
549
+ "faeces": "feces",
550
+ "familiarisation": "familiarization",
551
+ "familiarise": "familiarize",
552
+ "familiarised": "familiarized",
553
+ "familiarises": "familiarizes",
554
+ "familiarising": "familiarizing",
555
+ "fantasise": "fantasize",
556
+ "fantasised": "fantasized",
557
+ "fantasises": "fantasizes",
558
+ "fantasising": "fantasizing",
559
+ "favour": "favor",
560
+ "favourable": "favorable",
561
+ "favourably": "favorably",
562
+ "favoured": "favored",
563
+ "favouring": "favoring",
564
+ "favourite": "favorite",
565
+ "favourites": "favorites",
566
+ "favouritism": "favoritism",
567
+ "favours": "favors",
568
+ "feminise": "feminize",
569
+ "feminised": "feminized",
570
+ "feminises": "feminizes",
571
+ "feminising": "feminizing",
572
+ "fertilisation": "fertilization",
573
+ "fertilise": "fertilize",
574
+ "fertilised": "fertilized",
575
+ "fertiliser": "fertilizer",
576
+ "fertilisers": "fertilizers",
577
+ "fertilises": "fertilizes",
578
+ "fertilising": "fertilizing",
579
+ "fervour": "fervor",
580
+ "fibre": "fiber",
581
+ "fibreglass": "fiberglass",
582
+ "fibres": "fibers",
583
+ "fictionalisation": "fictionalization",
584
+ "fictionalisations": "fictionalizations",
585
+ "fictionalise": "fictionalize",
586
+ "fictionalised": "fictionalized",
587
+ "fictionalises": "fictionalizes",
588
+ "fictionalising": "fictionalizing",
589
+ "fillet": "filet",
590
+ "filleted": "fileted",
591
+ "filleting": "fileting",
592
+ "fillets": "filets",
593
+ "finalisation": "finalization",
594
+ "finalise": "finalize",
595
+ "finalised": "finalized",
596
+ "finalises": "finalizes",
597
+ "finalising": "finalizing",
598
+ "flautist": "flutist",
599
+ "flautists": "flutists",
600
+ "flavour": "flavor",
601
+ "flavoured": "flavored",
602
+ "flavouring": "flavoring",
603
+ "flavourings": "flavorings",
604
+ "flavourless": "flavorless",
605
+ "flavours": "flavors",
606
+ "flavoursome": "flavorsome",
607
+ "flyer / flier": "flier / flyer",
608
+ "foetal": "fetal",
609
+ "foetid": "fetid",
610
+ "foetus": "fetus",
611
+ "foetuses": "fetuses",
612
+ "formalisation": "formalization",
613
+ "formalise": "formalize",
614
+ "formalised": "formalized",
615
+ "formalises": "formalizes",
616
+ "formalising": "formalizing",
617
+ "fossilisation": "fossilization",
618
+ "fossilise": "fossilize",
619
+ "fossilised": "fossilized",
620
+ "fossilises": "fossilizes",
621
+ "fossilising": "fossilizing",
622
+ "fraternisation": "fraternization",
623
+ "fraternise": "fraternize",
624
+ "fraternised": "fraternized",
625
+ "fraternises": "fraternizes",
626
+ "fraternising": "fraternizing",
627
+ "fulfil": "fulfill",
628
+ "fulfilment": "fulfillment",
629
+ "fulfils": "fulfills",
630
+ "funnelled": "funneled",
631
+ "funnelling": "funneling",
632
+ "gage": "gauge",
633
+ "gaged": "gauged",
634
+ "gages": "gauges",
635
+ "gaging": "gauging",
636
+ "galvanise": "galvanize",
637
+ "galvanised": "galvanized",
638
+ "galvanises": "galvanizes",
639
+ "galvanising": "galvanizing",
640
+ "gambolled": "gamboled",
641
+ "gambolling": "gamboling",
642
+ "gaol": "jail",
643
+ "gaolbird": "jailbird",
644
+ "gaolbirds": "jailbirds",
645
+ "gaolbreak": "jailbreak",
646
+ "gaolbreaks": "jailbreaks",
647
+ "gaoled": "jailed",
648
+ "gaoler": "jailer",
649
+ "gaolers": "jailers",
650
+ "gaoling": "jailing",
651
+ "gaols": "jails",
652
+ "gasses": "gases",
653
+ "generalisation": "generalization",
654
+ "generalisations": "generalizations",
655
+ "generalise": "generalize",
656
+ "generalised": "generalized",
657
+ "generalises": "generalizes",
658
+ "generalising": "generalizing",
659
+ "ghettoise": "ghettoize",
660
+ "ghettoised": "ghettoized",
661
+ "ghettoises": "ghettoizes",
662
+ "ghettoising": "ghettoizing",
663
+ "gipsies": "gypsies",
664
+ "glamor": "glamour",
665
+ "glamorise": "glamorize",
666
+ "glamorised": "glamorized",
667
+ "glamorises": "glamorizes",
668
+ "glamorising": "glamorizing",
669
+ "globalisation": "globalization",
670
+ "globalise": "globalize",
671
+ "globalised": "globalized",
672
+ "globalises": "globalizes",
673
+ "globalising": "globalizing",
674
+ "glueing": "gluing",
675
+ "goitre": "goiter",
676
+ "goitres": "goiters",
677
+ "gonorrhoea": "gonorrhea",
678
+ "gramme": "gram",
679
+ "grammes": "grams",
680
+ "gravelled": "graveled",
681
+ "grey": "gray",
682
+ "greyed": "grayed",
683
+ "greying": "graying",
684
+ "greyish": "grayish",
685
+ "greyness": "grayness",
686
+ "greys": "grays",
687
+ "grovelled": "groveled",
688
+ "grovelling": "groveling",
689
+ "groyne": "groin",
690
+ "groynes": "groins",
691
+ "gruelling": "grueling",
692
+ "gruellingly": "gruelingly",
693
+ "gryphon": "griffin",
694
+ "gryphons": "griffins",
695
+ "gynaecological": "gynecological",
696
+ "gynaecologist": "gynecologist",
697
+ "gynaecologists": "gynecologists",
698
+ "gynaecology": "gynecology",
699
+ "haematological": "hematological",
700
+ "haematologist": "hematologist",
701
+ "haematologists": "hematologists",
702
+ "haematology": "hematology",
703
+ "haemoglobin": "hemoglobin",
704
+ "haemophilia": "hemophilia",
705
+ "haemophiliac": "hemophiliac",
706
+ "haemophiliacs": "hemophiliacs",
707
+ "haemorrhage": "hemorrhage",
708
+ "haemorrhaged": "hemorrhaged",
709
+ "haemorrhages": "hemorrhages",
710
+ "haemorrhaging": "hemorrhaging",
711
+ "haemorrhoids": "hemorrhoids",
712
+ "harbour": "harbor",
713
+ "harboured": "harbored",
714
+ "harbouring": "harboring",
715
+ "harbours": "harbors",
716
+ "harmonisation": "harmonization",
717
+ "harmonise": "harmonize",
718
+ "harmonised": "harmonized",
719
+ "harmonises": "harmonizes",
720
+ "harmonising": "harmonizing",
721
+ "homoeopath": "homeopath",
722
+ "homoeopathic": "homeopathic",
723
+ "homoeopaths": "homeopaths",
724
+ "homoeopathy": "homeopathy",
725
+ "homogenise": "homogenize",
726
+ "homogenised": "homogenized",
727
+ "homogenises": "homogenizes",
728
+ "homogenising": "homogenizing",
729
+ "honour": "honor",
730
+ "honourable": "honorable",
731
+ "honourably": "honorably",
732
+ "honoured": "honored",
733
+ "honouring": "honoring",
734
+ "honours": "honors",
735
+ "hospitalisation": "hospitalization",
736
+ "hospitalise": "hospitalize",
737
+ "hospitalised": "hospitalized",
738
+ "hospitalises": "hospitalizes",
739
+ "hospitalising": "hospitalizing",
740
+ "humanise": "humanize",
741
+ "humanised": "humanized",
742
+ "humanises": "humanizes",
743
+ "humanising": "humanizing",
744
+ "humour": "humor",
745
+ "humoured": "humored",
746
+ "humouring": "humoring",
747
+ "humourless": "humorless",
748
+ "humours": "humors",
749
+ "hybridise": "hybridize",
750
+ "hybridised": "hybridized",
751
+ "hybridises": "hybridizes",
752
+ "hybridising": "hybridizing",
753
+ "hypnotise": "hypnotize",
754
+ "hypnotised": "hypnotized",
755
+ "hypnotises": "hypnotizes",
756
+ "hypnotising": "hypnotizing",
757
+ "hypothesise": "hypothesize",
758
+ "hypothesised": "hypothesized",
759
+ "hypothesises": "hypothesizes",
760
+ "hypothesising": "hypothesizing",
761
+ "idealisation": "idealization",
762
+ "idealise": "idealize",
763
+ "idealised": "idealized",
764
+ "idealises": "idealizes",
765
+ "idealising": "idealizing",
766
+ "idolise": "idolize",
767
+ "idolised": "idolized",
768
+ "idolises": "idolizes",
769
+ "idolising": "idolizing",
770
+ "immobilisation": "immobilization",
771
+ "immobilise": "immobilize",
772
+ "immobilised": "immobilized",
773
+ "immobiliser": "immobilizer",
774
+ "immobilisers": "immobilizers",
775
+ "immobilises": "immobilizes",
776
+ "immobilising": "immobilizing",
777
+ "immortalise": "immortalize",
778
+ "immortalised": "immortalized",
779
+ "immortalises": "immortalizes",
780
+ "immortalising": "immortalizing",
781
+ "immunisation": "immunization",
782
+ "immunise": "immunize",
783
+ "immunised": "immunized",
784
+ "immunises": "immunizes",
785
+ "immunising": "immunizing",
786
+ "impanelled": "impaneled",
787
+ "impanelling": "impaneling",
788
+ "imperilled": "imperiled",
789
+ "imperilling": "imperiling",
790
+ "individualise": "individualize",
791
+ "individualised": "individualized",
792
+ "individualises": "individualizes",
793
+ "individualising": "individualizing",
794
+ "industrialise": "industrialize",
795
+ "industrialised": "industrialized",
796
+ "industrialises": "industrializes",
797
+ "industrialising": "industrializing",
798
+ "inflexion": "inflection",
799
+ "inflexions": "inflections",
800
+ "initialise": "initialize",
801
+ "initialised": "initialized",
802
+ "initialises": "initializes",
803
+ "initialising": "initializing",
804
+ "initialled": "initialed",
805
+ "initialling": "initialing",
806
+ "instal": "install",
807
+ "instalment": "installment",
808
+ "instalments": "installments",
809
+ "instals": "installs",
810
+ "instil": "instill",
811
+ "instils": "instills",
812
+ "institutionalisation": "institutionalization",
813
+ "institutionalise": "institutionalize",
814
+ "institutionalised": "institutionalized",
815
+ "institutionalises": "institutionalizes",
816
+ "institutionalising": "institutionalizing",
817
+ "intellectualise": "intellectualize",
818
+ "intellectualised": "intellectualized",
819
+ "intellectualises": "intellectualizes",
820
+ "intellectualising": "intellectualizing",
821
+ "internalisation": "internalization",
822
+ "internalise": "internalize",
823
+ "internalised": "internalized",
824
+ "internalises": "internalizes",
825
+ "internalising": "internalizing",
826
+ "internationalisation": "internationalization",
827
+ "internationalise": "internationalize",
828
+ "internationalised": "internationalized",
829
+ "internationalises": "internationalizes",
830
+ "internationalising": "internationalizing",
831
+ "ionisation": "ionization",
832
+ "ionise": "ionize",
833
+ "ionised": "ionized",
834
+ "ioniser": "ionizer",
835
+ "ionisers": "ionizers",
836
+ "ionises": "ionizes",
837
+ "ionising": "ionizing",
838
+ "italicise": "italicize",
839
+ "italicised": "italicized",
840
+ "italicises": "italicizes",
841
+ "italicising": "italicizing",
842
+ "itemise": "itemize",
843
+ "itemised": "itemized",
844
+ "itemises": "itemizes",
845
+ "itemising": "itemizing",
846
+ "jeopardise": "jeopardize",
847
+ "jeopardised": "jeopardized",
848
+ "jeopardises": "jeopardizes",
849
+ "jeopardising": "jeopardizing",
850
+ "jewelled": "jeweled",
851
+ "jeweller": "jeweler",
852
+ "jewellers": "jewelers",
853
+ "jewellery": "jewelry",
854
+ "judgement": "judgment",
855
+ "kilogramme": "kilogram",
856
+ "kilogrammes": "kilograms",
857
+ "kilometre": "kilometer",
858
+ "kilometres": "kilometers",
859
+ "labelled": "labeled",
860
+ "labelling": "labeling",
861
+ "labour": "labor",
862
+ "laboured": "labored",
863
+ "labourer": "laborer",
864
+ "labourers": "laborers",
865
+ "labouring": "laboring",
866
+ "labours": "labors",
867
+ "lacklustre": "lackluster",
868
+ "legalisation": "legalization",
869
+ "legalise": "legalize",
870
+ "legalised": "legalized",
871
+ "legalises": "legalizes",
872
+ "legalising": "legalizing",
873
+ "legitimise": "legitimize",
874
+ "legitimised": "legitimized",
875
+ "legitimises": "legitimizes",
876
+ "legitimising": "legitimizing",
877
+ "leukaemia": "leukemia",
878
+ "levelled": "leveled",
879
+ "leveller": "leveler",
880
+ "levellers": "levelers",
881
+ "levelling": "leveling",
882
+ "libelled": "libeled",
883
+ "libelling": "libeling",
884
+ "libellous": "libelous",
885
+ "liberalisation": "liberalization",
886
+ "liberalise": "liberalize",
887
+ "liberalised": "liberalized",
888
+ "liberalises": "liberalizes",
889
+ "liberalising": "liberalizing",
890
+ "licence": "license",
891
+ "licenced": "licensed",
892
+ "licences": "licenses",
893
+ "licencing": "licensing",
894
+ "likeable": "likable",
895
+ "lionisation": "lionization",
896
+ "lionise": "lionize",
897
+ "lionised": "lionized",
898
+ "lionises": "lionizes",
899
+ "lionising": "lionizing",
900
+ "liquidise": "liquidize",
901
+ "liquidised": "liquidized",
902
+ "liquidiser": "liquidizer",
903
+ "liquidisers": "liquidizers",
904
+ "liquidises": "liquidizes",
905
+ "liquidising": "liquidizing",
906
+ "litre": "liter",
907
+ "litres": "liters",
908
+ "localise": "localize",
909
+ "localised": "localized",
910
+ "localises": "localizes",
911
+ "localising": "localizing",
912
+ "louvre": "louver",
913
+ "louvred": "louvered",
914
+ "louvres": "louvers",
915
+ "lustre": "luster",
916
+ "magnetise": "magnetize",
917
+ "magnetised": "magnetized",
918
+ "magnetises": "magnetizes",
919
+ "magnetising": "magnetizing",
920
+ "manoeuvrability": "maneuverability",
921
+ "manoeuvrable": "maneuverable",
922
+ "manoeuvre": "maneuver",
923
+ "manoeuvred": "maneuvered",
924
+ "manoeuvres": "maneuvers",
925
+ "manoeuvring": "maneuvering",
926
+ "manoeuvrings": "maneuverings",
927
+ "marginalisation": "marginalization",
928
+ "marginalise": "marginalize",
929
+ "marginalised": "marginalized",
930
+ "marginalises": "marginalizes",
931
+ "marginalising": "marginalizing",
932
+ "marshalled": "marshaled",
933
+ "marshalling": "marshaling",
934
+ "marvelled": "marveled",
935
+ "marvelling": "marveling",
936
+ "marvellous": "marvelous",
937
+ "marvellously": "marvelously",
938
+ "materialisation": "materialization",
939
+ "materialise": "materialize",
940
+ "materialised": "materialized",
941
+ "materialises": "materializes",
942
+ "materialising": "materializing",
943
+ "maximisation": "maximization",
944
+ "maximise": "maximize",
945
+ "maximised": "maximized",
946
+ "maximises": "maximizes",
947
+ "maximising": "maximizing",
948
+ "meagre": "meager",
949
+ "mechanisation": "mechanization",
950
+ "mechanise": "mechanize",
951
+ "mechanised": "mechanized",
952
+ "mechanises": "mechanizes",
953
+ "mechanising": "mechanizing",
954
+ "mediaeval": "medieval",
955
+ "memorialise": "memorialize",
956
+ "memorialised": "memorialized",
957
+ "memorialises": "memorializes",
958
+ "memorialising": "memorializing",
959
+ "memorise": "memorize",
960
+ "memorised": "memorized",
961
+ "memorises": "memorizes",
962
+ "memorising": "memorizing",
963
+ "mesmerise": "mesmerize",
964
+ "mesmerised": "mesmerized",
965
+ "mesmerises": "mesmerizes",
966
+ "mesmerising": "mesmerizing",
967
+ "metabolise": "metabolize",
968
+ "metabolised": "metabolized",
969
+ "metabolises": "metabolizes",
970
+ "metabolising": "metabolizing",
971
+ "metre": "meter",
972
+ "metres": "meters",
973
+ "mhm": "hmm",
974
+ "micrometre": "micrometer",
975
+ "micrometres": "micrometers",
976
+ "militarise": "militarize",
977
+ "militarised": "militarized",
978
+ "militarises": "militarizes",
979
+ "militarising": "militarizing",
980
+ "milligramme": "milligram",
981
+ "milligrammes": "milligrams",
982
+ "millilitre": "milliliter",
983
+ "millilitres": "milliliters",
984
+ "millimetre": "millimeter",
985
+ "millimetres": "millimeters",
986
+ "miniaturisation": "miniaturization",
987
+ "miniaturise": "miniaturize",
988
+ "miniaturised": "miniaturized",
989
+ "miniaturises": "miniaturizes",
990
+ "miniaturising": "miniaturizing",
991
+ "minibusses": "minibuses",
992
+ "minimise": "minimize",
993
+ "minimised": "minimized",
994
+ "minimises": "minimizes",
995
+ "minimising": "minimizing",
996
+ "misbehaviour": "misbehavior",
997
+ "misdemeanour": "misdemeanor",
998
+ "misdemeanours": "misdemeanors",
999
+ "misspelt": "misspelled",
1000
+ "mitre": "miter",
1001
+ "mitres": "miters",
1002
+ "mm": "hmm",
1003
+ "mmm": "hmm",
1004
+ "mobilisation": "mobilization",
1005
+ "mobilise": "mobilize",
1006
+ "mobilised": "mobilized",
1007
+ "mobilises": "mobilizes",
1008
+ "mobilising": "mobilizing",
1009
+ "modelled": "modeled",
1010
+ "modeller": "modeler",
1011
+ "modellers": "modelers",
1012
+ "modelling": "modeling",
1013
+ "modernise": "modernize",
1014
+ "modernised": "modernized",
1015
+ "modernises": "modernizes",
1016
+ "modernising": "modernizing",
1017
+ "moisturise": "moisturize",
1018
+ "moisturised": "moisturized",
1019
+ "moisturiser": "moisturizer",
1020
+ "moisturisers": "moisturizers",
1021
+ "moisturises": "moisturizes",
1022
+ "moisturising": "moisturizing",
1023
+ "monologue": "monolog",
1024
+ "monologues": "monologs",
1025
+ "monopolisation": "monopolization",
1026
+ "monopolise": "monopolize",
1027
+ "monopolised": "monopolized",
1028
+ "monopolises": "monopolizes",
1029
+ "monopolising": "monopolizing",
1030
+ "moralise": "moralize",
1031
+ "moralised": "moralized",
1032
+ "moralises": "moralizes",
1033
+ "moralising": "moralizing",
1034
+ "motorised": "motorized",
1035
+ "mould": "mold",
1036
+ "moulded": "molded",
1037
+ "moulder": "molder",
1038
+ "mouldered": "moldered",
1039
+ "mouldering": "moldering",
1040
+ "moulders": "molders",
1041
+ "mouldier": "moldier",
1042
+ "mouldiest": "moldiest",
1043
+ "moulding": "molding",
1044
+ "mouldings": "moldings",
1045
+ "moulds": "molds",
1046
+ "mouldy": "moldy",
1047
+ "moult": "molt",
1048
+ "moulted": "molted",
1049
+ "moulting": "molting",
1050
+ "moults": "molts",
1051
+ "moustache": "mustache",
1052
+ "moustached": "mustached",
1053
+ "moustaches": "mustaches",
1054
+ "moustachioed": "mustachioed",
1055
+ "multicoloured": "multicolored",
1056
+ "nationalisation": "nationalization",
1057
+ "nationalisations": "nationalizations",
1058
+ "nationalise": "nationalize",
1059
+ "nationalised": "nationalized",
1060
+ "nationalises": "nationalizes",
1061
+ "nationalising": "nationalizing",
1062
+ "naturalisation": "naturalization",
1063
+ "naturalise": "naturalize",
1064
+ "naturalised": "naturalized",
1065
+ "naturalises": "naturalizes",
1066
+ "naturalising": "naturalizing",
1067
+ "neighbour": "neighbor",
1068
+ "neighbourhood": "neighborhood",
1069
+ "neighbourhoods": "neighborhoods",
1070
+ "neighbouring": "neighboring",
1071
+ "neighbourliness": "neighborliness",
1072
+ "neighbourly": "neighborly",
1073
+ "neighbours": "neighbors",
1074
+ "neutralisation": "neutralization",
1075
+ "neutralise": "neutralize",
1076
+ "neutralised": "neutralized",
1077
+ "neutralises": "neutralizes",
1078
+ "neutralising": "neutralizing",
1079
+ "normalisation": "normalization",
1080
+ "normalise": "normalize",
1081
+ "normalised": "normalized",
1082
+ "normalises": "normalizes",
1083
+ "normalising": "normalizing",
1084
+ "odour": "odor",
1085
+ "odourless": "odorless",
1086
+ "odours": "odors",
1087
+ "oesophagus": "esophagus",
1088
+ "oesophaguses": "esophaguses",
1089
+ "oestrogen": "estrogen",
1090
+ "offence": "offense",
1091
+ "offences": "offenses",
1092
+ "omelette": "omelet",
1093
+ "omelettes": "omelets",
1094
+ "optimise": "optimize",
1095
+ "optimised": "optimized",
1096
+ "optimises": "optimizes",
1097
+ "optimising": "optimizing",
1098
+ "organisation": "organization",
1099
+ "organisational": "organizational",
1100
+ "organisations": "organizations",
1101
+ "organise": "organize",
1102
+ "organised": "organized",
1103
+ "organiser": "organizer",
1104
+ "organisers": "organizers",
1105
+ "organises": "organizes",
1106
+ "organising": "organizing",
1107
+ "orthopaedic": "orthopedic",
1108
+ "orthopaedics": "orthopedics",
1109
+ "ostracise": "ostracize",
1110
+ "ostracised": "ostracized",
1111
+ "ostracises": "ostracizes",
1112
+ "ostracising": "ostracizing",
1113
+ "outmanoeuvre": "outmaneuver",
1114
+ "outmanoeuvred": "outmaneuvered",
1115
+ "outmanoeuvres": "outmaneuvers",
1116
+ "outmanoeuvring": "outmaneuvering",
1117
+ "overemphasise": "overemphasize",
1118
+ "overemphasised": "overemphasized",
1119
+ "overemphasises": "overemphasizes",
1120
+ "overemphasising": "overemphasizing",
1121
+ "oxidisation": "oxidization",
1122
+ "oxidise": "oxidize",
1123
+ "oxidised": "oxidized",
1124
+ "oxidises": "oxidizes",
1125
+ "oxidising": "oxidizing",
1126
+ "paederast": "pederast",
1127
+ "paederasts": "pederasts",
1128
+ "paediatric": "pediatric",
1129
+ "paediatrician": "pediatrician",
1130
+ "paediatricians": "pediatricians",
1131
+ "paediatrics": "pediatrics",
1132
+ "paedophile": "pedophile",
1133
+ "paedophiles": "pedophiles",
1134
+ "paedophilia": "pedophilia",
1135
+ "palaeolithic": "paleolithic",
1136
+ "palaeontologist": "paleontologist",
1137
+ "palaeontologists": "paleontologists",
1138
+ "palaeontology": "paleontology",
1139
+ "panelled": "paneled",
1140
+ "panelling": "paneling",
1141
+ "panellist": "panelist",
1142
+ "panellists": "panelists",
1143
+ "paralyse": "paralyze",
1144
+ "paralysed": "paralyzed",
1145
+ "paralyses": "paralyzes",
1146
+ "paralysing": "paralyzing",
1147
+ "parcelled": "parceled",
1148
+ "parcelling": "parceling",
1149
+ "parlour": "parlor",
1150
+ "parlours": "parlors",
1151
+ "particularise": "particularize",
1152
+ "particularised": "particularized",
1153
+ "particularises": "particularizes",
1154
+ "particularising": "particularizing",
1155
+ "passivisation": "passivization",
1156
+ "passivise": "passivize",
1157
+ "passivised": "passivized",
1158
+ "passivises": "passivizes",
1159
+ "passivising": "passivizing",
1160
+ "pasteurisation": "pasteurization",
1161
+ "pasteurise": "pasteurize",
1162
+ "pasteurised": "pasteurized",
1163
+ "pasteurises": "pasteurizes",
1164
+ "pasteurising": "pasteurizing",
1165
+ "patronise": "patronize",
1166
+ "patronised": "patronized",
1167
+ "patronises": "patronizes",
1168
+ "patronising": "patronizing",
1169
+ "patronisingly": "patronizingly",
1170
+ "pedalled": "pedaled",
1171
+ "pedalling": "pedaling",
1172
+ "pedestrianisation": "pedestrianization",
1173
+ "pedestrianise": "pedestrianize",
1174
+ "pedestrianised": "pedestrianized",
1175
+ "pedestrianises": "pedestrianizes",
1176
+ "pedestrianising": "pedestrianizing",
1177
+ "penalise": "penalize",
1178
+ "penalised": "penalized",
1179
+ "penalises": "penalizes",
1180
+ "penalising": "penalizing",
1181
+ "pencilled": "penciled",
1182
+ "pencilling": "penciling",
1183
+ "personalise": "personalize",
1184
+ "personalised": "personalized",
1185
+ "personalises": "personalizes",
1186
+ "personalising": "personalizing",
1187
+ "pharmacopoeia": "pharmacopeia",
1188
+ "pharmacopoeias": "pharmacopeias",
1189
+ "philosophise": "philosophize",
1190
+ "philosophised": "philosophized",
1191
+ "philosophises": "philosophizes",
1192
+ "philosophising": "philosophizing",
1193
+ "philtre": "philter",
1194
+ "philtres": "philters",
1195
+ "phoney": "phony",
1196
+ "plagiarise": "plagiarize",
1197
+ "plagiarised": "plagiarized",
1198
+ "plagiarises": "plagiarizes",
1199
+ "plagiarising": "plagiarizing",
1200
+ "plough": "plow",
1201
+ "ploughed": "plowed",
1202
+ "ploughing": "plowing",
1203
+ "ploughman": "plowman",
1204
+ "ploughmen": "plowmen",
1205
+ "ploughs": "plows",
1206
+ "ploughshare": "plowshare",
1207
+ "ploughshares": "plowshares",
1208
+ "polarisation": "polarization",
1209
+ "polarise": "polarize",
1210
+ "polarised": "polarized",
1211
+ "polarises": "polarizes",
1212
+ "polarising": "polarizing",
1213
+ "politicisation": "politicization",
1214
+ "politicise": "politicize",
1215
+ "politicised": "politicized",
1216
+ "politicises": "politicizes",
1217
+ "politicising": "politicizing",
1218
+ "popularisation": "popularization",
1219
+ "popularise": "popularize",
1220
+ "popularised": "popularized",
1221
+ "popularises": "popularizes",
1222
+ "popularising": "popularizing",
1223
+ "pouffe": "pouf",
1224
+ "pouffes": "poufs",
1225
+ "practise": "practice",
1226
+ "practised": "practiced",
1227
+ "practises": "practices",
1228
+ "practising": "practicing",
1229
+ "praesidium": "presidium",
1230
+ "praesidiums": "presidiums",
1231
+ "pressurisation": "pressurization",
1232
+ "pressurise": "pressurize",
1233
+ "pressurised": "pressurized",
1234
+ "pressurises": "pressurizes",
1235
+ "pressurising": "pressurizing",
1236
+ "pretence": "pretense",
1237
+ "pretences": "pretenses",
1238
+ "primaeval": "primeval",
1239
+ "prioritisation": "prioritization",
1240
+ "prioritise": "prioritize",
1241
+ "prioritised": "prioritized",
1242
+ "prioritises": "prioritizes",
1243
+ "prioritising": "prioritizing",
1244
+ "privatisation": "privatization",
1245
+ "privatisations": "privatizations",
1246
+ "privatise": "privatize",
1247
+ "privatised": "privatized",
1248
+ "privatises": "privatizes",
1249
+ "privatising": "privatizing",
1250
+ "professionalisation": "professionalization",
1251
+ "professionalise": "professionalize",
1252
+ "professionalised": "professionalized",
1253
+ "professionalises": "professionalizes",
1254
+ "professionalising": "professionalizing",
1255
+ "programme": "program",
1256
+ "programmes": "programs",
1257
+ "prologue": "prolog",
1258
+ "prologues": "prologs",
1259
+ "propagandise": "propagandize",
1260
+ "propagandised": "propagandized",
1261
+ "propagandises": "propagandizes",
1262
+ "propagandising": "propagandizing",
1263
+ "proselytise": "proselytize",
1264
+ "proselytised": "proselytized",
1265
+ "proselytiser": "proselytizer",
1266
+ "proselytisers": "proselytizers",
1267
+ "proselytises": "proselytizes",
1268
+ "proselytising": "proselytizing",
1269
+ "psychoanalyse": "psychoanalyze",
1270
+ "psychoanalysed": "psychoanalyzed",
1271
+ "psychoanalyses": "psychoanalyzes",
1272
+ "psychoanalysing": "psychoanalyzing",
1273
+ "publicise": "publicize",
1274
+ "publicised": "publicized",
1275
+ "publicises": "publicizes",
1276
+ "publicising": "publicizing",
1277
+ "pulverisation": "pulverization",
1278
+ "pulverise": "pulverize",
1279
+ "pulverised": "pulverized",
1280
+ "pulverises": "pulverizes",
1281
+ "pulverising": "pulverizing",
1282
+ "pummelled": "pummeled",
1283
+ "pummelling": "pummeling",
1284
+ "pyjama": "pajama",
1285
+ "pyjamas": "pajamas",
1286
+ "pzazz": "pizzazz",
1287
+ "quarrelled": "quarreled",
1288
+ "quarrelling": "quarreling",
1289
+ "radicalise": "radicalize",
1290
+ "radicalised": "radicalized",
1291
+ "radicalises": "radicalizes",
1292
+ "radicalising": "radicalizing",
1293
+ "rancour": "rancor",
1294
+ "randomise": "randomize",
1295
+ "randomised": "randomized",
1296
+ "randomises": "randomizes",
1297
+ "randomising": "randomizing",
1298
+ "rationalisation": "rationalization",
1299
+ "rationalisations": "rationalizations",
1300
+ "rationalise": "rationalize",
1301
+ "rationalised": "rationalized",
1302
+ "rationalises": "rationalizes",
1303
+ "rationalising": "rationalizing",
1304
+ "ravelled": "raveled",
1305
+ "ravelling": "raveling",
1306
+ "realisable": "realizable",
1307
+ "realisation": "realization",
1308
+ "realisations": "realizations",
1309
+ "realise": "realize",
1310
+ "realised": "realized",
1311
+ "realises": "realizes",
1312
+ "realising": "realizing",
1313
+ "recognisable": "recognizable",
1314
+ "recognisably": "recognizably",
1315
+ "recognisance": "recognizance",
1316
+ "recognise": "recognize",
1317
+ "recognised": "recognized",
1318
+ "recognises": "recognizes",
1319
+ "recognising": "recognizing",
1320
+ "reconnoitre": "reconnoiter",
1321
+ "reconnoitred": "reconnoitered",
1322
+ "reconnoitres": "reconnoiters",
1323
+ "reconnoitring": "reconnoitering",
1324
+ "refuelled": "refueled",
1325
+ "refuelling": "refueling",
1326
+ "regularisation": "regularization",
1327
+ "regularise": "regularize",
1328
+ "regularised": "regularized",
1329
+ "regularises": "regularizes",
1330
+ "regularising": "regularizing",
1331
+ "remodelled": "remodeled",
1332
+ "remodelling": "remodeling",
1333
+ "remould": "remold",
1334
+ "remoulded": "remolded",
1335
+ "remoulding": "remolding",
1336
+ "remoulds": "remolds",
1337
+ "reorganisation": "reorganization",
1338
+ "reorganisations": "reorganizations",
1339
+ "reorganise": "reorganize",
1340
+ "reorganised": "reorganized",
1341
+ "reorganises": "reorganizes",
1342
+ "reorganising": "reorganizing",
1343
+ "revelled": "reveled",
1344
+ "reveller": "reveler",
1345
+ "revellers": "revelers",
1346
+ "revelling": "reveling",
1347
+ "revitalise": "revitalize",
1348
+ "revitalised": "revitalized",
1349
+ "revitalises": "revitalizes",
1350
+ "revitalising": "revitalizing",
1351
+ "revolutionise": "revolutionize",
1352
+ "revolutionised": "revolutionized",
1353
+ "revolutionises": "revolutionizes",
1354
+ "revolutionising": "revolutionizing",
1355
+ "rhapsodise": "rhapsodize",
1356
+ "rhapsodised": "rhapsodized",
1357
+ "rhapsodises": "rhapsodizes",
1358
+ "rhapsodising": "rhapsodizing",
1359
+ "rigour": "rigor",
1360
+ "rigours": "rigors",
1361
+ "ritualised": "ritualized",
1362
+ "rivalled": "rivaled",
1363
+ "rivalling": "rivaling",
1364
+ "romanticise": "romanticize",
1365
+ "romanticised": "romanticized",
1366
+ "romanticises": "romanticizes",
1367
+ "romanticising": "romanticizing",
1368
+ "rumour": "rumor",
1369
+ "rumoured": "rumored",
1370
+ "rumours": "rumors",
1371
+ "sabre": "saber",
1372
+ "sabres": "sabers",
1373
+ "saltpetre": "saltpeter",
1374
+ "sanitise": "sanitize",
1375
+ "sanitised": "sanitized",
1376
+ "sanitises": "sanitizes",
1377
+ "sanitising": "sanitizing",
1378
+ "satirise": "satirize",
1379
+ "satirised": "satirized",
1380
+ "satirises": "satirizes",
1381
+ "satirising": "satirizing",
1382
+ "saviour": "savior",
1383
+ "saviours": "saviors",
1384
+ "savour": "savor",
1385
+ "savoured": "savored",
1386
+ "savouries": "savories",
1387
+ "savouring": "savoring",
1388
+ "savours": "savors",
1389
+ "savoury": "savory",
1390
+ "scandalise": "scandalize",
1391
+ "scandalised": "scandalized",
1392
+ "scandalises": "scandalizes",
1393
+ "scandalising": "scandalizing",
1394
+ "sceptic": "skeptic",
1395
+ "sceptical": "skeptical",
1396
+ "sceptically": "skeptically",
1397
+ "scepticism": "skepticism",
1398
+ "sceptics": "skeptics",
1399
+ "sceptre": "scepter",
1400
+ "sceptres": "scepters",
1401
+ "scrutinise": "scrutinize",
1402
+ "scrutinised": "scrutinized",
1403
+ "scrutinises": "scrutinizes",
1404
+ "scrutinising": "scrutinizing",
1405
+ "secularisation": "secularization",
1406
+ "secularise": "secularize",
1407
+ "secularised": "secularized",
1408
+ "secularises": "secularizes",
1409
+ "secularising": "secularizing",
1410
+ "sensationalise": "sensationalize",
1411
+ "sensationalised": "sensationalized",
1412
+ "sensationalises": "sensationalizes",
1413
+ "sensationalising": "sensationalizing",
1414
+ "sensitise": "sensitize",
1415
+ "sensitised": "sensitized",
1416
+ "sensitises": "sensitizes",
1417
+ "sensitising": "sensitizing",
1418
+ "sentimentalise": "sentimentalize",
1419
+ "sentimentalised": "sentimentalized",
1420
+ "sentimentalises": "sentimentalizes",
1421
+ "sentimentalising": "sentimentalizing",
1422
+ "sepulchre": "sepulcher",
1423
+ "sepulchres": "sepulchers",
1424
+ "serialisation": "serialization",
1425
+ "serialisations": "serializations",
1426
+ "serialise": "serialize",
1427
+ "serialised": "serialized",
1428
+ "serialises": "serializes",
1429
+ "serialising": "serializing",
1430
+ "sermonise": "sermonize",
1431
+ "sermonised": "sermonized",
1432
+ "sermonises": "sermonizes",
1433
+ "sermonising": "sermonizing",
1434
+ "sheikh": "sheik",
1435
+ "shovelled": "shoveled",
1436
+ "shovelling": "shoveling",
1437
+ "shrivelled": "shriveled",
1438
+ "shrivelling": "shriveling",
1439
+ "signalise": "signalize",
1440
+ "signalised": "signalized",
1441
+ "signalises": "signalizes",
1442
+ "signalising": "signalizing",
1443
+ "signalled": "signaled",
1444
+ "signalling": "signaling",
1445
+ "smoulder": "smolder",
1446
+ "smouldered": "smoldered",
1447
+ "smouldering": "smoldering",
1448
+ "smoulders": "smolders",
1449
+ "snivelled": "sniveled",
1450
+ "snivelling": "sniveling",
1451
+ "snorkelled": "snorkeled",
1452
+ "snorkelling": "snorkeling",
1453
+ "snowplough": "snowplow",
1454
+ "snowploughs": "snowplows",
1455
+ "socialisation": "socialization",
1456
+ "socialise": "socialize",
1457
+ "socialised": "socialized",
1458
+ "socialises": "socializes",
1459
+ "socialising": "socializing",
1460
+ "sodomise": "sodomize",
1461
+ "sodomised": "sodomized",
1462
+ "sodomises": "sodomizes",
1463
+ "sodomising": "sodomizing",
1464
+ "solemnise": "solemnize",
1465
+ "solemnised": "solemnized",
1466
+ "solemnises": "solemnizes",
1467
+ "solemnising": "solemnizing",
1468
+ "sombre": "somber",
1469
+ "specialisation": "specialization",
1470
+ "specialisations": "specializations",
1471
+ "specialise": "specialize",
1472
+ "specialised": "specialized",
1473
+ "specialises": "specializes",
1474
+ "specialising": "specializing",
1475
+ "spectre": "specter",
1476
+ "spectres": "specters",
1477
+ "spiralled": "spiraled",
1478
+ "spiralling": "spiraling",
1479
+ "splendour": "splendor",
1480
+ "splendours": "splendors",
1481
+ "squirrelled": "squirreled",
1482
+ "squirrelling": "squirreling",
1483
+ "stabilisation": "stabilization",
1484
+ "stabilise": "stabilize",
1485
+ "stabilised": "stabilized",
1486
+ "stabiliser": "stabilizer",
1487
+ "stabilisers": "stabilizers",
1488
+ "stabilises": "stabilizes",
1489
+ "stabilising": "stabilizing",
1490
+ "standardisation": "standardization",
1491
+ "standardise": "standardize",
1492
+ "standardised": "standardized",
1493
+ "standardises": "standardizes",
1494
+ "standardising": "standardizing",
1495
+ "stencilled": "stenciled",
1496
+ "stencilling": "stenciling",
1497
+ "sterilisation": "sterilization",
1498
+ "sterilisations": "sterilizations",
1499
+ "sterilise": "sterilize",
1500
+ "sterilised": "sterilized",
1501
+ "steriliser": "sterilizer",
1502
+ "sterilisers": "sterilizers",
1503
+ "sterilises": "sterilizes",
1504
+ "sterilising": "sterilizing",
1505
+ "stigmatisation": "stigmatization",
1506
+ "stigmatise": "stigmatize",
1507
+ "stigmatised": "stigmatized",
1508
+ "stigmatises": "stigmatizes",
1509
+ "stigmatising": "stigmatizing",
1510
+ "storey": "story",
1511
+ "storeys": "stories",
1512
+ "subsidisation": "subsidization",
1513
+ "subsidise": "subsidize",
1514
+ "subsidised": "subsidized",
1515
+ "subsidiser": "subsidizer",
1516
+ "subsidisers": "subsidizers",
1517
+ "subsidises": "subsidizes",
1518
+ "subsidising": "subsidizing",
1519
+ "succour": "succor",
1520
+ "succoured": "succored",
1521
+ "succouring": "succoring",
1522
+ "succours": "succors",
1523
+ "sulphate": "sulfate",
1524
+ "sulphates": "sulfates",
1525
+ "sulphide": "sulfide",
1526
+ "sulphides": "sulfides",
1527
+ "sulphur": "sulfur",
1528
+ "sulphurous": "sulfurous",
1529
+ "summarise": "summarize",
1530
+ "summarised": "summarized",
1531
+ "summarises": "summarizes",
1532
+ "summarising": "summarizing",
1533
+ "swivelled": "swiveled",
1534
+ "swivelling": "swiveling",
1535
+ "symbolise": "symbolize",
1536
+ "symbolised": "symbolized",
1537
+ "symbolises": "symbolizes",
1538
+ "symbolising": "symbolizing",
1539
+ "sympathise": "sympathize",
1540
+ "sympathised": "sympathized",
1541
+ "sympathiser": "sympathizer",
1542
+ "sympathisers": "sympathizers",
1543
+ "sympathises": "sympathizes",
1544
+ "sympathising": "sympathizing",
1545
+ "synchronisation": "synchronization",
1546
+ "synchronise": "synchronize",
1547
+ "synchronised": "synchronized",
1548
+ "synchronises": "synchronizes",
1549
+ "synchronising": "synchronizing",
1550
+ "synthesise": "synthesize",
1551
+ "synthesised": "synthesized",
1552
+ "synthesiser": "synthesizer",
1553
+ "synthesisers": "synthesizers",
1554
+ "synthesises": "synthesizes",
1555
+ "synthesising": "synthesizing",
1556
+ "syphon": "siphon",
1557
+ "syphoned": "siphoned",
1558
+ "syphoning": "siphoning",
1559
+ "syphons": "siphons",
1560
+ "systematisation": "systematization",
1561
+ "systematise": "systematize",
1562
+ "systematised": "systematized",
1563
+ "systematises": "systematizes",
1564
+ "systematising": "systematizing",
1565
+ "tantalise": "tantalize",
1566
+ "tantalised": "tantalized",
1567
+ "tantalises": "tantalizes",
1568
+ "tantalising": "tantalizing",
1569
+ "tantalisingly": "tantalizingly",
1570
+ "tasselled": "tasseled",
1571
+ "technicolour": "technicolor",
1572
+ "temporise": "temporize",
1573
+ "temporised": "temporized",
1574
+ "temporises": "temporizes",
1575
+ "temporising": "temporizing",
1576
+ "tenderise": "tenderize",
1577
+ "tenderised": "tenderized",
1578
+ "tenderises": "tenderizes",
1579
+ "tenderising": "tenderizing",
1580
+ "terrorise": "terrorize",
1581
+ "terrorised": "terrorized",
1582
+ "terrorises": "terrorizes",
1583
+ "terrorising": "terrorizing",
1584
+ "theatre": "theater",
1585
+ "theatregoer": "theatergoer",
1586
+ "theatregoers": "theatergoers",
1587
+ "theatres": "theaters",
1588
+ "theorise": "theorize",
1589
+ "theorised": "theorized",
1590
+ "theorises": "theorizes",
1591
+ "theorising": "theorizing",
1592
+ "tonne": "ton",
1593
+ "tonnes": "tons",
1594
+ "towelled": "toweled",
1595
+ "towelling": "toweling",
1596
+ "toxaemia": "toxemia",
1597
+ "tranquillise": "tranquilize",
1598
+ "tranquillised": "tranquilized",
1599
+ "tranquilliser": "tranquilizer",
1600
+ "tranquillisers": "tranquilizers",
1601
+ "tranquillises": "tranquilizes",
1602
+ "tranquillising": "tranquilizing",
1603
+ "tranquillity": "tranquility",
1604
+ "tranquillize": "tranquilize",
1605
+ "tranquillized": "tranquilized",
1606
+ "tranquillizer": "tranquilizer",
1607
+ "tranquillizers": "tranquilizers",
1608
+ "tranquillizes": "tranquilizes",
1609
+ "tranquillizing": "tranquilizing",
1610
+ "tranquilly": "tranquility",
1611
+ "transistorised": "transistorized",
1612
+ "traumatise": "traumatize",
1613
+ "traumatised": "traumatized",
1614
+ "traumatises": "traumatizes",
1615
+ "traumatising": "traumatizing",
1616
+ "travelled": "traveled",
1617
+ "traveller": "traveler",
1618
+ "travellers": "travelers",
1619
+ "travelling": "traveling",
1620
+ "travelog": "travelogue",
1621
+ "travelogs": "travelogues",
1622
+ "trialled": "trialed",
1623
+ "trialling": "trialing",
1624
+ "tricolour": "tricolor",
1625
+ "tricolours": "tricolors",
1626
+ "trivialise": "trivialize",
1627
+ "trivialised": "trivialized",
1628
+ "trivialises": "trivializes",
1629
+ "trivialising": "trivializing",
1630
+ "tumour": "tumor",
1631
+ "tumours": "tumors",
1632
+ "tunnelled": "tunneled",
1633
+ "tunnelling": "tunneling",
1634
+ "tyrannise": "tyrannize",
1635
+ "tyrannised": "tyrannized",
1636
+ "tyrannises": "tyrannizes",
1637
+ "tyrannising": "tyrannizing",
1638
+ "tyre": "tire",
1639
+ "tyres": "tires",
1640
+ "unauthorised": "unauthorized",
1641
+ "uncivilised": "uncivilized",
1642
+ "underutilised": "underutilized",
1643
+ "unequalled": "unequaled",
1644
+ "unfavourable": "unfavorable",
1645
+ "unfavourably": "unfavorably",
1646
+ "unionisation": "unionization",
1647
+ "unionise": "unionize",
1648
+ "unionised": "unionized",
1649
+ "unionises": "unionizes",
1650
+ "unionising": "unionizing",
1651
+ "unorganised": "unorganized",
1652
+ "unravelled": "unraveled",
1653
+ "unravelling": "unraveling",
1654
+ "unrecognisable": "unrecognizable",
1655
+ "unrecognised": "unrecognized",
1656
+ "unrivalled": "unrivaled",
1657
+ "unsavoury": "unsavory",
1658
+ "untrammelled": "untrammeled",
1659
+ "urbanisation": "urbanization",
1660
+ "urbanise": "urbanize",
1661
+ "urbanised": "urbanized",
1662
+ "urbanises": "urbanizes",
1663
+ "urbanising": "urbanizing",
1664
+ "utilisable": "utilizable",
1665
+ "utilisation": "utilization",
1666
+ "utilise": "utilize",
1667
+ "utilised": "utilized",
1668
+ "utilises": "utilizes",
1669
+ "utilising": "utilizing",
1670
+ "valour": "valor",
1671
+ "vandalise": "vandalize",
1672
+ "vandalised": "vandalized",
1673
+ "vandalises": "vandalizes",
1674
+ "vandalising": "vandalizing",
1675
+ "vaporisation": "vaporization",
1676
+ "vaporise": "vaporize",
1677
+ "vaporised": "vaporized",
1678
+ "vaporises": "vaporizes",
1679
+ "vaporising": "vaporizing",
1680
+ "vapour": "vapor",
1681
+ "vapours": "vapors",
1682
+ "verbalise": "verbalize",
1683
+ "verbalised": "verbalized",
1684
+ "verbalises": "verbalizes",
1685
+ "verbalising": "verbalizing",
1686
+ "victimisation": "victimization",
1687
+ "victimise": "victimize",
1688
+ "victimised": "victimized",
1689
+ "victimises": "victimizes",
1690
+ "victimising": "victimizing",
1691
+ "videodisc": "videodisk",
1692
+ "videodiscs": "videodisks",
1693
+ "vigour": "vigor",
1694
+ "visualisation": "visualization",
1695
+ "visualisations": "visualizations",
1696
+ "visualise": "visualize",
1697
+ "visualised": "visualized",
1698
+ "visualises": "visualizes",
1699
+ "visualising": "visualizing",
1700
+ "vocalisation": "vocalization",
1701
+ "vocalisations": "vocalizations",
1702
+ "vocalise": "vocalize",
1703
+ "vocalised": "vocalized",
1704
+ "vocalises": "vocalizes",
1705
+ "vocalising": "vocalizing",
1706
+ "vulcanised": "vulcanized",
1707
+ "vulgarisation": "vulgarization",
1708
+ "vulgarise": "vulgarize",
1709
+ "vulgarised": "vulgarized",
1710
+ "vulgarises": "vulgarizes",
1711
+ "vulgarising": "vulgarizing",
1712
+ "waggon": "wagon",
1713
+ "waggons": "wagons",
1714
+ "watercolour": "watercolor",
1715
+ "watercolours": "watercolors",
1716
+ "weaselled": "weaseled",
1717
+ "weaselling": "weaseling",
1718
+ "westernisation": "westernization",
1719
+ "westernise": "westernize",
1720
+ "westernised": "westernized",
1721
+ "westernises": "westernizes",
1722
+ "westernising": "westernizing",
1723
+ "womanise": "womanize",
1724
+ "womanised": "womanized",
1725
+ "womaniser": "womanizer",
1726
+ "womanisers": "womanizers",
1727
+ "womanises": "womanizes",
1728
+ "womanising": "womanizing",
1729
+ "woollen": "woolen",
1730
+ "woollens": "woolens",
1731
+ "woollies": "woolies",
1732
+ "woolly": "wooly",
1733
+ "worshipped": "worshiped",
1734
+ "worshipper": "worshiper",
1735
+ "worshipping": "worshiping",
1736
+ "yodelled": "yodeled",
1737
+ "yodelling": "yodeling",
1738
+ "yoghourt": "yogurt",
1739
+ "yoghourts": "yogurts",
1740
+ "yoghurt": "yogurt",
1741
+ "yoghurts": "yogurts"
1742
+ }
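Note on the file above: the normalizer.json entries map British spellings (plus a few filler variants such as "mm" and "mhm") to a single American form, so that spelling differences are not counted as word errors at evaluation time. Below is a minimal sketch of how such a table could be applied, assuming a whitespace-split, word-by-word lookup. The name `normalize_spelling` and the three-entry `SPELLING_MAP` are illustrative and not taken from this commit.

```python
# Illustrative subset of the mapping; the committed file has roughly 1,740 entries.
SPELLING_MAP = {
    "neighbours": "neighbors",
    "organised": "organized",
    "programme": "program",
}


def normalize_spelling(text: str, mapping: dict = SPELLING_MAP) -> str:
    """Replace every whitespace-separated word that appears in the mapping."""
    return " ".join(mapping.get(word, word) for word in text.split())


print(normalize_spelling("the neighbours organised a programme"))
```

The lookup in this sketch is case-sensitive and does not strip punctuation, so text would need to be lowercased and cleaned before the mapping is applied.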
preprocessor_config.json ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3f005a4dad4989a1c813c1ead90f882357b184d9a25c972b814e45f94bd255da
+ size 151098921
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ torch>=1.7
+ torchaudio
+ git+https://github.com/huggingface/transformers
+ git+https://github.com/huggingface/datasets
+ librosa
+ jiwer
+ evaluate>=0.3.0
+ more-itertools
+ tensorboard
run.sh ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ python run_speech_recognition_seq2seq_streaming.py \
2
+ --model_name_or_path="openai/whisper-tiny" \
3
+ --dataset_name="mozilla-foundation/common_voice_11_0" \
4
+ --dataset_config_name="bn" \
5
+ --language="bengali" \
6
+ --train_split_name="train+validation" \
7
+ --eval_split_name="test" \
8
+ --model_index_name="Whisper Tiny Bengali" \
9
+ --output_dir="./" \
10
+ --overwrite_output_dir \
11
+ --max_steps="5000" \
12
+ --per_device_train_batch_size="16" \
13
+ --per_device_eval_batch_size="8" \
14
+ --gradient_accumulation_steps="1" \
15
+ --gradient_checkpointing="False" \
16
+ --evaluation_strategy="steps" \
17
+ --eval_steps="1000" \
18
+ --save_strategy="steps" \
19
+ --save_steps="1000" \
20
+ --save_total_limit="5" \
21
+ --learning_rate="1e-5" \
22
+ --warmup_steps="500" \
23
+ --logging_steps="25" \
24
+ --weight_decay="0.01" \
25
+ --load_best_model_at_end="True" \
26
+ --metric_for_best_model="wer" \
27
+ --greater_is_better="False" \
28
+ --fp16="True" \
29
+ --tf32="True" \
30
+ --streaming="False" \
31
+ --generation_max_length="225" \
32
+ --length_column_name="input_length" \
33
+ --max_duration_in_seconds="30" \
34
+ --text_column_name="sentence" \
35
+ --freeze_feature_encoder="False" \
36
+ --report_to="tensorboard" \
37
+ --do_train \
38
+ --do_eval \
39
+ --predict_with_generate \
40
+ --do_normalize_eval \
41
+ --use_auth_token \
42
+ --push_to_hub
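To make the hyperparameters in run.sh more concrete, the sketch below works out the training budget and the learning rate at a given step. It assumes a single device and the Trainer's default linear schedule (run.sh does not set `--lr_scheduler_type`); the helper name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, max_steps=5000):
    """Linear warmup to `peak_lr`, then linear decay to zero at `max_steps`."""
    # Assumption: this mirrors a linear warmup/decay schedule; the defaults are
    # the --learning_rate, --warmup_steps and --max_steps values from run.sh.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


per_device_train_batch_size = 16
gradient_accumulation_steps = 1
max_steps = 5000
save_steps = 1000

# Training examples consumed per device over the whole run.
samples_per_device = per_device_train_batch_size * gradient_accumulation_steps * max_steps
# Number of checkpoints written; it equals --save_total_limit="5", so none is rotated out.
checkpoints_written = max_steps // save_steps

print(samples_per_device, checkpoints_written)
for step in (0, 250, 500, 2750, 5000):
    print(step, linear_schedule_lr(step))
```

With these settings, warmup covers the first 10% of training, and the commit message "Training in progress, step 1000" corresponds to the first of the five save points.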
run_eval_whisper_streaming.py ADDED
@@ -0,0 +1,162 @@
1
+ import argparse
2
+
3
+ from transformers import pipeline
4
+ from transformers.models.whisper.english_normalizer import BasicTextNormalizer
5
+ from datasets import load_dataset, Audio
6
+ import evaluate
7
+
8
+ wer_metric = evaluate.load("wer")
9
+
10
+
11
+ def is_target_text_in_range(ref):
12
+ if ref.strip() == "ignore time segment in scoring":
13
+ return False
14
+ else:
15
+ return ref.strip() != ""
16
+
17
+
18
+ def get_text(sample):
19
+ if "text" in sample:
20
+ return sample["text"]
21
+ elif "sentence" in sample:
22
+ return sample["sentence"]
23
+ elif "normalized_text" in sample:
24
+ return sample["normalized_text"]
25
+ elif "transcript" in sample:
26
+ return sample["transcript"]
27
+ elif "transcription" in sample:
28
+ return sample["transcription"]
29
+ else:
30
+ raise ValueError(
31
+ "Expected transcript column of either 'text', 'sentence', 'normalized_text', 'transcript' or 'transcription'. "
32
+ f"Got sample with columns {', '.join(sample.keys())}. Ensure a text column name is present in the dataset."
33
+ )
34
+
35
+
36
+ whisper_norm = BasicTextNormalizer()
37
+
38
+
39
+ def normalise(batch):
40
+ batch["norm_text"] = whisper_norm(get_text(batch))
41
+ return batch
42
+
43
+
44
+ def data(dataset):
45
+ for item in dataset:
46
+ yield {**item["audio"], "reference": item["norm_text"]}
47
+
48
+
49
+ def main(args):
50
+ batch_size = args.batch_size
51
+ whisper_asr = pipeline(
52
+ "automatic-speech-recognition", model=args.model_id, device=args.device
53
+ )
54
+
55
+ whisper_asr.model.config.forced_decoder_ids = (
56
+ whisper_asr.tokenizer.get_decoder_prompt_ids(
57
+ language=args.language, task="transcribe"
58
+ )
59
+ )
60
+
61
+ dataset = load_dataset(
62
+ args.dataset,
63
+ args.config,
64
+ split=args.split,
65
+ streaming=args.streaming,
66
+ use_auth_token=True,
67
+ )
68
+
69
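+ # Optionally evaluate on a subset only (useful for debugging); skipped when --max_eval_samples is not set
70
+ dataset = dataset.take(args.max_eval_samples) if args.max_eval_samples is not None else dataset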
71
+
72
+ dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
73
+ dataset = dataset.map(normalise)
74
+ dataset = dataset.filter(is_target_text_in_range, input_columns=["norm_text"])
75
+
76
+ predictions = []
77
+ references = []
78
+
79
+ # run streamed inference
80
+ for out in whisper_asr(data(dataset), batch_size=batch_size):
81
+ predictions.append(whisper_norm(out["text"]))
82
+ references.append(out["reference"][0])
83
+
84
+ wer = wer_metric.compute(references=references, predictions=predictions)
85
+ wer = round(100 * wer, 2)
86
+
87
+ print("WER:", wer)
88
+ evaluate.push_to_hub(
89
+ model_id=args.model_id,
90
+ metric_value=wer,
91
+ metric_type="wer",
92
+ metric_name="WER",
93
+ dataset_name=args.dataset,
94
+ dataset_type=args.dataset,
95
+ dataset_split=args.split,
96
+ dataset_config=args.config,
97
+ task_type="automatic-speech-recognition",
98
+ task_name="Automatic Speech Recognition"
99
+ )
100
+
101
+
102
+ if __name__ == "__main__":
103
+ parser = argparse.ArgumentParser()
104
+
105
+ parser.add_argument(
106
+ "--model_id",
107
+ type=str,
108
+ required=True,
109
+ help="Model identifier. Should be loadable with 🤗 Transformers",
110
+ )
111
+ parser.add_argument(
112
+ "--dataset",
113
+ type=str,
114
+ default="mozilla-foundation/common_voice_11_0",
115
+ help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets",
116
+ )
117
+ parser.add_argument(
118
+ "--config",
119
+ type=str,
120
+ required=True,
121
+ help="Config of the dataset. *E.g.* `'en'` for the English split of Common Voice",
122
+ )
123
+ parser.add_argument(
124
+ "--split",
125
+ type=str,
126
+ default="test",
127
+ help="Split of the dataset. *E.g.* `'test'`",
128
+ )
129
+
130
+ parser.add_argument(
131
+ "--device",
132
+ type=int,
133
+ default=-1,
134
+ help="The device to run the pipeline on. -1 for CPU (default), 0 for the first GPU and so on.",
135
+ )
136
+ parser.add_argument(
137
+ "--batch_size",
138
+ type=int,
139
+ default=16,
140
+ help="Number of samples to go through each streamed batch.",
141
+ )
142
+ parser.add_argument(
143
+ "--max_eval_samples",
144
+ type=int,
145
+ default=None,
146
+ help="Number of samples to be evaluated. Put a lower number e.g. 64 for testing this script.",
147
+ )
148
+ parser.add_argument(
149
+ "--streaming",
150
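+ type=lambda value: str(value).lower() in ("true", "1", "yes"),  # type=bool would turn any non-empty string, even "False", into True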
151
+ default=True,
152
+ help="Choose whether you'd like to download the entire dataset or stream it during the evaluation.",
153
+ )
154
+ parser.add_argument(
155
+ "--language",
156
+ type=str,
157
+ required=True,
158
+ help="Two letter language code for the transcription language, e.g. use 'en' for English.",
159
+ )
160
+ args = parser.parse_args()
161
+
162
+ main(args)
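The evaluation script reports a word error rate (WER) through `evaluate.load("wer")`. The sketch below is a plain-Python illustration of what that number measures: word-level edit distance summed over all samples, divided by the total number of reference words. The function names are hypothetical, and this is not the implementation used by the `evaluate` library.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, insertions and deletions (Levenshtein)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # reference word missing from the hypothesis
                    current[j - 1] + 1,  # extra word in the hypothesis
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(
        edit_distance(ref.split(), hyp.split())
        for ref, hyp in zip(references, predictions)
    )
    total_words = sum(len(ref.split()) for ref in references)
    return errors / total_words


wer = word_error_rate(["the cat sat", "hello world"], ["the cat sit", "hello world"])
print(round(100 * wer, 2))  # 20.0
```

Because empty references contribute no words to the denominator, the script filters them out with `is_target_text_in_range` before scoring.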
run_speech_recognition_seq2seq_streaming.py ADDED
@@ -0,0 +1,635 @@
1
+ #!/usr/bin/env python
2
+ # coding=utf-8
3
+ # Copyright 2022 The HuggingFace Team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for sequence to sequence speech recognition
18
+ with 🤗 Datasets' streaming mode.
19
+ """
20
+ # You can also adapt this script for your own sequence to sequence speech
21
+ # recognition task. Pointers for this are left as comments.
22
+
23
+ import logging
24
+ import os
25
+ import sys
26
+ from dataclasses import dataclass, field
27
+ from typing import Any, Dict, List, Optional, Union
28
+
29
+ import datasets
30
+ import torch
31
+ from datasets import DatasetDict, IterableDatasetDict, interleave_datasets, load_dataset
32
+ from torch.utils.data import IterableDataset
33
+
34
+ abs_path = os.path.abspath('.')
35
+ base_dir = os.path.dirname(abs_path)
36
+ os.environ['TRANSFORMERS_CACHE'] = os.path.join(base_dir, 'models_cache')
37
+ os.environ['HF_DATASETS_CACHE'] = os.path.join(base_dir, 'datasets_cache')
38
+
39
+
40
+ import evaluate
41
+ import transformers
42
+ from transformers import (
43
+ AutoConfig,
44
+ AutoFeatureExtractor,
45
+ AutoModelForSpeechSeq2Seq,
46
+ AutoProcessor,
47
+ AutoTokenizer,
48
+ HfArgumentParser,
49
+ Seq2SeqTrainer,
50
+ Seq2SeqTrainingArguments,
51
+ TrainerCallback,
52
+ set_seed,
53
+ )
54
+ from transformers.models.whisper.english_normalizer import BasicTextNormalizer
55
+ from transformers.trainer_pt_utils import IterableDatasetShard
56
+ from transformers.trainer_utils import get_last_checkpoint, is_main_process
57
+ from transformers.utils import check_min_version, send_example_telemetry
58
+ from transformers.utils.versions import require_version
59
+
60
+
61
+ # Will error if the minimal version of Transformers is not installed. Remove at your own risk.
62
+ check_min_version("4.25.0.dev0")
63
+
64
+ require_version("datasets>=1.18.2", "To fix: pip install -r examples/pytorch/speech-recognition/requirements.txt")
65
+
66
+ logger = logging.getLogger(__name__)
67
+
68
+
69
+ @dataclass
70
+ class ModelArguments:
71
+ """
72
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
73
+ """
74
+
75
+ model_name_or_path: str = field(
76
+ metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
77
+ )
78
+ config_name: Optional[str] = field(
79
+ default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
80
+ )
81
+ tokenizer_name: Optional[str] = field(
82
+ default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
83
+ )
84
+ feature_extractor_name: Optional[str] = field(
85
+ default=None, metadata={"help": "feature extractor name or path if not the same as model_name"}
86
+ )
87
+ cache_dir: Optional[str] = field(
88
+ default=None,
89
+ metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
90
+ )
91
+ use_fast_tokenizer: bool = field(
92
+ default=True,
93
+ metadata={"help": "Whether to use a fast tokenizer (backed by the tokenizers library) or not."},
94
+ )
95
+ model_revision: str = field(
96
+ default="main",
97
+ metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
98
+ )
99
+ use_auth_token: bool = field(
100
+ default=False,
101
+ metadata={
102
+ "help": (
103
+ "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
104
+ "with private models)."
105
+ )
106
+ },
107
+ )
108
+ freeze_feature_encoder: bool = field(
109
+ default=True, metadata={"help": "Whether to freeze the feature encoder layers of the model."}
110
+ )
111
+ freeze_encoder: bool = field(
112
+ default=False, metadata={"help": "Whether to freeze the entire encoder of the seq2seq model."}
113
+ )
114
+ forced_decoder_ids: List[List[int]] = field(
115
+ default=None,
116
+ metadata={
117
+ "help": (
118
+ "A list of pairs of integers which indicates a mapping from generation indices to token indices "
119
+ "that will be forced before sampling. For example, [[0, 123]] means the first generated token "
120
+ "will always be a token of index 123."
121
+ )
122
+ },
123
+ )
124
+ suppress_tokens: List[int] = field(
125
+ default=None, metadata={"help": "A list of tokens that will be suppressed at generation."}
126
+ )
127
+ model_index_name: str = field(default=None, metadata={"help": "Pretty name for the model card."})
128
+
129
+
130
+ @dataclass
131
+ class DataTrainingArguments:
132
+ """
133
+ Arguments pertaining to the data we are going to feed to our model for training and eval.
134
+ """
135
+
136
+ dataset_name: str = field(
137
+ default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
138
+ )
139
+ dataset_config_name: Optional[str] = field(
140
+ default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
141
+ )
142
+ text_column: Optional[str] = field(
143
+ default=None,
144
+ metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
145
+ )
146
+ max_train_samples: Optional[int] = field(
147
+ default=None,
148
+ metadata={
149
+ "help": (
150
+ "For debugging purposes or quicker training, truncate the number of training examples to this "
151
+ "value if set."
152
+ )
153
+ },
154
+ )
155
+ max_eval_samples: Optional[int] = field(
156
+ default=None,
157
+ metadata={
158
+ "help": (
159
+ "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
160
+ "value if set."
161
+ )
162
+ },
163
+ )
164
+ audio_column_name: str = field(
165
+ default="audio",
166
+ metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
167
+ )
168
+ text_column_name: str = field(
169
+ default="text",
170
+ metadata={"help": "The name of the dataset column containing the text data. Defaults to 'text'"},
171
+ )
172
+ max_duration_in_seconds: float = field(
173
+ default=20.0,
174
+ metadata={
175
+ "help": (
176
+ "Filter out audio files that are longer than `max_duration_in_seconds` seconds"
177
+ " (they are dropped from the training set, not truncated)."
178
+ )
179
+ },
180
+ )
181
+ min_duration_in_seconds: float = field(
182
+ default=0.0, metadata={"help": "Filter audio files that are shorter than `min_duration_in_seconds` seconds"}
183
+ )
184
+ train_split_name: str = field(
185
+ default="train",
186
+ metadata={
187
+ "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
188
+ },
189
+ )
190
+ eval_split_name: str = field(
191
+ default="test",
192
+ metadata={
193
+ "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
194
+ },
195
+ )
196
+ do_lower_case: bool = field(
197
+ default=False,
198
+ metadata={"help": "Whether the target text should be lower cased."},
199
+ )
200
+ do_remove_punctuation: bool = field(
201
+ default=False,
202
+ metadata={"help": "Whether the target text should be stripped of punctuation."},
203
+ )
204
+ do_normalize_eval: bool = field(
205
+ default=True,
206
+ metadata={"help": "Whether to normalise the references and predictions in the eval WER calculation."},
207
+ )
208
+ language: str = field(
209
+ default=None,
210
+ metadata={
211
+ "help": (
212
+ "Language for multilingual fine-tuning. This argument should be set for multilingual fine-tuning "
213
+ "only. For English speech recognition, it should be set to `None`."
214
+ )
215
+ },
216
+ )
217
+ task: str = field(
218
+ default="transcribe",
219
+ metadata={"help": "Task, either `transcribe` for speech recognition or `translate` for speech translation."},
220
+ )
221
+ shuffle_buffer_size: Optional[int] = field(
222
+ default=500,
223
+ metadata={
224
+ "help": (
225
+ "The number of streamed examples to download before shuffling them. The large the buffer, "
226
+ "the closer it is to real offline shuffling."
227
+ )
228
+ },
229
+ )
230
+ streaming: bool = field(
231
+ default=True,
232
+ metadata={"help": "Whether to use streaming mode to load and pre-process the data."},
233
+ )
234
+
235
+
236
+ @dataclass
237
+ class DataCollatorSpeechSeq2SeqWithPadding:
238
+ """
239
+ Data collator that will dynamically pad the inputs received.
240
+ Args:
241
+ processor ([`WhisperProcessor`])
242
+ The processor used for processing the data.
243
+ decoder_start_token_id (`int`)
244
+ The begin-of-sentence token id of the decoder.
245
+ """
246
+
247
+ processor: Any
248
+ decoder_start_token_id: int
249
+
250
+ def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
251
+ # split inputs and labels since they have to be of different lengths and need
252
+ # different padding methods
253
+ model_input_name = self.processor.model_input_names[0]
254
+ input_features = [{model_input_name: feature[model_input_name]} for feature in features]
255
+ label_features = [{"input_ids": feature["labels"]} for feature in features]
256
+
257
+ batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
258
+
259
+ labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
260
+
261
+ # replace padding with -100 to ignore loss correctly
262
+ labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
263
+
264
+ # if bos token is appended in previous tokenization step,
265
+ # cut bos token here as it's appended later anyway
266
+ if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
267
+ labels = labels[:, 1:]
268
+
269
+ batch["labels"] = labels
270
+
271
+ return batch
272
+
273
+
274
+ def load_maybe_streaming_dataset(dataset_name, dataset_config_name, split="train", streaming=True, **kwargs):
275
+ """
276
+ Utility function to load a dataset in streaming mode. For datasets with multiple splits,
277
+ each split is loaded individually and the splits are then combined by taking alternating examples from
278
+ each (interleaving).
279
+ """
280
+ if "+" in split:
281
+ # load multiple splits separated by the `+` symbol with streaming mode
282
+ dataset_splits = [
283
+ load_dataset(dataset_name, dataset_config_name, split=split_name, streaming=streaming, **kwargs)
284
+ for split_name in split.split("+")
285
+ ]
286
+ # interleave multiple splits to form one dataset
287
+ interleaved_dataset = interleave_datasets(dataset_splits)
288
+ return interleaved_dataset
289
+ else:
290
+ # load a single split *with* streaming mode
291
+ dataset = load_dataset(dataset_name, dataset_config_name, split=split, streaming=streaming, **kwargs)
292
+ return dataset
293
+
294
+
295
+ def main():
296
+ # 1. Parse input arguments
297
+ # See all possible arguments in src/transformers/training_args.py
298
+ # or by passing the --help flag to this script.
299
+ # We now keep distinct sets of args, for a cleaner separation of concerns.
300
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
301
+
302
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
303
+ # If we pass only one argument to the script and it's the path to a json file,
304
+ # let's parse it to get our arguments.
305
+ model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
306
+ else:
307
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
308
+
309
+ # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The
310
+ # information sent is the one passed as arguments along with your Python/PyTorch versions.
311
+ send_example_telemetry("run_speech_recognition_seq2seq_streaming", model_args, data_args)
312
+
313
+ # 2. Setup logging
314
+ logging.basicConfig(
315
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
316
+ datefmt="%m/%d/%Y %H:%M:%S",
317
+ handlers=[logging.StreamHandler(sys.stdout)],
318
+ )
319
+ log_level = training_args.get_process_log_level()
320
+ logger.setLevel(log_level)
321
+ datasets.utils.logging.set_verbosity(log_level)
322
+ transformers.utils.logging.set_verbosity(log_level)
323
+ transformers.utils.logging.enable_default_handler()
324
+ transformers.utils.logging.enable_explicit_format()
325
+
326
+ logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
327
+
328
+ # Log on each process the small summary:
329
+ logger.warning(
330
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
331
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
332
+ )
333
+ logger.info(f"Training/evaluation parameters {training_args}")
334
+
335
+ # Set the verbosity to info of the Transformers logger (on main process only):
336
+ if is_main_process(training_args.local_rank):
337
+ transformers.utils.logging.set_verbosity_info()
338
+ logger.info("Training/evaluation parameters %s", training_args)
339
+
340
+ # 3. Detect the last checkpoint and, if one exists, resume training from it
341
+ last_checkpoint = None
342
+ if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
343
+ last_checkpoint = get_last_checkpoint(training_args.output_dir)
344
+ if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
345
+ raise ValueError(
346
+ f"Output directory ({training_args.output_dir}) already exists and is not empty. "
347
+ "Use --overwrite_output_dir to overcome."
348
+ )
349
+ elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
350
+ logger.info(
351
+ f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
352
+ "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
353
+ )
354
+
355
+ # Set seed before initializing model.
356
+ set_seed(training_args.seed)
357
+
358
+ # 4. Load dataset
359
+ raw_datasets = IterableDatasetDict() if data_args.streaming else DatasetDict()
360
+
361
+ if training_args.do_train:
362
+ raw_datasets["train"] = load_maybe_streaming_dataset(
363
+ data_args.dataset_name,
364
+ data_args.dataset_config_name,
365
+ split=data_args.train_split_name,
366
+ use_auth_token=True if model_args.use_auth_token else None,
367
+ streaming=data_args.streaming,
368
+ )
369
+
370
+ if training_args.do_eval:
371
+ raw_datasets["eval"] = load_maybe_streaming_dataset(
372
+ data_args.dataset_name,
373
+ data_args.dataset_config_name,
374
+ split=data_args.eval_split_name,
375
+ use_auth_token=True if model_args.use_auth_token else None,
376
+ streaming=data_args.streaming,
377
+ )
378
+
379
+ raw_datasets_features = list(next(iter(raw_datasets.values())).features.keys())
380
+
381
+ if data_args.audio_column_name not in raw_datasets_features:
382
+ raise ValueError(
383
+ f"--audio_column_name '{data_args.audio_column_name}' not found in dataset '{data_args.dataset_name}'. "
384
+ "Make sure to set `--audio_column_name` to the correct audio column - one of "
385
+ f"{', '.join(raw_datasets_features)}."
386
+ )
387
+
388
+ if data_args.text_column_name not in raw_datasets_features:
389
+ raise ValueError(
390
+ f"--text_column_name {data_args.text_column_name} not found in dataset '{data_args.dataset_name}'. "
391
+ "Make sure to set `--text_column_name` to the correct text column - one of "
392
+ f"{', '.join(raw_datasets_features)}."
393
+ )
394
+
395
+ # 5. Load pretrained model, tokenizer, and feature extractor
396
+ #
397
+ # Distributed training:
398
+ # The .from_pretrained methods guarantee that only one local process can concurrently download the model & vocab.
399
+ config = AutoConfig.from_pretrained(
400
+ model_args.config_name if model_args.config_name else model_args.model_name_or_path,
401
+ cache_dir=model_args.cache_dir,
402
+ revision=model_args.model_revision,
403
+ use_auth_token=True if model_args.use_auth_token else None,
404
+ )
405
+
406
+ config.update({"forced_decoder_ids": model_args.forced_decoder_ids, "suppress_tokens": model_args.suppress_tokens})
407
+
408
+ if training_args.gradient_checkpointing:
409
+ config.update({"use_cache": False})
410
+
411
+ feature_extractor = AutoFeatureExtractor.from_pretrained(
412
+ model_args.feature_extractor_name if model_args.feature_extractor_name else model_args.model_name_or_path,
413
+ cache_dir=model_args.cache_dir,
414
+ revision=model_args.model_revision,
415
+ use_auth_token=True if model_args.use_auth_token else None,
416
+ )
417
+ tokenizer = AutoTokenizer.from_pretrained(
418
+ model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
419
+ cache_dir=model_args.cache_dir,
420
+ use_fast=model_args.use_fast_tokenizer,
421
+ revision=model_args.model_revision,
422
+ use_auth_token=True if model_args.use_auth_token else None,
423
+ )
424
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
425
+ model_args.model_name_or_path,
426
+ config=config,
427
+ cache_dir=model_args.cache_dir,
428
+ revision=model_args.model_revision,
429
+ use_auth_token=True if model_args.use_auth_token else None,
430
+ )
431
+
432
+ if model.config.decoder_start_token_id is None:
433
+ raise ValueError("Make sure that `config.decoder_start_token_id` is correctly defined")
434
+
435
+ if model_args.freeze_feature_encoder:
436
+ model.freeze_feature_encoder()
437
+
438
+ if model_args.freeze_encoder:
439
+ model.freeze_encoder()
440
+
441
+ if data_args.language is not None:
442
+ # We only need to set the task id when the language is specified (i.e. in a multilingual setting)
443
+ tokenizer.set_prefix_tokens(language=data_args.language, task=data_args.task)
444
+
445
+ # 6. Resample speech dataset if necessary
446
+ dataset_sampling_rate = next(iter(raw_datasets.values())).features[data_args.audio_column_name].sampling_rate
447
+ if dataset_sampling_rate != feature_extractor.sampling_rate:
448
+ raw_datasets = raw_datasets.cast_column(
449
+ data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate)
450
+ )
451
+
452
+ # 7. Preprocessing the datasets.
453
+ # We need to read the audio files as arrays and tokenize the targets.
454
+ max_input_length = data_args.max_duration_in_seconds * feature_extractor.sampling_rate
455
+ min_input_length = data_args.min_duration_in_seconds * feature_extractor.sampling_rate
456
+ audio_column_name = data_args.audio_column_name
457
+ text_column_name = data_args.text_column_name
458
+ model_input_name = feature_extractor.model_input_names[0]
459
+ do_lower_case = data_args.do_lower_case
460
+ do_remove_punctuation = data_args.do_remove_punctuation
461
+ normalizer = BasicTextNormalizer() # 'official' text normalizer from OpenAI
462
+
463
+ if data_args.max_train_samples is not None:
464
+ raw_datasets["train"] = (
465
+ raw_datasets["train"].take(data_args.max_train_samples)
466
+ if data_args.streaming
467
+ else raw_datasets["train"].select(range(data_args.max_train_samples))
468
+ )
469
+
470
+ if data_args.max_eval_samples is not None:
471
+ raw_datasets["eval"] = (
472
+ raw_datasets["eval"].take(data_args.max_eval_samples)
473
+ if data_args.streaming
474
+ else raw_datasets["eval"].select(range(data_args.max_eval_samples))
475
+ )
476
+
477
+ def prepare_dataset(batch):
478
+ # process audio
479
+ sample = batch[audio_column_name]
480
+ inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
481
+ # process audio length
482
+ batch[model_input_name] = inputs.get(model_input_name)[0]
483
+ batch["input_length"] = len(sample["array"])
484
+
485
+ # process targets
486
+ input_str = batch[text_column_name].lower() if do_lower_case else batch[text_column_name]
487
+ if do_remove_punctuation:
488
+ input_str = normalizer(input_str).strip()
489
+ batch["labels"] = tokenizer(input_str).input_ids
490
+ return batch
491
+
492
+ with training_args.main_process_first(desc="dataset map pre-processing"):
493
+ vectorized_datasets = raw_datasets.map(
494
+ prepare_dataset,
495
+ remove_columns=raw_datasets_features,
496
+ ).with_format("torch")
497
+
498
+ if training_args.do_train and data_args.streaming:
499
+ # manually shuffle if streaming (done by the trainer for non-streaming)
500
+ vectorized_datasets["train"] = vectorized_datasets["train"].shuffle(
501
+ buffer_size=data_args.shuffle_buffer_size,
502
+ seed=training_args.seed,
503
+ )
504
+
505
+ # filter training data that is shorter than min_input_length or longer than
506
+ # max_input_length
507
+ def is_audio_in_length_range(length):
508
+ return min_input_length < length < max_input_length
509
+
510
+ if training_args.do_train:
511
+ vectorized_datasets["train"] = vectorized_datasets["train"].filter(
512
+ is_audio_in_length_range,
513
+ input_columns=["input_length"],
514
+ )
515
+
516
+ # 8. Load Metric
517
+ metric = evaluate.load("wer")
518
+ do_normalize_eval = data_args.do_normalize_eval
519
+
520
+ def compute_metrics(pred):
521
+ pred_ids = pred.predictions
522
+
523
+ pred.label_ids[pred.label_ids == -100] = tokenizer.pad_token_id
524
+
525
+ pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
526
+ # we do not want to group tokens when computing the metrics
527
+ label_str = tokenizer.batch_decode(pred.label_ids, skip_special_tokens=True)
528
+
529
+ if do_normalize_eval:
530
+ pred_str = [normalizer(pred) for pred in pred_str]
531
+ label_str = [normalizer(label) for label in label_str]
532
+ # filtering step to only evaluate the samples that correspond to non-zero references:
533
+ pred_str = [pred_str[i] for i in range(len(pred_str)) if len(label_str[i]) > 0]
534
+ label_str = [label_str[i] for i in range(len(label_str)) if len(label_str[i]) > 0]
535
+
536
+ wer = 100 * metric.compute(predictions=pred_str, references=label_str)
537
+
538
+ return {"wer": wer}
539
+
540
+ # 9. Create a single speech processor
541
+ if is_main_process(training_args.local_rank):
542
+ # save feature extractor, tokenizer and config
543
+ feature_extractor.save_pretrained(training_args.output_dir)
544
+ tokenizer.save_pretrained(training_args.output_dir)
545
+ config.save_pretrained(training_args.output_dir)
546
+
547
+ processor = AutoProcessor.from_pretrained(training_args.output_dir)
548
+
549
+ # 10. Define data collator
550
+ data_collator = DataCollatorSpeechSeq2SeqWithPadding(
551
+ processor=processor,
552
+ decoder_start_token_id=model.config.decoder_start_token_id,
553
+ )
554
+
555
+ # 11. Configure Trainer
556
+ # Trainer callback to reinitialise and reshuffle the streamable datasets at the beginning of each epoch
557
+ # Only required for streaming: Trainer automatically shuffles non-streaming datasets
558
+ class ShuffleCallback(TrainerCallback):
559
+        def on_epoch_begin(self, args, state, control, train_dataloader, **kwargs):
+            if isinstance(train_dataloader.dataset, IterableDatasetShard):
+                pass  # set_epoch() is handled by the Trainer
+            elif isinstance(train_dataloader.dataset, IterableDataset):
+                train_dataloader.dataset.set_epoch(train_dataloader.dataset._epoch + 1)
+
+    # Initialize Trainer
+    trainer = Seq2SeqTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=vectorized_datasets["train"] if training_args.do_train else None,
+        eval_dataset=vectorized_datasets["eval"] if training_args.do_eval else None,
+        tokenizer=feature_extractor,
+        data_collator=data_collator,
+        compute_metrics=compute_metrics if training_args.predict_with_generate else None,
+        callbacks=[ShuffleCallback()] if data_args.streaming else None,
+    )
+
+    # 12. Training
+    if training_args.do_train:
+        checkpoint = None
+        if training_args.resume_from_checkpoint is not None:
+            checkpoint = training_args.resume_from_checkpoint
+        elif last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
+        trainer.save_model()  # Saves the feature extractor too for easy upload
+
+        metrics = train_result.metrics
+        if data_args.max_train_samples:
+            metrics["train_samples"] = data_args.max_train_samples
+        trainer.log_metrics("train", metrics)
+        trainer.save_metrics("train", metrics)
+        trainer.save_state()
+
+    # 13. Evaluation
+    results = {}
+    if training_args.do_eval:
+        logger.info("*** Evaluate ***")
+        metrics = trainer.evaluate(
+            metric_key_prefix="eval",
+            max_length=training_args.generation_max_length,
+            num_beams=training_args.generation_num_beams,
+        )
+        if data_args.max_eval_samples:
+            metrics["eval_samples"] = data_args.max_eval_samples
+
+        trainer.log_metrics("eval", metrics)
+        trainer.save_metrics("eval", metrics)
+
+    # 14. Write Training Stats
+    kwargs = {
+        "finetuned_from": model_args.model_name_or_path,
+        "tasks": "automatic-speech-recognition",
+        "tags": "whisper-event",
+    }
+    if data_args.dataset_name is not None:
+        kwargs["dataset_tags"] = data_args.dataset_name
+        if data_args.dataset_config_name is not None:
+            kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}"
+        else:
+            kwargs["dataset"] = data_args.dataset_name
+        if "common_voice" in data_args.dataset_name:
+            kwargs["language"] = data_args.dataset_config_name.split('-')[0]
+        if model_args.model_index_name is not None:
+            kwargs["model_name"] = model_args.model_index_name
+
+    if training_args.push_to_hub:
+        trainer.push_to_hub(**kwargs)
+    else:
+        trainer.create_model_card(**kwargs)
+
+    return results
+
+
+if __name__ == "__main__":
+    main()
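
Note on step 14 of the script above: the model-card metadata is assembled from the dataset and model arguments. The following is a minimal standalone sketch of that logic, not part of the commit. The function name `build_model_card_kwargs` and its flat parameters are hypothetical stand-ins for `data_args` and `model_args`, and the dataset name in the usage line is only an illustrative value. One deliberate difference from the script: the language code is derived only when a config name is present, since calling `.split('-')` on a missing config name would raise an error.

```python
from typing import Optional


def build_model_card_kwargs(
    model_name_or_path: str,
    dataset_name: Optional[str] = None,
    dataset_config_name: Optional[str] = None,
    model_index_name: Optional[str] = None,
) -> dict:
    """Hypothetical helper mirroring step 14, with a None guard on the config name."""
    kwargs = {
        "finetuned_from": model_name_or_path,
        "tasks": "automatic-speech-recognition",
        "tags": "whisper-event",
    }
    if dataset_name is not None:
        kwargs["dataset_tags"] = dataset_name
        if dataset_config_name is not None:
            kwargs["dataset"] = f"{dataset_name} {dataset_config_name}"
            # Common Voice config names look like "bn" or "zh-CN";
            # keep only the part before the first hyphen as the language.
            if "common_voice" in dataset_name:
                kwargs["language"] = dataset_config_name.split("-")[0]
        else:
            kwargs["dataset"] = dataset_name
    if model_index_name is not None:
        kwargs["model_name"] = model_index_name
    return kwargs


# Illustrative call; the dataset name is an assumed example value.
card = build_model_card_kwargs(
    "openai/whisper-tiny", "mozilla-foundation/common_voice_11_0", "zh-CN"
)
print(card["language"])  # prints "zh"
```

Keeping this logic in a pure function makes it testable without constructing a `Seq2SeqTrainer`; the resulting dictionary could then be passed to `trainer.push_to_hub(**kwargs)` or `trainer.create_model_card(**kwargs)` exactly as the script does.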
runs/Jan18_12-56-26_mamun-desktop/1674026928.3192365/events.out.tfevents.1674026928.mamun-desktop.5642.1 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a752d0c2162d3e9226e27a8d37822d8007f28f7178b485e4bd466dc60f788d8f
+size 5875
runs/Jan18_12-56-26_mamun-desktop/events.out.tfevents.1674026928.mamun-desktop.5642.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:07806849165ddea662c9780f7c0bb11b65c191d3ccc0320ad5600c414d7dc361
+size 4260
runs/Jan18_13-42-16_mamun-desktop/1674028391.1765018/events.out.tfevents.1674028391.mamun-desktop.414182.1 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a63bac39bc5a3d29bca6e7a401e360dafb8813717c9890267a8f395ca7e256e3
+size 5875
runs/Jan18_13-42-16_mamun-desktop/events.out.tfevents.1674028391.mamun-desktop.414182.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:17f067288b429c209aaf2cbd12763ab01c1f021781c474e4068281c93470d1b6
+size 4260
runs/Jan18_13-54-03_mamun-desktop/1674028489.236082/events.out.tfevents.1674028489.mamun-desktop.550378.1 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f211e73303fc1098f02ed7737614676c2e4f3d917186cf2b33a88e1117049567
+size 5875
runs/Jan18_13-54-03_mamun-desktop/events.out.tfevents.1674028489.mamun-desktop.550378.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:51c2b7954ef29f9fe55c7761c5dd2b7827864b827daf0f949452740dcf007367
+size 4258
runs/Jan18_13-55-30_mamun-desktop/1674028577.9689484/events.out.tfevents.1674028577.mamun-desktop.550597.1 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e020f11e8fffdc3cd9e039164e9943a9c4a7ac8e5aef998e3e751c3daaa3f498
+size 5875
runs/Jan18_13-55-30_mamun-desktop/events.out.tfevents.1674028577.mamun-desktop.550597.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dab3d94e5d3f3655f9784d5e92226b2181945d4e9f36a220dc506598a6bd8fec
+size 4258
runs/Jan18_13-56-56_mamun-desktop/1674028663.0261796/events.out.tfevents.1674028663.mamun-desktop.550835.1 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:002842f0fca227a5e8973c4aeec7752b247f405e3663209eed1ca80e4c796d7b
+size 5875
runs/Jan18_13-56-56_mamun-desktop/events.out.tfevents.1674028663.mamun-desktop.550835.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6e8eb361652a2b8cbd2a8fd404c4777c537344f5486517f294b434cd30d1193f
+size 10840
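
Note on the `runs/` entries above: the TensorBoard event files are committed as Git LFS pointers, each a three-line text file giving the spec version, the content hash, and the size in bytes. A minimal sketch of reading such a pointer follows, assuming the simple `key value` layout seen in this diff. The function name `parse_lfs_pointer` and the returned field names are hypothetical, and the sketch does not validate the pointer against the full LFS specification.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict (sketch only)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is a key, a single space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer text copied from the last runs/ entry in this diff.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:6e8eb361652a2b8cbd2a8fd404c4777c537344f5486517f294b434cd30d1193f
size 10840
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])  # prints "sha256 10840"
```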
special_tokens_map.json ADDED
@@ -0,0 +1,133 @@
+{
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<|startoftranscript|>",
+    "<|en|>",
+    "<|zh|>",
+    "<|de|>",
+    "<|es|>",
+    "<|ru|>",
+    "<|ko|>",
+    "<|fr|>",
+    "<|ja|>",
+    "<|pt|>",
+    "<|tr|>",
+    "<|pl|>",
+    "<|ca|>",
+    "<|nl|>",
+    "<|ar|>",
+    "<|sv|>",
+    "<|it|>",
+    "<|id|>",
+    "<|hi|>",
+    "<|fi|>",
+    "<|vi|>",
+    "<|iw|>",
+    "<|uk|>",
+    "<|el|>",
+    "<|ms|>",
+    "<|cs|>",
+    "<|ro|>",
+    "<|da|>",
+    "<|hu|>",
+    "<|ta|>",
+    "<|no|>",
+    "<|th|>",
+    "<|ur|>",
+    "<|hr|>",
+    "<|bg|>",
+    "<|lt|>",
+    "<|la|>",
+    "<|mi|>",
+    "<|ml|>",
+    "<|cy|>",
+    "<|sk|>",
+    "<|te|>",
+    "<|fa|>",
+    "<|lv|>",
+    "<|bn|>",
+    "<|sr|>",
+    "<|az|>",
+    "<|sl|>",
+    "<|kn|>",
+    "<|et|>",
+    "<|mk|>",
+    "<|br|>",
+    "<|eu|>",
+    "<|is|>",
+    "<|hy|>",
+    "<|ne|>",
+    "<|mn|>",
+    "<|bs|>",
+    "<|kk|>",
+    "<|sq|>",
+    "<|sw|>",
+    "<|gl|>",
+    "<|mr|>",
+    "<|pa|>",
+    "<|si|>",
+    "<|km|>",
+    "<|sn|>",
+    "<|yo|>",
+    "<|so|>",
+    "<|af|>",
+    "<|oc|>",
+    "<|ka|>",
+    "<|be|>",
+    "<|tg|>",
+    "<|sd|>",
+    "<|gu|>",
+    "<|am|>",
+    "<|yi|>",
+    "<|lo|>",
+    "<|uz|>",
+    "<|fo|>",
+    "<|ht|>",
+    "<|ps|>",
+    "<|tk|>",
+    "<|nn|>",
+    "<|mt|>",
+    "<|sa|>",
+    "<|lb|>",
+    "<|my|>",
+    "<|bo|>",
+    "<|tl|>",
+    "<|mg|>",
+    "<|as|>",
+    "<|tt|>",
+    "<|haw|>",
+    "<|ln|>",
+    "<|ha|>",
+    "<|ba|>",
+    "<|jw|>",
+    "<|su|>",
+    "<|translate|>",
+    "<|transcribe|>",
+    "<|startoflm|>",
+    "<|startofprev|>",
+    "<|nocaptions|>",
+    "<|notimestamps|>"
+  ],
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|endoftext|>",
+  "unk_token": {
+    "content": "",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
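
Note on `special_tokens_map.json` above: the `additional_special_tokens` list mixes control tokens (such as `<|startoftranscript|>` and `<|notimestamps|>`) with language tokens (such as `<|en|>` and `<|bn|>`). A minimal sketch of separating out the language codes follows. The helper name `language_codes` and the two-to-three-letter pattern are assumptions that happen to fit the tokens listed in this file; they are not a rule taken from the tokenizer itself.

```python
import re

# Assumption: language tokens wrap a 2-3 letter lowercase code, e.g. <|bn|> or <|haw|>.
LANG_TOKEN = re.compile(r"^<\|([a-z]{2,3})\|>$")


def language_codes(special_tokens_map: dict) -> list:
    """Return the language codes found in additional_special_tokens, in file order."""
    codes = []
    for token in special_tokens_map.get("additional_special_tokens", []):
        match = LANG_TOKEN.match(token)
        if match:
            codes.append(match.group(1))
    return codes


# A short excerpt of the list above; the full file would be loaded with json.load().
excerpt = {
    "additional_special_tokens": [
        "<|endoftext|>",
        "<|startoftranscript|>",
        "<|en|>",
        "<|bn|>",
        "<|haw|>",
        "<|translate|>",
        "<|notimestamps|>",
    ]
}
print(language_codes(excerpt))  # prints ['en', 'bn', 'haw']
```

Applied to the full list in this file, the same filter should yield the 99 entries from `<|en|>` through `<|su|>`, leaving out the eight control tokens.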
tokenizer_config.json ADDED
@@ -0,0 +1,36 @@
+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "errors": "replace",
+  "model_max_length": 1024,
+  "name_or_path": "openai/whisper-tiny",
+  "pad_token": null,
+  "processor_class": "WhisperProcessor",
+  "return_attention_mask": false,
+  "special_tokens_map_file": null,
+  "tokenizer_class": "WhisperTokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6d7d372554ba8ea6218aeec2b8a0315e296c1d8621c01416e476391211ff1dc8
+size 3579
vocab.json ADDED
The diff for this file is too large to render. See raw diff