Whisper Small - Italian

This model is a fine-tuned version of openai/whisper-small on the Common-voice-11.0 dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4549
  • Wer: 200.40
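
For readers unfamiliar with the metric: WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of reference words. It is therefore not capped at 100, which is how values above 100 such as the one reported here are arithmetically possible. The sketch below is a minimal word-level edit-distance implementation for illustration only. It is not the evaluation code used for this card, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words (classic dynamic programming row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Made-up examples: one substituted word, then a hypothesis with repeated words.
print(word_error_rate("buongiorno a tutti", "buongiorno a tutte"))
print(word_error_rate("ciao mondo", "ciao ciao ciao mondo mondo mondo"))  # 200.0
```

In the second example the two-word reference needs four insertions to match the hypothesis, so the score is 4 / 2 = 200%.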

Model description

Whisper is a pre-trained model for automatic speech recognition (ASR) published in September 2022 by the authors Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as Wav2Vec 2.0, which are pre-trained on un-labelled audio data, Whisper is pre-trained on a vast quantity of labelled audio-transcription data, 680,000 hours to be precise. This is an order of magnitude more data than the un-labelled audio data used to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this pre-training data is multilingual ASR data. This results in checkpoints that can be applied to over 96 languages, many of which are considered low-resource.

When scaled to 680,000 hours of labelled pre-training data, Whisper models demonstrate a strong ability to generalise to many datasets and domains. The pre-trained checkpoints achieve results competitive with state-of-the-art ASR systems, with near 3% word error rate (WER) on the test-clean subset of LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (cf. Table 8 of the Whisper paper). The extensive multilingual ASR knowledge acquired by Whisper during pre-training can be leveraged for other low-resource languages; through fine-tuning, the pre-trained checkpoints can be adapted for specific datasets and languages to further improve upon these results.

Intended uses & limitations

The goal of this fine-tuned model is experimentation: it allowed the authors to gain skills and knowledge on how the fine-tuning process is carried out. The model serves as the basis for a small Gradio-hosted application that transcribes recordings and audio files in Italian. The application also accepts a link to an Italian YouTube video and returns its transcription.
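
As a rough illustration of how the model could be used for transcription, the sketch below wraps the Transformers `pipeline` API. It is not the code of the Gradio application, which is not shown in this card. It assumes the repository id `Silemo/whisper-it` (the name under which this card is published), that `ffmpeg` is installed so that audio files can be decoded, and that `recording.wav` is a placeholder path.

```python
MODEL_ID = "Silemo/whisper-it"  # assumed repository id of this model


def build_generate_kwargs(language: str = "italian") -> dict:
    """Decoding options that pin Whisper to transcription in one language."""
    return {"language": language, "task": "transcribe"}


def transcribe(audio_path: str) -> str:
    """Transcribe a local audio file; the model is downloaded on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,  # split long recordings into 30-second chunks
    )
    result = asr(audio_path, generate_kwargs=build_generate_kwargs())
    return result["text"]


# Example call (placeholder path, not a file shipped with the model):
#   text = transcribe("recording.wav")
```

Forcing the language and task avoids relying on Whisper's automatic language detection, which seems a reasonable choice for an Italian-only application.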

The limitations of this project mainly concern the limited resources available for fine-tuning the model: the free version of Google Colab and a Google Drive with limited space, used as feature storage. The time dedicated to this project was also limited, as it had to fit within academic deadlines.

Training and evaluation data

Training was carried out on the Google Colab platform. The evaluation data, like the rest of the data, was taken from the Common-voice-11.0 dataset, reduced to 10% of its original size to keep the training time manageable.
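
The card does not say how the 10% subset was selected. One simple possibility, sketched below, is the split-slicing syntax of the Datasets library, which takes the first 10% of a split. The Hub id `mozilla-foundation/common_voice_11_0`, the `"it"` configuration name, and the helper names are assumptions. The dataset is gated on the Hub, so loading it may require accepting its terms and logging in first.

```python
DATASET_ID = "mozilla-foundation/common_voice_11_0"  # assumed Hub id


def subset_split(split: str, percent: int) -> str:
    """Build a split expression selecting the first `percent` % of a split."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"{split}[:{percent}%]"


def load_italian_subset(split: str = "train", percent: int = 10):
    """Load a slice of the Italian configuration (needs network access)."""
    from datasets import load_dataset  # imported lazily: heavy dependency

    return load_dataset(DATASET_ID, "it", split=subset_split(split, percent))


# Example calls (download required):
#   train_data = load_italian_subset("train", 10)
#   eval_data = load_italian_subset("test", 10)
```

Slicing takes the first rows in stored order. A shuffled sample would need an explicit `shuffle` followed by `select`, which is a different choice from the one shown here.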

Training procedure

Training was conducted on Google Colab, using a Jupyter notebook to write the code and document the training. Google Drive was used as the feature store. Due to the limited resources of the free version of Google Colab, checkpointing was used to save partial results and resume from them in a later run. The notebook was run 15 times, at approximately 40 min per 100 training steps, for a total of about 26.5 h of training. Because Google Colab was available to us for no more than 4 h a day, training alone took around 7 days.
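
The resume-between-runs pattern described above can be sketched as follows. The Transformers `Trainer` saves checkpoints in folders named `checkpoint-<step>`, so each new session only needs to locate the most recent folder and pass it to `train`. The helper below is an illustrative stand-in, not the notebook's actual code (Transformers ships a similar utility, `get_last_checkpoint`, in `transformers.trainer_utils`), and `OUTPUT_DIR` is a hypothetical Google Drive path.

```python
import os
import re
from typing import Optional

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers numerically so checkpoint-1000 beats checkpoint-900.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Usage at the start of each Colab session, with `trainer` an already
# configured Seq2SeqTrainer and OUTPUT_DIR a hypothetical Drive folder:
#   trainer.train(resume_from_checkpoint=latest_checkpoint(OUTPUT_DIR))
# Passing None (no checkpoint found) starts training from scratch.
```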

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 16
  • eval_batch_size: 8
  • training_steps: 4000
  • gradient_accumulation_steps: 2
  • save_steps: 100
  • eval_steps: 100
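
As a hedged sketch, the values above could map onto `Seq2SeqTrainingArguments` roughly as shown below. The argument names follow the Transformers release listed in this card (newer releases rename `evaluation_strategy` to `eval_strategy`). The `output_dir` value, the `"steps"` strategies, and `predict_with_generate=True` are assumptions, since the card does not state them.

```python
HYPERPARAMETERS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # train_batch_size
    "per_device_eval_batch_size": 8,    # eval_batch_size
    "max_steps": 4000,                  # training_steps
    "gradient_accumulation_steps": 2,
    "save_steps": 100,
    "eval_steps": 100,
}


def effective_batch_size(params: dict, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return (params["per_device_train_batch_size"]
            * params["gradient_accumulation_steps"]
            * num_devices)


def build_training_arguments(output_dir: str):
    """Build the arguments object; requires transformers to be installed."""
    from transformers import Seq2SeqTrainingArguments  # lazy, heavy import

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,        # hypothetical path, e.g. a Drive folder
        evaluation_strategy="steps",  # assumption: evaluate every eval_steps
        save_strategy="steps",        # assumption: save every save_steps
        predict_with_generate=True,   # assumption: needed to score WER on text
        **HYPERPARAMETERS,
    )


print(effective_batch_size(HYPERPARAMETERS))  # 32 on a single device
```

With a per-device batch of 16 and 2 accumulation steps, each optimizer update sees 32 samples on a single GPU, and saving every 100 steps over 4000 steps yields 40 checkpoints.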

Training results

| Run Number | Step | Training Loss | Validation Loss | Wer |
|---|---|---|---|---|
| 1 | 100 | 1.2396 | 1.2330 | 176.40 |
| 2 | 200 | 0.7389 | 0.8331 | 80.49 |
| 2 | 300 | 0.2951 | 0.4261 | 70.20 |
| 2 | 400 | 0.2703 | 0.4051 | 101.60 |
| 3 | 500 | 0.2491 | 0.3923 | 112.20 |
| 3 | 600 | 0.1700 | 0.3860 | 107.10 |
| 3 | 700 | 0.1603 | 0.3836 | 90.36 |
| 4 | 800 | 0.1607 | 0.3786 | 135.00 |
| 4 | 900 | 0.1540 | 0.3783 | 99.05 |
| 4 | 1000 | 0.1562 | 0.3667 | 98.32 |
| 4 | 1100 | 0.0723 | 0.3757 | 158.90 |
| 5 | 1200 | 0.0769 | 0.3789 | 215.20 |
| 5 | 1300 | 0.0814 | 0.3779 | 170.50 |
| 5 | 1400 | 0.0786 | 0.3770 | 140.60 |
| 5 | 1500 | 0.0673 | 0.3777 | 137.10 |
| 6 | 1600 | 0.0339 | 0.3892 | 166.50 |
| 7 | 1700 | 0.0324 | 0.3963 | 170.90 |
| 7 | 1800 | 0.0348 | 0.4004 | 163.40 |
| 8 | 1900 | 0.0345 | 0.4016 | 158.60 |
| 8 | 2000 | 0.0346 | 0.4020 | 176.10 |
| 8 | 2100 | 0.0317 | 0.4001 | 134.70 |
| 9 | 2200 | 0.0173 | 0.4141 | 189.30 |
| 9 | 2300 | 0.0174 | 0.4106 | 175.00 |
| 9 | 2400 | 0.0165 | 0.4204 | 179.60 |
| 10 | 2500 | 0.0172 | 0.4185 | 186.10 |
| 10 | 2600 | 0.0142 | 0.4175 | 181.10 |
| 11 | 2700 | 0.0090 | 0.4325 | 161.70 |
| 11 | 2800 | 0.0069 | 0.4362 | 161.20 |
| 11 | 2900 | 0.0093 | 0.4342 | 157.50 |
| 12 | 3000 | 0.0076 | 0.4352 | 154.50 |
| 12 | 3100 | 0.0089 | 0.4394 | 184.30 |
| 13 | 3200 | 0.0063 | 0.4454 | 166.00 |
| 13 | 3300 | 0.0059 | 0.4476 | 179.20 |
| 13 | 3400 | 0.0058 | 0.4490 | 189.60 |
| 14 | 3500 | 0.0051 | 0.4502 | 194.20 |
| 14 | 3600 | 0.0064 | 0.4512 | 187.40 |
| 14 | 3700 | 0.0053 | 0.4520 | 190.20 |
| 14 | 3800 | 0.0049 | 0.4545 | 194.90 |
| 15 | 3900 | 0.0052 | 0.4546 | 199.60 |
| 15 | 4000 | 0.0054 | 0.4549 | 200.40 |
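
The logged values can also be scanned programmatically to find which step scored best on each metric. The short sketch below copies the step, validation loss and WER columns from the table above and picks the minimum of each. The variable names are illustrative, and this is not part of the original training notebook.

```python
# (step, validation_loss, wer) copied from the training results table
ROWS = [
    (100, 1.2330, 176.40), (200, 0.8331, 80.49), (300, 0.4261, 70.20),
    (400, 0.4051, 101.60), (500, 0.3923, 112.20), (600, 0.3860, 107.10),
    (700, 0.3836, 90.36), (800, 0.3786, 135.00), (900, 0.3783, 99.05),
    (1000, 0.3667, 98.32), (1100, 0.3757, 158.90), (1200, 0.3789, 215.20),
    (1300, 0.3779, 170.50), (1400, 0.3770, 140.60), (1500, 0.3777, 137.10),
    (1600, 0.3892, 166.50), (1700, 0.3963, 170.90), (1800, 0.4004, 163.40),
    (1900, 0.4016, 158.60), (2000, 0.4020, 176.10), (2100, 0.4001, 134.70),
    (2200, 0.4141, 189.30), (2300, 0.4106, 175.00), (2400, 0.4204, 179.60),
    (2500, 0.4185, 186.10), (2600, 0.4175, 181.10), (2700, 0.4325, 161.70),
    (2800, 0.4362, 161.20), (2900, 0.4342, 157.50), (3000, 0.4352, 154.50),
    (3100, 0.4394, 184.30), (3200, 0.4454, 166.00), (3300, 0.4476, 179.20),
    (3400, 0.4490, 189.60), (3500, 0.4502, 194.20), (3600, 0.4512, 187.40),
    (3700, 0.4520, 190.20), (3800, 0.4545, 194.90), (3900, 0.4546, 199.60),
    (4000, 0.4549, 200.40),
]

best_by_loss = min(ROWS, key=lambda row: row[1])  # lowest validation loss
best_by_wer = min(ROWS, key=lambda row: row[2])   # lowest WER

print("lowest validation loss:", best_by_loss)  # (1000, 0.3667, 98.32)
print("lowest WER:", best_by_wer)               # (300, 0.4261, 70.2)
```

Validation loss bottoms out at step 1000 and WER at step 300, while the figures reported at the top of this card correspond to the final checkpoint at step 4000.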

Framework versions

  • Transformers 4.36.0.dev0
  • Pytorch 2.1.0+cu118
  • Datasets 2.15.0
  • Tokenizers 0.15.0

Model size: 242M params (Safetensors, tensor type F32)
