whisper-small-ca / TRAINING.md

Background

The original Whisper models were trained on 680,000 hours of data collected from the Internet using large-scale weak supervision. Human-curated datasets like Common Voice or FLEURS were not used during training. The hypothesis is that by fine-tuning Whisper on human-curated datasets, quality can improve for a particular language or domain.

These fine-tuned Whisper models for the Catalan language were created in early 2023 as part of the Hugging Face Whisper Sprint.

The models were fine-tuned using these scripts.

For fine-tuning we used the Common Voice dataset, version 11.
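
As a rough illustration, the Catalan split of Common Voice 11 can be loaded with the Hugging Face datasets library as sketched below; the dataset id and column name are assumptions based on the public hub release, not taken from our training scripts.

```python
# Minimal sketch: load the Catalan portion of Common Voice 11 with the
# Hugging Face `datasets` library. The dataset id and column name are
# assumptions based on the public hub release, not on our training scripts.
# Note that the hub dataset requires accepting its terms and authenticating.
from datasets import load_dataset, Audio

common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0",  # assumed hub id
    "ca",                                    # Catalan configuration
    split="train+validation",
)

# Whisper expects 16 kHz audio, so resample the audio column accordingly.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

print(common_voice[0]["sentence"])  # transcription text of the first clip
```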

Learnings on fine-tuning Whisper models

The goal of the rest of this document is to share what we learned, in the spirit that it can benefit others. Things are shared as they are.

1. Model improves when benchmarked against Common Voice

The model improves on the WER evaluation metric when it is evaluated against the Common Voice test dataset. Taking the small fine-tuned model as an example, the final WER is 8.5, down from a starting WER of 13.
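
For reference, WER can be computed with the Hugging Face evaluate library as in the sketch below; the reference and prediction strings are placeholders, not our actual evaluation pipeline.

```python
# Minimal sketch of a WER computation; the strings below are placeholders,
# not the actual Common Voice evaluation run.
import evaluate

wer_metric = evaluate.load("wer")

references = ["bon dia a tothom", "el temps és molt variable"]
predictions = ["bon dia a tothom", "el temps es molt variable"]

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}")  # word error rate as a percentage
```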

2. Model degrades according to human evaluation

When doing human evaluation, the results for the fine-tuned Catalan language models were disappointing. The fine-tuned models clearly perform worse than the original OpenAI models, as reported by all the users (half a dozen) who tested them.

Our hypothesis is that the evaluation on Common Voice gives better results because the model is overfitted and has lost generalization capabilities.

3. Model degrades according to evaluation with other datasets

Results of evaluating with other datasets:

| Dataset | base | sc-base | small | sc-small | medium | sc-medium |
| --- | --- | --- | --- | --- | --- | --- |
| 15GdH9-curt | 55.60 | 70.40 | 36.62 | 78.75 | 78.75 | 22.39 |
| Ona_catalan-balear | 71.28 | 71.01 | 44.68 | 49.20 | 49.20 | 28.72 |
| Son_Goku_catalan_valencian_voice | 51.90 | 85.44 | 39.87 | 65.19 | 18.99 | 71.52 |
| Universal_Declaration_of_Human_Rights | 47.12 | 36.45 | 39.14 | 75.59 | 44.37 | 27.79 |

As you can see, the fine-tuned models perform worse than the OpenAI models in most scenarios.

Legend:

  • "sc-" Indicates Softcatalà fine-tuned model
  • The scores are WER metrics

4. Different inference clients provide different quality with fine-tuned Whisper models

Summary as of March 2023:

a. The OpenAI Whisper implementation does not support out-of-the-box inference with fine-tuned models, only with the original OpenAI models.

b. The HuggingFace Whisper implementation performs poorly. This can be really misleading when doing evaluations, since HuggingFace is the stack used for fine-tuning.

c. We have only been able to use the models reliably with the Whisper.cpp and CTranslate2 inference clients.

See how different clients can impact the WER when doing inference on the same file:

| Whisper client | WER |
| --- | --- |
| OpenAI | 27.32 |
| Whisper.cpp 1.2.1 | 38.89 |
| HuggingFace | 69.63 |
| CTranslate2 3.10.3 | 28.08 |

We strongly recommend using CTranslate2 as the inference client.
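
As a minimal sketch, this is roughly how a fine-tuned model can be run through CTranslate2 with the faster-whisper wrapper; the model directory, conversion command, and parameters below are illustrative assumptions, not the exact setup we used.

```python
# Minimal sketch: inference with CTranslate2 via the faster-whisper wrapper.
# The directory is assumed to contain a model converted beforehand, e.g. with:
#   ct2-transformers-converter --model <hf-model-dir> --output_dir whisper-small-ca-ct2
from faster_whisper import WhisperModel

model = WhisperModel("whisper-small-ca-ct2", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav", language="ca", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```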

5. Fine-tuning degrades timestamp prediction

Whisper uses timestamp tokens to indicate the timestamps of the transcribed texts.

The training scripts available for fine-tuning do not generate timestamp tokens, and as a result timestamp prediction degrades.

This is important since many people use Whisper models to create video subtitles, where timestamp prediction matters.
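
To illustrate what degrades, the sketch below shows roughly how timestamps are requested from the Hugging Face pipeline at inference time; the model id is a placeholder. With a model fine-tuned without timestamp tokens, the returned chunk timestamps tend to be unreliable.

```python
# Minimal sketch: asking the Hugging Face ASR pipeline for timestamps.
# The model id is a placeholder; with models fine-tuned without timestamp
# tokens, the "timestamp" values returned here tend to be unreliable.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # placeholder model id
)

result = asr("audio.wav", return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```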

Next steps

There is the possibility that fine-tuning cannot improve Whisper, or can only do so in certain scenarios (domain adaptation). See Nickolay Shmyrev's vision and tests.

Potential next steps:

  • The training scripts have to be improved to include timestamp tokens
  • New trainings should include more corpora than just Common Voice