Spaces:

openai
/

whisper

Running on L40S

App Files Files Community

132

performance drops and Hallucination Increases after finetuning the Whisper-small

#100

by anahar - opened Dec 18, 2023

Discussion

anahar

Dec 18, 2023

Hello,

I'm currently working on fine-tuning the Whisper-Small model for my specific use case. My dataset consists of Hinglish (a mix of Hindi and English) audio samples paired with their corresponding English text. I aim to generate outputs in English only. I am fine-tuning the model for translation task. The custom dataset comprises approximately 200 hours of audio clips, each lasting between 15 to 30 seconds.

During training, I've noticed that both the training loss and validation loss decrease consistently, and the BLEU score on the validation set is showing improvement. However, when I perform inference on the testing set, the model's performance drops.

At checkpoint-3000 (3000 steps), my model's performance was marginally better (around 1%) than the base Whisper-Small model. Yet, upon further fine-tuning for additional steps, the model's performance on the testing set declined. I observed the model seemed to hallucinate too much(words and the sentence repetition), leading to a decrease in overall performance as the number of iterations increased.
the checkpoint 18000 has the worst performance because the hallucination was too much in the outputs.

what could be the possible cause for this? and how can I improve this?

Following the blog by @sanchit-gandhi (https://huggingface.co/blog/fine-tune-whisper), I made some adjustments to the fine tuning script:

1)Set the language to "Hindi" and task to "translation"
3)set model.config.forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language="hi", task="translate").
4)Changed the evaluation metric to "BLEU."

alfonsofr

Dec 18, 2023

•

edited Dec 18, 2023

Sounds like overfitting to me, but it could be anything. Check this out: "Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models" https://arxiv.org/abs/2205.10770

anahar

Dec 18, 2023

•

edited Dec 18, 2023

@alfonsofr although it sound like overfitting, but I tried all the checkpoint with 500 steps, 1000 steps and 2000 steps etc. the results are worse than the base model. only for checkpoint-3000 the results slightly improved (around 1%). and considering the amount of Data I have i.e.more than 200 hours, It's really tough to agree if it's because of overfitting. my instinct can be wrong as well, as it's hard to figure out where things are breaking.

the major problem I am facing is the hallucination basically Repetition of the words/sentences which is degrading the model performance.

I'll def. look at the paper which you mentioned and see if I can get any fruitful way which can help me to finetune the model.

EdgeFM

Aug 23, 2024

I met a similar question: the training loss decreases normally, while the WER on training/val dataset are both nearly 90%. So this must not be overfitting.

robaudenaerde

Nov 4, 2024

Did you find the cause? I'm experiencing similar issues with whisper-small on the common-voice dataset for Dutch..

KevinKibe

Feb 18

I'm facing the same problem while fine-tuning whisper-small, the loss gradually decreases but the WER is not consistent, I have also noticed repetition in the predicted outputs during the evaluation step, any solution found yet

anahar

Feb 18

We realized that overfitting was an issue because, given the model parameters, 200 hours of training data was relatively too small (which I was using earlier for training). After scaling the data, the model’s performance improved drastically. For cases with limited data, it’s better to use either the Whisper Tiny or Base models, or consider using LoRA for more efficient training.

anahar changed discussion status to closed Feb 18

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment