Higher WER on fleurs dataset

#1
by deepdml - opened

Amazing work!

Do you have any idea why the WER on the FLEURS data is worse than in the original paper?

[Screenshot: reported FLEURS WER compared with the original paper]

Thank you very much! I am still trying to measure that; the evaluation script takes too long, and the WER I reported is not accurate (I was using only a few steps for the evaluation, so I'll change that) :S. I think I am getting better results with this model: juancopi81/whisper-medium-es. But my guess is that we are overfitting to common-voice-11, so the model does worse on FLEURS. What do you think?

Yes, I think so.

For example, I've trained a whisper-medium Spanish model on common-voice-11, voxpopuli, multilingual_librispeech and fleurs. The WER on common-voice is worse than yours, but better on fleurs:

  • common-voice-11: 6.3465 %
  • google/fleurs: 4.03 %
  • facebook/voxpopuli: Calculating...
  • facebook/multilingual_librispeech: Calculating...

PS: I'm using the script provided in the Whisper Sprint (Dec. 2022) to get the google/fleurs, facebook/voxpopuli and facebook/multilingual_librispeech WERs.
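In case it is useful, the metric behind those percentages is plain WER from the evaluate library; a minimal sketch (the strings are made up, and the sprint script of course does much more around this):

```python
# Minimal sketch of the metric behind the percentages above (not the sprint
# script itself): WER from the evaluate library, reported as a percentage.
import evaluate

wer_metric = evaluate.load("wer")

references = ["los autobuses parten de la estación durante todo el día"]
predictions = ["los autobuses parten de la estacion durante todo el dia"]

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f} %")  # 2 of 10 reference words differ -> 20.00 %
```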

Let me calculate again on FLEURS removing the --max_eval_samples=16 and see if I can get a more accurate WER. Are you using --max_eval_samples=16?

Oh, I just read your README file and noticed you removed max_eval_samples.

Yes, I've removed it, and running on Google Colab takes about 1h15min.

python run_eval_whisper_streaming.py --model_id="deepdml/whisper-medium-mix-es" --dataset="google/fleurs" --config="es_419" --device=0 --language="es"
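For reference, this is roughly what that command boils down to; a simplified sketch, not the actual script (batching, normalization and other details differ):

```python
# Simplified sketch of a streaming FLEURS evaluation like the command above
# (not the actual run_eval_whisper_streaming.py; details will differ).
import evaluate
from datasets import load_dataset
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="deepdml/whisper-medium-mix-es",
    device=0,
)
# Force Spanish transcription, mirroring --language="es".
asr.model.config.forced_decoder_ids = asr.tokenizer.get_decoder_prompt_ids(
    language="es", task="transcribe"
)

# Streaming avoids downloading the whole test split up front.
fleurs = load_dataset("google/fleurs", "es_419", split="test", streaming=True)

wer_metric = evaluate.load("wer")
predictions, references = [], []
for sample in fleurs:
    predictions.append(asr(sample["audio"])["text"])
    references.append(sample["transcription"])  # normalized FLEURS column

print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
```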

Thank you very much! I'm running it again in Colab with that removed, and I'll report the results :) But yes, I think I am mainly overfitting to common-voice; I'll now try interleaving the datasets.
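Roughly what I have in mind for the interleaving, as a sketch with the datasets library (I'm assuming the raw_transcription column for FLEURS so casing and punctuation are kept; the real training script still needs the usual feature extraction on top):

```python
# Rough sketch of interleaving Common Voice 11 and FLEURS for training
# (both streamed; CV-11 is gated, so you need to have accepted its terms
# and be logged in with the Hugging Face CLI).
from datasets import Audio, interleave_datasets, load_dataset

def prepare(ds, text_column):
    # Unify every corpus on 16 kHz audio plus a single "sentence" text column.
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
    if text_column != "sentence":
        ds = ds.rename_column(text_column, "sentence")
    extra = [c for c in ds.features if c not in ("audio", "sentence")]
    return ds.remove_columns(extra)

cv = prepare(
    load_dataset("mozilla-foundation/common_voice_11_0", "es", split="train", streaming=True),
    "sentence",
)
fleurs = prepare(
    load_dataset("google/fleurs", "es_419", split="train", streaming=True),
    "raw_transcription",  # keeps casing and punctuation
)

# Alternate between the corpora instead of exhausting one before the other.
train_dataset = interleave_datasets([cv, fleurs], stopping_strategy="all_exhausted")
```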

Hmmm, I just noticed that you also trained using FLEURS, and your model shows a worse WER on FLEURS than the original OpenAI one. So even mixing the datasets affects the results on FLEURS?

It seems so. I'm analyzing my model's predictions on the FLEURS dataset to compare them with the original OpenAI model.
I'll let you know my conclusions too.

Thank you!!

Oh, I just ran it again on this model, juancopi81/whisper-medium-es, removing max_eval_samples, and it gave me a better WER on FLEURS: 5.88. I'll try the other models and see what happens.

OK, after checking the FLEURS test predictions, I discovered that the reference transcriptions are not 100% accurate... Some examples:

The correct sentence is "fue tanta la gente que se concentró que no todos pudieron acceder al funeral en la plaza san pedro", like the prediction.
[Screenshot: FLEURS reference vs. model prediction]

The correct sentence is "los autobuses parten de la estación entre distritos al otro lado del río durante todo el día aunque la mayoría en especial los que viajan al este y a jakar también llamada bumthang salen entre las 6 30 y las 7 h"
[Screenshot: FLEURS reference vs. model prediction]

Another problem is related to numbers; maybe we need to apply normalization, because in some cases the model predicts the digit version and in other cases writes it out.

So, for the above reasons, we get a high WER on the google/fleurs dataset.
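A normalization step before scoring would at least remove the casing/punctuation part of the mismatch. A rough sketch with the language-agnostic BasicTextNormalizer that ships with transformers (the strings below are made up to mimic the "6 30" example above; digits vs. spelled-out numbers would still need separate handling, e.g. with something like num2words):

```python
# Sketch: normalize references and predictions before computing WER so that
# casing and punctuation differences (e.g. "6:30" vs "6 30") stop counting as errors.
# Spelled-out numbers ("seis" vs "6") are NOT handled by this normalizer.
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()  # language-agnostic: lowercases, strips punctuation
wer_metric = evaluate.load("wer")

references = ["Los autobuses salen entre las 6:30 y las 7 h."]
predictions = ["los autobuses salen entre las 6 30 y las 7 h"]

wer_raw = 100 * wer_metric.compute(predictions=predictions, references=references)
wer_norm = 100 * wer_metric.compute(
    predictions=[normalizer(p) for p in predictions],
    references=[normalizer(r) for r in references],
)
print(f"raw: {wer_raw:.1f} %  normalized: {wer_norm:.1f} %")
```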

Ah, OK, so I understand our WER on google/fleurs is actually better than the one reported by the script, right? That's important! Also, the model is already very good at Spanish; I think that's why it's difficult to improve it without overfitting. But for that we should have a way to measure its performance, and it seems the FLEURS dataset has some problems.

> Ah, OK, so I understand our WER on google/fleurs is actually better than the one reported by the script, right?

Yeah.

I've also been comparing your best medium Spanish model on the leaderboard, whisper-medium-es, and it looks like fine-tuning only on common-voice-11 produces overfitting. My model, whisper-medium-mix-es, was trained on cv11+fleurs+voxpopuli+mls and seems more robust:
[Screenshot: WER comparison across datasets for both models]

Oh yes, I think that model is very specialized on CV-11. I also fine-tuned another one with CV+fleurs: juancopi81/whisper-medium-es-common-fleurs. It has better results on FLEURS (I think 4.44) but slightly worse on CV (6.07). I am unsure about the results on the other datasets; I haven't run the script. Also, since the other datasets have a different format (uncased, no punctuation), I did not add them, so the model learns casing and punctuation. How did you handle that?

I also noticed that the script uses the transcription column and not raw_transcription. I wanted the model to learn casing and punctuation, so I trained it using raw_transcription from FLEURS plus CV. So I don't know whether this could also affect the evaluation results on the other datasets.
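If you want to see the difference quickly, here is a small sketch that streams one FLEURS test example and prints both columns; which one the script scores against clearly matters for a model that outputs casing and punctuation:

```python
# Compare the two FLEURS text columns: the normalized "transcription" used by
# the eval script vs. the cased, punctuated "raw_transcription".
from datasets import load_dataset

fleurs = load_dataset("google/fleurs", "es_419", split="test", streaming=True)
sample = next(iter(fleurs))

print("transcription:     ", sample["transcription"])
print("raw_transcription: ", sample["raw_transcription"])
```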
