Output Discrepancy

#82 by eBisw

Hi,

I was trying Whisper-based models in one of my projects and noticed something odd: the output from the Hugging Face implementation differs from the one from GitHub (https://github.com/openai/whisper) and from the example test UI on the right side of the model card page. Why is the transcription quality so much poorer on the same file with the approach below, compared to the other two methods?

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")

This made me wonder whether some preprocessing of the audio file is involved, or something else. It would be really nice if you could share some insight!
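
For completeness, the rest of my inference looks roughly like this (sample.wav is just a placeholder for my actual file, and librosa is only what I happen to use for loading and resampling):

import librosa  # any loader that yields a 16 kHz mono float array works

# Whisper models expect 16 kHz mono audio; resample on load if needed
audio, sr = librosa.load("sample.wav", sr=16000)

# the processor builds the log-mel spectrogram the model was trained on
# (it pads/truncates to 30-second windows by default)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# greedy decoding unless num_beams or other generation options are passed
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)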

Maybe the beam size or other options differ? From the faster-whisper README:

https://github.com/SYSTRAN/faster-whisper

"If you are comparing the performance against other Whisper implementations, you should make sure to run the comparison with similar settings. In particular:

Verify that the same transcription options are used, especially the same beam size. For example in openai/whisper, model.transcribe uses a default beam size of 1 but here we use a default beam size of 5.
When running on CPU, make sure to set the same number of threads. Many frameworks will read the environment variable OMP_NUM_THREADS, which can be set when running your script:
OMP_NUM_THREADS=4 python3 my_script.py"
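
For what it's worth, matching the decoding settings explicitly might look something like this. This is only a rough sketch that reuses the model, processor, and inputs from the snippets above; sample.wav is again a placeholder:

# request the same beam size everywhere before comparing outputs

# transformers is greedy by default; ask for beam search explicitly
predicted_ids = model.generate(inputs.input_features, num_beams=5)
text_hf = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# openai/whisper also defaults to greedy decoding in model.transcribe
import whisper
result = whisper.load_model("large-v3").transcribe("sample.wav", beam_size=5)
text_openai = result["text"]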
