Output Discrepancy

#82 by eBisw

Hi,

I was trying Whisper-based models in one of my projects and noticed something odd: the output from the Hugging Face implementation differs from the one from GitHub (https://github.com/openai/whisper) and from the example test UI on the right side of the model card page. Why is the transcription quality so much poorer on the same file with the approach below, compared to the other two methods?

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")

This made me wonder whether some preprocessing of the audio file is involved, or something else. It would be really nice if you could share some insight!
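
For completeness, the rest of my inference looks roughly like this (sample.wav is just a placeholder for my actual file, and librosa is only what I happen to use for loading and resampling):

import librosa  # any loader that yields a 16 kHz mono float array works

# Whisper models expect 16 kHz mono audio; resample on load if needed
audio, sr = librosa.load("sample.wav", sr=16000)

# the processor builds the log-mel spectrogram the model was trained on
# (it pads/truncates to 30-second windows by default)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# greedy decoding unless num_beams or other generation options are passed
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)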

Maybe the beam size or other options differ? From the faster-whisper README:

https://github.com/SYSTRAN/faster-whisper

"If you are comparing the performance against other Whisper implementations, you should make sure to run the comparison with similar settings. In particular:

Verify that the same transcription options are used, especially the same beam size. For example in openai/whisper, model.transcribe uses a default beam size of 1 but here we use a default beam size of 5.
When running on CPU, make sure to set the same number of threads. Many frameworks will read the environment variable OMP_NUM_THREADS, which can be set when running your script:
OMP_NUM_THREADS=4 python3 my_script.py"
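
For what it's worth, matching the decoding settings explicitly might look something like this. This is only a rough sketch that reuses the model, processor, and inputs from the snippets above; sample.wav is again a placeholder:

# request the same beam size everywhere before comparing outputs

# transformers is greedy by default; ask for beam search explicitly
predicted_ids = model.generate(inputs.input_features, num_beams=5)
text_hf = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# openai/whisper also defaults to greedy decoding in model.transcribe
import whisper
result = whisper.load_model("large-v3").transcribe("sample.wav", beam_size=5)
text_openai = result["text"]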
