Smaller model sizes lead to worse RTF on Whisper

#8
by lorenzopark - opened

Thank you for your efforts on evaluating the various models and datasets. This will be a good reference for ASR tasks!

I am wondering why the RTF of the Whisper "base" model is the best, followed by the small and tiny models.
I thought smaller model sizes would lead to lower RTF values.
If these results make sense, in what cases would a larger model show a better RTF than a smaller one?

Hugging Face for Audio org

Hey @lorenzopark , RTF is defined as:

total_processing_time / total_audio_time
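As a minimal sketch of that definition (all timings hypothetical), RTF can be computed like:

```python
def real_time_factor(total_processing_time: float, total_audio_time: float) -> float:
    """RTF = total_processing_time / total_audio_time.

    RTF < 1 means the model transcribes faster than real time;
    RTF > 1 means it is slower than real time.
    """
    return total_processing_time / total_audio_time

# Hypothetical example: 30 s of audio transcribed in 6 s
print(real_time_factor(6.0, 30.0))  # 0.2
```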

In the case of the smaller Whisper models, we sometimes see instances of hallucinations or repeated passages of text. This means the total number of generated tokens is higher for these models, so while the processing time per token is lower, the overall processing time is greater, resulting in a higher RTF. We're discussing how to make the RTF more robust by averaging over more audio data and removing the chunking paradigm on this GH issue. Feel free to have a read and see what you think - would be interested in knowing whether averaging the RTF over the 9 short-form datasets is something you think makes sense!

Thank you for the explanation! That makes sense to me now. I will check out the issue too.

lorenzopark changed discussion status to closed
