Models only transcribe a small part and then give up

#3
by IDontKnowWhatToNameMyself - opened

This has happened repeatedly across all models: only a small part of my audio gets transcribed, just a fraction of what I recorded. It may be down to the audio quality, since I'm recording on my phone, but it's really weird that it doesn't even try with parts that sound obvious in the audio.

Hey @IDontKnowWhatToNameMyself ! The Whisper model works on 30s windows: any audio sample longer than 30s is truncated (cut short) to 30s (see this blog post for details: https://huggingface.co/blog/fine-tune-whisper#load-whisperfeatureextractor)

Adding long-form transcription to handle audio samples > 30s is a TODO! See here: https://github.com/huggingface/transformers/issues/19887
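Until that lands, one rough workaround is to cut the audio into pieces of at most 30s yourself and transcribe each piece separately. A minimal sketch, assuming the audio is already loaded as a flat sequence of samples at 16 kHz (the rate Whisper's feature extractor expects); `split_into_chunks` is a hypothetical helper, not part of any library:

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000   # Whisper's feature extractor expects 16 kHz audio
CHUNK_SECONDS = 30     # window length discussed above


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
) -> List[Sequence[float]]:
    """Split audio into consecutive pieces of at most `chunk_seconds` each.

    Hypothetical helper for illustration only. The last piece may be
    shorter than the others.
    """
    chunk_len = sample_rate * chunk_seconds
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# 70 seconds of (silent) audio becomes three pieces: 30s, 30s and 10s.
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print([len(c) / SAMPLE_RATE for c in chunks])
```

Cutting at fixed positions like this can split a word in half at a boundary, so the transcripts may need some cleanup where the pieces join.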

That's weird, because my audio samples are shorter than 30 seconds and it only transcribes about 2 words of each.

Do you have code to reproduce? Or is this done through the Hosted Inference API?

I'm using the hosted inference API

Could be a quirk of the Whisper model that it behaves badly with low-quality audio.

Here's what we could do! Could you use the same inputs and pass them through the Whisper model here: https://huggingface.co/spaces/openai/whisper

Once the predictions have been generated, you can share them in the community tab (input audio + predictions)

You can tag me in the comment once you've done that!

That would help in trying to uncover why the model stops so early

I actually already posted something in the community tab and it was the same issue.
https://huggingface.co/spaces/openai/whisper/discussions/50#63743abb7da0b794a582366e

Thanks, replied there!

In short, the model terminates the generation process when it reaches the long period of no speech between the two spoken sentences.
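If that is what's happening, one possible workaround is to split the recording at long pauses and transcribe each spoken part on its own. A rough sketch of that idea, using a simple energy threshold; `split_on_silence` is a hypothetical helper, and the threshold values are assumptions that would need tuning for phone recordings:

```python
import math
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    sample_rate: int = 16_000,
    frame_ms: int = 20,
    silence_rms: float = 0.01,   # assumed threshold, tune per recording
    min_silence_s: float = 1.0,  # assumed: pauses this long split the audio
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of segments separated by long silence."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_silent_frames = max(1, int(min_silence_s * 1000 / frame_ms))

    segments: List[Tuple[int, int]] = []
    seg_start = None       # where the current spoken segment began
    last_voiced_end = 0    # end of the most recent non-silent frame
    silent_run = 0         # consecutive silent frames inside a segment

    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= silence_rms:
            if seg_start is None:
                seg_start = start
            last_voiced_end = start + len(frame)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # The pause is long enough: close the current segment.
                segments.append((seg_start, last_voiced_end))
                seg_start = None
                silent_run = 0

    if seg_start is not None:
        segments.append((seg_start, last_voiced_end))
    return segments


# Two "sentences" separated by 2 seconds of silence (toy 1 kHz signal).
audio = [0.5] * 500 + [0.0] * 2000 + [0.5] * 300
print(split_on_silence(audio, sample_rate=1000))
```

Each returned pair can be used to slice the original samples, so every spoken part is passed to the model as its own short input. Short pauses (below `min_silence_s`) stay inside a segment.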

IDontKnowWhatToNameMyself changed discussion status to closed
