Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
sanchit-gandhi 
posted an update Mar 5
Post
Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.

But why does this work?..

My interpretation is that forcing the model to predict timestamps is contradictory to hallucinations. Suppose you have the transcription:
The cat sat on the on the on the mat.

Where we have a repeated hallucination for “on the”. If we ask the model to predict timestamps, then the “on the” has to contribute to the overall segment-level timing, e.g.:
<|0.00|> The cat sat on the on the on the mat.<|5.02|>

However, it’s impossible to fit 3 copies of “on the” within the time allocation given to the segment, so the probability for this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with highest probability:
<|0.00|> The cat sat on the mat.<|5.02|>

In this sense, the end timestamp is of the opposite of the initial timestamp constraint they describe in Section 4.5 of the paper Robust Speech Recognition via Large-Scale Weak Supervision (2212.04356) → it helps the model remove extra words at the end of the sequence (rather than the initial timestamp which helps when the model ignores words at the start), but the overall principle is the same (using timestamps to improve the probability of more realistic sequences).

Leaving it open to you: why do you think timestamps reduces Whisper hallucinations?

I have seen such hallucination in Hindi, Gujarati language also.

  1. Is this solution helps for low-resource language like Indian languages?
  2. I have fine-tuned the whisper model for indian languages without time-stamp prediction. Then still this solution works or we need to fine-tuned the whisper model with time-stamp prediction ?
·
  1. Yes, it should be language agnostic
  2. You would need to repeat fine-tuning your model, this time in a way that preserves timestamps. If you have timestamps in your target data, you can continue using these. If you don't have timestamps in your data, you can try training with LoRA. Using LoRAs reduces the amount of catastrophic forgetting, so even though we don't have timestamps in our fine-tuning data, the model remembers how to make timestamp'd predictions. You can see a guide on LoRA fine-tuning using the PEFT library here. Note that you want to run inference in half/full precision (not 8-bit), as outlined here

Note that the original post is a hypothesis for why timestamps reduces hallucinations. It would need to be tested and evaluated to confirm whether these findings hold more generally!

Bro i need your help
I saw your project on https://huggingface.co/spaces/sanchit-gandhi/whisper-jax
I am student and i use it for translating videos and i have exams upcoming next week
I want to translate a video which is 4 hours long and it says it is not supported so i spilt the video into 2 parts which are under 2 hours and when i upload the video it shows https://huggingface.co/spaces/sanchit-gandhi/whisper-jax i have video downloaded in my device using y2mate i even tried mp.4,mp.3 formats i no nothing about coding...etc i am a student i wait for your help i am using mobile