openai/whisper-large-v3 · Incoherences in timestamps between chunked and sequential form

Apr 10, 2024

Hi, I'm currently playing with Whisper V3 deployed on SageMaker instances to see if it fits my needs.
As I understand it, there are two ways of dealing with longer audio files:
- By setting the chunk_length_s parameter of the automatic-speech-recognition pipeline, the algorithm processes chunks of audio in parallel (controlled by the batch_size parameter, if I'm not mistaken). This solution is faster, but each part of the audio has less context, and in my tests it performs a bit worse.
- By not setting the chunk_length_s parameter (default value of 0), the algorithm processes the audio sequentially.
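
For reference, here is a minimal sketch of the two call styles described above, assuming the Hugging Face transformers pipeline API. The helper names (build_call_kwargs, transcribe), the 30-second chunk length and the batch size of 8 are illustrative choices, not values taken from this thread.

```python
def build_call_kwargs(chunked: bool, batch_size: int = 8) -> dict:
    """Return call-time arguments for the ASR pipeline (hypothetical helper)."""
    kwargs = {"return_timestamps": True}  # segment-level timestamps
    if chunked:
        # Chunked mode: the audio is split into windows that are batched together.
        kwargs["chunk_length_s"] = 30
        kwargs["batch_size"] = batch_size
    # Sequential mode: chunk_length_s is simply left unset.
    return kwargs


def transcribe(audio_path: str, chunked: bool) -> dict:
    """Run Whisper on one file; downloads the model on first use."""
    from transformers import pipeline  # imported lazily, it is a heavy dependency

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
    # The result is a dict with "text" and, with timestamps enabled, "chunks",
    # where each chunk looks like {"timestamp": (start, end), "text": "..."}.
    return asr(audio_path, **build_call_kwargs(chunked))
```

Calling transcribe("call.wav", chunked=True) and transcribe("call.wav", chunked=False) on the same (hypothetical) file makes it easy to compare the "chunks" lists side by side.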

My issue is the following:
When I use the chunked/parallel algorithm and look at sentence timestamps (return_timestamps=True), the timestamps are continuous from 0s to the end of the audio file. However, if I use the default sequential algorithm, I get timestamps that reset to 0s every 30 seconds.

I would have expected this to be the other way around. It does not make much sense to me and so I'm wondering if I'm making a mistake.

Timestamps are especially important for me since I'm dealing with audio where each channel is a speaker, and I want to run Whisper on each channel before merging both transcriptions into one file (so I need timestamps to keep the sentences of the two speakers/channels in order).
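
The merge step described above could be sketched as follows, assuming each channel's transcript is a list of chunks shaped like the pipeline output ({"timestamp": (start, end), "text": ...}) with timestamps that are already continuous. The function name merge_channels and the speaker labels are made up for illustration.

```python
def merge_channels(left, right, speakers=("Speaker 1", "Speaker 2")):
    """Interleave two per-channel transcripts by segment start time."""
    tagged = []
    for speaker, chunks in zip(speakers, (left, right)):
        for chunk in chunks:
            start, end = chunk["timestamp"]
            tagged.append((start, end, speaker, chunk["text"].strip()))
    # sorted() is stable, so segments with equal start times keep
    # the channel order given above.
    return sorted(tagged, key=lambda item: item[0])


left = [
    {"timestamp": (0.0, 2.5), "text": " Hello, how can I help?"},
    {"timestamp": (6.0, 8.0), "text": " Sure, one moment."},
]
right = [
    {"timestamp": (3.0, 5.5), "text": " Hi, I have a question."},
]

for start, end, speaker, text in merge_channels(left, right):
    print(f"[{start:6.1f} - {end:6.1f}] {speaker}: {text}")
```

This only works if both channels share the same time origin, which is why timestamps that reset to 0 partway through the file break the ordering.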

If anyone knows the reason behind this or could help me better understand what is going on, it would be greatly appreciated.

Have a nice day,

MaksimGorkii

May 7, 2024

Did you find any solution to that problem? I am in the same boat

NicholasGri

May 30, 2024

Hello, did you find the solution?

Pablogps

Jun 12, 2024

I'm also seeing this issue. The resets do not happen exactly every 30 seconds, but rather at the end of the previous chunk, which is roughly 30 seconds long (it may be a bit shorter, probably because the split looks for silence?). After each reset the timestamps always start at 0, even if the chunk is just a short phrase.
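
Given that behaviour, one possible workaround is to post-process the output and shift each window by the end time of the previous one. This is only a heuristic sketch: it assumes a reset shows up as a start time that jumps backwards, and that the next window begins exactly where the last segment of the previous window ended, which may not hold precisely. The function name make_timestamps_continuous is hypothetical.

```python
def make_timestamps_continuous(chunks):
    """Shift timestamps that restart at 0 onto one continuous timeline.

    `chunks` is a list of {"timestamp": (start, end), "text": str} dicts.
    """
    fixed = []
    offset = 0.0    # time to add to the current window's timestamps
    prev_end = 0.0  # end of the previous segment on the continuous timeline
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:  # the last segment may have no end timestamp
            end = start
        # A start that falls before the previous end is treated as a reset.
        if start + offset < prev_end:
            offset = prev_end
        new_start, new_end = start + offset, end + offset
        fixed.append({**chunk, "timestamp": (new_start, new_end)})
        prev_end = new_end
    return fixed


raw = [
    {"timestamp": (0.0, 5.0), "text": " first"},
    {"timestamp": (5.0, 28.5), "text": " second"},
    {"timestamp": (0.0, 3.0), "text": " third"},   # reset
    {"timestamp": (3.0, 10.0), "text": " fourth"},
    {"timestamp": (0.0, 2.0), "text": " fifth"},   # reset
]
for chunk in make_timestamps_continuous(raw):
    print(chunk["timestamp"], chunk["text"])
```

Segments that genuinely overlap inside one window would be misread as resets, so the result should be spot-checked against the audio before relying on it.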

danny122001

Sep 11, 2024

You can use OpenAI's own implementation (the openai-whisper package). It supports segment-level timestamps without the 30-second breaks.

!pip install openai-whisper

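import whisper

# Load the model once and reuse it across calls.
model = whisper.load_model('large')

def get_transcribe(audio: str, language: str = 'en'):
    # The returned dict has a 'segments' list with 'start'/'end' times in seconds.
    return model.transcribe(audio=audio, language=language, verbose=True)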