openai/whisper-large-v2 · return_timestamps="word" not working correctly

Hey @sanchit-gandhi ,

I am trying to fine-tune a Whisper large-v2 model.
While testing the fine-tuned version, I found that for one audio file the model returns word timestamps only up to a certain point. After that point, the text for the remaining duration is still predicted correctly, but no word timestamps are returned.

My observation is that the model stops outputting chunks at the point where a special character is output.
I verified that the audio is not corrupted. I also tried splitting the audio file into two parts, with the split point 20-30 seconds before the position where the issue occurs, so that the second part runs to the end of the audio. If I pass only this second part, word timestamps are returned again.
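For anyone who wants to reproduce the split test, here is a minimal sketch of the idea. The helper names `split_waveform` and `shift_chunks` are hypothetical (they are not part of transformers), and the sketch assumes the audio is already loaded as a 1-D sequence of samples and that each chunk is a dict with "text" and "timestamp" keys, as in the sample output in this post.

```python
def split_waveform(samples, sampling_rate, split_at_s):
    """Split a 1-D sequence of audio samples at split_at_s seconds."""
    split_index = int(round(split_at_s * sampling_rate))
    # Clamp so that an out-of-range split point cannot produce a bad slice.
    split_index = max(0, min(split_index, len(samples)))
    return samples[:split_index], samples[split_index:]


def shift_chunks(chunks, offset_s):
    """Move chunk timestamps of the second part back onto the original timeline."""
    shifted = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        shifted.append(
            {
                "text": chunk["text"],
                # A timestamp can be None when it could not be predicted.
                "timestamp": (
                    None if start is None else start + offset_s,
                    None if end is None else end + offset_s,
                ),
            }
        )
    return shifted
```

The second half returned by `split_waveform` is what gets passed to the pipeline on its own, and `shift_chunks` only makes its timestamps comparable with those from the full-length run.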

Would appreciate your advice regarding this issue!

Code

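from transformers import pipeline

# MODEL_ID is the path or Hub ID of the fine-tuned checkpoint.
# The result is stored under a new name so that the imported
# pipeline() function is not overwritten.
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=MODEL_ID,
    generate_kwargs={"language": "<|gu|>", "task": "translate"},
    chunk_length_s=30,
    return_timestamps="word",
    device=0,
)

finetuned_output = asr_pipeline("/content/test_audio.mp3")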

Sample Output

{"text": "The command is more important. In this way, He gives a sweet reprimand. At that time, He could have had an argument, He could have had a defence, He could have broken His head, bloodied Himself and said, that it is not my fault, this is not the way, this is a wrong complaint, He could have doubted or spoken against Him. � He only experienced the benevolence. Because true discipleship had manifested. " }

{"chunks": [{"text": "The command is more important.", "timestamp": [4443.82, 4445.76]}, {"text": "In this way, He gives a sweet reprimand.", "timestamp": [4445.76, 4448.24]}, {"text": "At that time, He could have had an argument,", "timestamp": [4448.24, 4450.92]}, {"text": "He could have had a defence,", "timestamp": [4450.92, 4452.94]}, {"text": "He could have broken His head, bloodied Himself and said,", "timestamp": [4452.94, 4456.6]}, {"text": "that it is not my fault, this is not the way, this is a wrong complaint,", "timestamp": [4456.6, 4460.58]}, {"text": "He could have doubted or spoken against Him.", "timestamp": [4460.58, 4462.68]}, {"text": "", "timestamp": [4462.68, 4463.44]}]}

Note: The above is just a sample output (the actual audio file is over 1.5 hours long). In this sample, the last 2-3 sentences are transcribed correctly in the text, but no word timestamps are returned for them. All sentences that follow in the text are also missing word timestamps.
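To pin down where the timestamps stop, a small check along these lines may help. It is only a sketch: `untimed_tail` is a hypothetical helper, and it assumes the pipeline result is a dict with a "text" string and a "chunks" list whose items have a "text" key, as in the sample above. It compares the words in "text" with the words covered by "chunks" and returns the part of the text that has no timestamps.

```python
REPLACEMENT_CHAR = "\ufffd"  # the "�" character seen in the sample text


def untimed_tail(output):
    """Return the trailing part of output["text"] that no chunk covers."""
    words = output["text"].split()
    timed_words = [
        word
        for chunk in output.get("chunks", [])
        for word in chunk["text"].split()
    ]
    # Walk forward while the full text and the chunk text agree word by word.
    matched = 0
    while (
        matched < min(len(words), len(timed_words))
        and words[matched] == timed_words[matched]
    ):
        matched += 1
    return " ".join(words[matched:])


sample = {
    "text": "He could have doubted. \ufffd He only experienced the benevolence.",
    "chunks": [
        {"text": " He could have doubted.", "timestamp": (0.0, 2.1)},
        {"text": "", "timestamp": (2.1, 2.9)},
    ],
}
tail = untimed_tail(sample)
print(repr(tail))
print("tail starts at special character:", tail.startswith(REPLACEMENT_CHAR))
```

If the untimed tail consistently starts at the special character, that would support the observation that this character is where the chunks stop.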