Error with word level timestamps - ValueError: set return_segments=True

#60
by dkincaid - opened

I'm getting an error when I try to set return_timestamps='word'. It seems to want me to set the parameter return_segments=True, but when I try to do that I get a different error saying that parameter is not valid.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    #max_new_tokens=128,
    #chunk_length_s=30,
    #batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"language": "english"}
)

result = pipe(audio_file_path, return_timestamps="word", generate_kwargs={"language": "english"})
ValueError: Make sure to set `return_segments=True` to return generation outputs as part of the `'segments' key.`

And if I set return_segments=True, I get this error:

TypeError: AutomaticSpeechRecognitionPipeline._sanitize_parameters() got an unexpected keyword argument 'return_segments'

Has anyone else run into this and figured out how to fix it?

You should use the return_segments param in generate_kwargs.
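
I.e. something along these lines (just a sketch of that suggestion, reusing the pipe and audio_file_path from the first post):

# Sketch of the suggestion: pass return_segments through generate_kwargs
result = pipe(audio_file_path, return_timestamps="word", generate_kwargs={"return_segments": True})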

Well, that doesn't work either. That generates this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-b4642e8f26a4> in <cell line: 2>()
----> 1 pipe(audio_file, return_timestamps="word", generate_kwargs={"return_segments": True})

5 frames
/usr/local/lib/python3.10/dist-packages/transformers/pipelines/automatic_speech_recognition.py in _forward(self, model_inputs, return_timestamps, generate_kwargs)
    575             )
    576             if return_timestamps == "word" and self.type == "seq2seq_whisper":
--> 577                 out = {"tokens": tokens["sequences"], "token_timestamps": tokens["token_timestamps"]}
    578             else:
    579                 out = {"tokens": tokens}

KeyError: 'token_timestamps'

Hey @dkincaid
Try with batch_size=1

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    #max_new_tokens=128,
    #chunk_length_s=30,
    batch_size=1,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"language": "english"}
)

result = pipe(audio_file_path, return_timestamps="word", generate_kwargs={"language": "english"})

Hello @dkincaid
I encountered the same problem. Have you solved it? Thank you!

Try uncommenting or including chunk_length_s=30. Your audio file may be longer than 30 seconds, which becomes a problem when using return_timestamps="word".
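
For what it's worth, here is a minimal sketch combining the two suggestions above (chunk_length_s=30 and batch_size=1), reusing the model, processor, torch_dtype, device, and audio_file_path defined earlier in the thread; the "chunks" key of the output is where the per-word timestamps should end up:

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # chunked inference: split audio longer than 30 s into windows
    batch_size=1,       # word timestamps reportedly fail here with batched generation
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"language": "english"},
)

result = pipe(audio_file_path, return_timestamps="word")
print(result["text"])
print(result["chunks"])  # list of {"text": word, "timestamp": (start, end)} entries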
