Error with word level timestamps - ValueError: set return_segments=True
#60, opened by dkincaid
I'm getting an error when I try to set return_timestamps='word'. It seems to want me to set a parameter 'return_segments=True', but when I try to do that I get a different error that the parameter is not valid.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    # max_new_tokens=128,
    # chunk_length_s=30,
    # batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"language": "english"}
)
result = pipe(audio_file_path, return_timestamps="word", generate_kwargs={"language": "english"})
ValueError: Make sure to set `return_segments=True` to return generation outputs as part of the `'segments' key.`
And if I set return_segments=True, I get this error:
TypeError: AutomaticSpeechRecognitionPipeline._sanitize_parameters() got an unexpected keyword argument 'return_segments'
Has anyone else run into this and figured out how to fix it?
You should pass the return_segments parameter inside generate_kwargs.
Well, that doesn't work either. That generates this error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-b4642e8f26a4> in <cell line: 2>()
----> 1 pipe(audio_file, return_timestamps="word", generate_kwargs={"return_segments": True})
5 frames
/usr/local/lib/python3.10/dist-packages/transformers/pipelines/automatic_speech_recognition.py in _forward(self, model_inputs, return_timestamps, generate_kwargs)
    575             )
    576         if return_timestamps == "word" and self.type == "seq2seq_whisper":
--> 577             out = {"tokens": tokens["sequences"], "token_timestamps": tokens["token_timestamps"]}
    578         else:
    579             out = {"tokens": tokens}
KeyError: 'token_timestamps'
Hey @dkincaid, try with batch_size=1:
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    # max_new_tokens=128,
    # chunk_length_s=30,
    batch_size=1,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"language": "english"}
)
result = pipe(audio_file_path, return_timestamps="word", generate_kwargs={"language": "english"})
Try uncommenting (or adding) chunk_length_s=30. Your audio file may be longer than 30 seconds, which becomes a problem when using return_timestamps="word".
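For what it's worth, here is a rough sketch of the chunking suggestion above: measure the file length first and only pass chunk_length_s when the audio is longer than Whisper's 30-second window. The helper names (wav_duration_seconds, asr_pipeline_kwargs, build_pipe) are made up for illustration, it assumes a PCM WAV input, and I can't promise it clears the error on every transformers version.

```python
import wave

WHISPER_WINDOW_S = 30.0  # Whisper processes audio in 30-second windows


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (stdlib only)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def asr_pipeline_kwargs(duration_s, window_s=WHISPER_WINDOW_S):
    """Hypothetical helper: extra kwargs for pipeline(...) given the audio length."""
    kwargs = {}
    if duration_s > window_s:
        # Long-form audio: ask the pipeline to split it into 30-second chunks.
        kwargs["chunk_length_s"] = window_s
    return kwargs


def build_pipe(model, processor, audio_path, device, torch_dtype):
    """Build the pipeline from the original post, chunking only when needed."""
    # Imported lazily so the two helpers above run without transformers installed.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
        **asr_pipeline_kwargs(wav_duration_seconds(audio_path)),
    )
```

You would then call build_pipe(model, processor, audio_file_path, device, torch_dtype) and run the result with return_timestamps="word" exactly as in the original post. If the error persists, adding batch_size=1 as suggested earlier in the thread is the other thing to try.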