Crash when getting sentence-level timestamps and word-level timestamps at the same time
It's reasonable to want both sentence-level timestamps and word-level timestamps for the same audio. As an inelegant workaround, I tried calling pipe() twice, once with return_timestamps="word" and once with return_timestamps=True. But the second call throws an exception:
AttributeError: 'ModelOutput' object has no attribute 'numpy'
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=1,
    return_timestamps=True,  # default to segment-level timestamps
    torch_dtype=torch_dtype,
    device=device,
)

src_path = '2539595_hq.mp3'
words = pipe(src_path, return_timestamps="word")  # word-level call: works
sentences = pipe(src_path)                        # segment-level call: crashes
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[2], line 3
1 src_path = '2539595_hq.mp3'
2 pipe(src_path,return_timestamps="word")
----> 3 pipe(src_path)
File ~/data/opt/virtualenv-python3.10/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py:357, in AutomaticSpeechRecognitionPipeline.__call__(self, inputs, **kwargs)
294 def __call__(
295 self,
296 inputs: Union[np.ndarray, bytes, str],
297 **kwargs,
298 ):
299 """
300 Transcribe the audio sequence(s) given as inputs to text. See the [`AutomaticSpeechRecognitionPipeline`]
301 documentation for more information.
(...)
355 `"".join(chunk["text"] for chunk in output["chunks"])`.
356 """
--> 357 return super().__call__(inputs, **kwargs)
File ~/data/opt/virtualenv-python3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1132, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1130 return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
1131 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1132 return next(
1133 iter(
1134 self.get_iterator(
1135 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
1136 )
1137 )
1138 )
1139 else:
1140 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/data/opt/virtualenv-python3.10/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:125, in PipelineIterator.__next__(self)
123 # We're out of items within a batch
124 item = next(self.iterator)
--> 125 processed = self.infer(item, **self.params)
126 # We now have a batch of "inferred things".
127 if self.loader_batch_size is not None:
128 # Try to infer the size of the batch
File ~/data/opt/virtualenv-python3.10/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py:613, in AutomaticSpeechRecognitionPipeline.postprocess(self, model_outputs, decoder_kwargs, return_timestamps, return_language)
611 stride = None
612 for outputs in model_outputs:
--> 613 items = outputs[key].numpy()
614 stride = outputs.get("stride", None)
615 if stride is not None and self.type in {"ctc", "ctc_with_lm"}:
AttributeError: 'ModelOutput' object has no attribute 'numpy'
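One possible workaround while this is unresolved: since the crash only appears on the second call of the same pipeline object, building a fresh pipeline per timestamp granularity may avoid it. This is a minimal sketch, assuming (not verified) that the crash comes from the two calls sharing pipeline state; it reuses model, processor, torch_dtype, device, and src_path from the repro above, and the make_pipe helper is hypothetical:

def make_pipe(ts):
    # Build a fresh pipeline for each timestamp granularity so the two
    # calls cannot interfere with each other's postprocessing parameters.
    return pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        chunk_length_s=30,
        batch_size=1,
        return_timestamps=ts,
        torch_dtype=torch_dtype,
        device=device,
    )

words = make_pipe("word")(src_path)    # word-level chunks
sentences = make_pipe(True)(src_path)  # segment-level chunks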
Were you able to solve this issue?
It's not necessary to do it that way. The original openai-whisper package returns both in one call:
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", task="transcribe", word_timestamps=True)
print(result)
The result then contains timestamp info at both the word level and the segment (sentence) level.
@liutian-sunshine can you show us an example? I'm only getting sentence-level timestamps when I do this.
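For reference, a minimal sketch of reading both granularities from the transcribe() result, assuming the usual openai-whisper result layout (a "segments" list where each segment carries start/end times and, with word_timestamps=True, a "words" list); the audio filename is illustrative:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    # Segment-level ("sentence") timestamps
    print(f'[{segment["start"]:.2f} - {segment["end"]:.2f}]{segment["text"]}')
    # Word-level timestamps, present when word_timestamps=True was passed
    for word in segment.get("words", []):
        print(f'  {word["start"]:.2f} - {word["end"]:.2f}:{word["word"]}')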
In the AutomaticSpeechRecognitionPipeline documentation:
For the Whisper model, timestamps can take one of two formats:
- `"word"`: same as above for word-level CTC timestamps. Word-level timestamps are predicted
through the *dynamic-time warping (DTW)* algorithm, an approximation to word-level timestamps
by inspecting the cross-attention weights.
- `True`: the pipeline will return timestamps along the text for *segments* of words in the text.
For instance, if you get `[{"text": " Hi there!", "timestamp": (0.5, 1.5)}]`, then it means the
model predicts that the segment "Hi there!" was spoken after `0.5` and before `1.5` seconds.
Note that a segment of text refers to a sequence of one or more words, rather than individual
words as with word-level timestamps.
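Given those two formats, one way to get both granularities from a single pipeline call is to request return_timestamps="word" and rebuild segment spans from the word chunks. A minimal sketch, assuming the documented chunk format [{"text": ..., "timestamp": (start, end)}, ...]; the sentence-splitting heuristic (end punctuation) is an illustrative assumption, not part of the pipeline API:

output = pipe(src_path, return_timestamps="word")

sentences, current = [], []
for chunk in output["chunks"]:
    current.append(chunk)
    # Illustrative heuristic: treat end punctuation as a sentence boundary.
    if chunk["text"].rstrip().endswith((".", "!", "?")):
        sentences.append({
            "text": "".join(c["text"] for c in current),
            "timestamp": (current[0]["timestamp"][0], current[-1]["timestamp"][1]),
        })
        current = []
if current:  # trailing words with no closing punctuation
    sentences.append({
        "text": "".join(c["text"] for c in current),
        "timestamp": (current[0]["timestamp"][0], current[-1]["timestamp"][1]),
    })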