Crashed: when getting end of sentence timestamps and word-level timestamps at the same time

#49
by liutian-sunshine - opened

It seems reasonable to want both end-of-sentence (segment-level) timestamps and word-level timestamps at the same time.
As an inelegant workaround, I tried calling pipe() twice, once with return_timestamps="word" and once with return_timestamps=True.
But the second call throws an exception:
AttributeError: 'ModelOutput' object has no attribute 'numpy'

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=1,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

src_path = '2539595_hq.mp3'
words = pipe(src_path, return_timestamps="word")
sentences = pipe(src_path)
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 3
      1 src_path = '2539595_hq.mp3'
      2 pipe(src_path,return_timestamps="word")
----> 3 pipe(src_path)

File ~/data/opt/virtualenv-python3.10/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py:357, in AutomaticSpeechRecognitionPipeline.__call__(self, inputs, **kwargs)
    294 def __call__(
    295     self,
    296     inputs: Union[np.ndarray, bytes, str],
    297     **kwargs,
    298 ):
    299     """
    300     Transcribe the audio sequence(s) given as inputs to text. See the [`AutomaticSpeechRecognitionPipeline`]
    301     documentation for more information.
   (...)
    355                 `"".join(chunk["text"] for chunk in output["chunks"])`.
    356     """
--> 357     return super().__call__(inputs, **kwargs)

File ~/data/opt/virtualenv-python3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1132, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1130     return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1131 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1132     return next(
   1133         iter(
   1134             self.get_iterator(
   1135                 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1136             )
   1137         )
   1138     )
   1139 else:
   1140     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/data/opt/virtualenv-python3.10/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:125, in PipelineIterator.__next__(self)
    123 # We're out of items within a batch
    124 item = next(self.iterator)
--> 125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".
    127 if self.loader_batch_size is not None:
    128     # Try to infer the size of the batch

File ~/data/opt/virtualenv-python3.10/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py:613, in AutomaticSpeechRecognitionPipeline.postprocess(self, model_outputs, decoder_kwargs, return_timestamps, return_language)
    611 stride = None
    612 for outputs in model_outputs:
--> 613     items = outputs[key].numpy()
    614     stride = outputs.get("stride", None)
    615     if stride is not None and self.type in {"ctc", "ctc_with_lm"}:

AttributeError: 'ModelOutput' object has no attribute 'numpy'
liutian-sunshine changed discussion title from how to get the end of sentence timestamps as well as word-level timestamps at the same time? to Crashed: when getting end of sentence timestamps and word-level timestamps at the same time

Were you able to solve this issue?

It isn't necessary to use the pipeline the way shown above.

@kennykang

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", task="transcribe", word_timestamps=True)
print(result)

The resulting dict then contains both segment-level and word-level timestamp info.
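
For reference, a minimal sketch of reading both levels out of that result, assuming the standard openai-whisper output format (result["segments"] is a list of segment dicts, and each segment carries a "words" list when word_timestamps=True):

# Sketch: print segment-level and word-level timestamps from the
# openai-whisper result dict (assumes word_timestamps=True was used).
for segment in result["segments"]:
    print(f'[{segment["start"]:.2f} -> {segment["end"]:.2f}] {segment["text"].strip()}')
    for word in segment.get("words", []):
        print(f'    {word["start"]:.2f} -> {word["end"]:.2f}: {word["word"].strip()}')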

liutian-sunshine changed discussion status to closed

@liutian-sunshine can you show us an example? I'm only getting sentence-level timestamps when I do this.

In the AutomaticSpeechRecognitionPipeline documentation:

For the Whisper model, timestamps can take one of two formats:
- `"word"`: same as above for word-level CTC timestamps. Word-level timestamps are predicted through the *dynamic-time warping (DTW)* algorithm, an approximation to word-level timestamps by inspecting the cross-attention weights.
- `True`: the pipeline will return timestamps along the text for *segments* of words in the text. For instance, if you get `[{"text": " Hi there!", "timestamp": (0.5, 1.5)}]`, then it means the model predicts that the segment "Hi there!" was spoken after `0.5` and before `1.5` seconds. Note that a segment of text refers to a sequence of one or more words, rather than individual words as with word-level timestamps.
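
With the transformers pipeline, one way to get both levels from a single call is to request word-level timestamps only and derive the segment boundaries yourself. A minimal sketch, assuming the chunk format documented above (each word chunk carries a (start, end) tuple under "timestamp"); the punctuation-based grouping rule is just an illustration:

# Sketch: build sentence-level timestamps from one word-level pass.
out = pipe(src_path, return_timestamps="word")

sentences, current = [], []
for chunk in out["chunks"]:
    current.append(chunk)
    # Treat sentence-ending punctuation as a segment boundary (illustrative rule).
    if chunk["text"].strip().endswith((".", "?", "!")):
        sentences.append({
            "text": "".join(c["text"] for c in current),
            "timestamp": (current[0]["timestamp"][0], current[-1]["timestamp"][1]),
        })
        current = []

if current:  # trailing words without closing punctuation
    sentences.append({
        "text": "".join(c["text"] for c in current),
        "timestamp": (current[0]["timestamp"][0], current[-1]["timestamp"][1]),
    })

print(sentences)

Note that the last word's end timestamp can be None when Whisper does not predict an ending timestamp (see the warning above), so a guard may be needed in practice.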
