Transcribe audio longer than 30 seconds

#13
by xyang16 - opened

When I run the Whisper model (openai/whisper-base) on an audio clip around 2 minutes long, the output is truncated and does not end with the <|endoftext|> token.

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

Output:

<|startoftranscript|> <|en|> <|transcribe|> null Well , she was very short . She was about five foot tall . So she always had this rather null null bou ff ant hairstyle , and you see , to give her a few extra inches , and very , very high heels , null null which she wore even first thing on a Sunday morning . And a terrifying mean , I think . null null I say all these things about her because as a the youngest child by some years after my older null null siblings , I was always kind of an observer of this , and a slightly am

Is there any way to transcribe the whole audio longer than 30 seconds?
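For context on why the output stops: Whisper's feature extractor always pads or truncates the input to a fixed 30-second log-mel window (3000 frames), so a plain generate() call never sees anything past the first 30 seconds. A minimal sketch demonstrating this, using a synthetic 2-minute array of silence (the silent input is only for illustration):

```python
import numpy as np
from transformers import WhisperFeatureExtractor

# Whisper's feature extractor pads/truncates every input to a fixed
# 30-second log-mel window, so model.generate() only ever receives
# the first 30 seconds of audio.
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

two_minutes = np.zeros(16000 * 120, dtype=np.float32)  # 120 s at 16 kHz
features = fe(two_minutes, sampling_rate=16000, return_tensors="np").input_features

# 3000 frames = 30 seconds, regardless of the 120-second input
print(features.shape)
```

This is why long-form transcription needs chunking on top of the model, as in the pipeline answer below.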


Hi, how can I use the pipeline to transcribe long audio in a non-English language?

You should be able to do the following for Hindi:

import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-base",
  chunk_length_s=30,
  device=device,
)

ds = load_dataset("common_voice", "hi", split="validation", streaming=True)
sample = next(iter(ds))["audio"]

prediction = pipe(sample.copy(), batch_size=8, generate_kwargs={"language": "hi", "task": "transcribe"})["text"]

# we can also return timestamps for the predictions
prediction = pipe(sample.copy(), batch_size=8, generate_kwargs={"language": "hi", "task": "transcribe"}, return_timestamps=True)["chunks"]

You can change the language and task arguments as required.
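With return_timestamps=True, the pipeline returns its "chunks" as a list of dicts of the form {"text": str, "timestamp": (start, end)}. A small hypothetical helper (format_chunks is not part of transformers) to render these into readable lines:

```python
# Hypothetical helper: format the pipeline's timestamped chunks
# ({"text": str, "timestamp": (start, end)} dicts) into plain lines.
def format_chunks(chunks):
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"[{start:.2f}s -> {end:.2f}s] {chunk['text'].strip()}")
    return "\n".join(lines)

# Example with dummy chunks shaped like the pipeline output:
print(format_chunks([
    {"text": " Hello world.", "timestamp": (0.0, 2.5)},
    {"text": " Second segment.", "timestamp": (2.5, 5.0)},
]))
```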
