Transcribe audio longer than 30 seconds

When I run the Whisper model on an audio clip of around 2 minutes, the output is truncated and does not end with the <|endoftext|> tag.

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")


<|startoftranscript|> <|en|> <|transcribe|> null Well , she was very short . She was about five foot tall . So she always had this rather null null bou ff ant hairstyle , and you see , to give her a few extra inches , and very , very high heels , null null which she wore even first thing on a Sunday morning . And a terrifying mean , I think . null null I say all these things about her because as a the youngest child by some years after my older null null siblings , I was always kind of an observer of this , and a slightly am

Is there any way to transcribe an entire audio file that is longer than 30 seconds?

Hi, how can I use the pipeline to process long audio in a non-English language?

You should be able to do the following for Hindi:

import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

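# NOTE: the arguments of this call were cut off in the original post; the values
# below are assumptions chosen to match the rest of the thread.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",  # assumed: the checkpoint from the question above
    chunk_length_s=30,  # assumed: split long audio into 30-second chunks
    device=device,
)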

ds = load_dataset("common_voice", "hi", split="validation", streaming=True)
sample = next(iter(ds))["audio"]

prediction = pipe(sample.copy(), batch_size=8, generate_kwargs={"language": "hi", "task": "transcribe"})["text"]

# we can also return timestamps for the predictions
prediction = pipe(sample.copy(), batch_size=8, generate_kwargs={"language": "hi", "task": "transcribe"}, return_timestamps=True)["chunks"]

You can change the language and task arguments as required.
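
For intuition about why a long file has to be split: Whisper works on 30-second windows of audio, so anything longer needs to be cut into pieces that are transcribed separately and then stitched together. Here is a rough sketch of that windowing in plain Python. Note that `chunk_bounds`, its parameters, and the 5-second overlap are my own illustrative choices, not the pipeline's actual implementation, which handles overlap and merging in its own way.

```python
def chunk_bounds(num_samples, sampling_rate=16_000, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio.

    Hypothetical helper for illustration only; not the transformers implementation.
    """
    if num_samples <= 0:
        return []
    chunk = int(chunk_length_s * sampling_rate)
    stride = int(stride_s * sampling_rate)
    if chunk <= stride:
        raise ValueError("chunk_length_s must be larger than stride_s")
    step = chunk - stride  # each window starts `step` samples after the previous one
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += step
    return bounds


# A 2-minute clip at 16 kHz, like the one in the question
two_minutes = 120 * 16_000
for start, end in chunk_bounds(two_minutes):
    print(f"{start / 16_000:6.1f}s -> {end / 16_000:6.1f}s")
```

Each window overlaps the previous one, so words that fall on a boundary appear in full in at least one chunk. A clip shorter than 30 seconds comes back as a single window.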

