About length of audio

#72
by RaysDipesh - opened

Is there any way to transcribe audio longer than 30 seconds without splitting it into chunks, and to use this through the Inference API? Thanks in advance.

Hey @RaysDipesh. The behaviour you've encountered here is how the Whisper model handles padded/truncated inputs: all input audios are padded/truncated to 30 seconds, regardless of their actual length, before being converted to log-mel spectrogram inputs. The model is then trained without an attention mask; instead, it learns to ignore the padded portions directly from the spectrogram inputs.

At inference time, we have to match the paradigm the model was trained on, i.e. always pad/truncate audios to 30 seconds. This is why the feature extractor always pads/truncates audio to 30 seconds before computing the log-mel spectrogram, and why the encoder's positional embeddings expect a fixed sequence length of 1500, which corresponds to 30 seconds of audio input. You'll find that the OpenAI Whisper implementation also forces the inputs to always be 30 seconds long; the Transformers implementation matches this behaviour for strict one-to-one equivalence.
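
As a quick illustration of this padding behaviour, here's a minimal sketch (assuming transformers and numpy are installed, and using a dummy 5-second clip purely for demonstration) that passes a short audio through the Whisper feature extractor and prints the resulting fixed-length spectrogram:

import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v2")

# a dummy 5-second clip sampled at 16 kHz
audio = np.zeros(5 * 16000, dtype=np.float32)

# the feature extractor pads the audio to 30 seconds before computing log-mel features
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

# the spectrogram has a fixed length corresponding to 30 seconds of audio,
# e.g. (1, 80, 3000) for whisper-large-v2, regardless of the input length
print(inputs.input_features.shape)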

The way we can run the model on longer audio samples is by splitting them into smaller 30-second chunks and then running inference to get the chunked transcriptions. Here's a code snippet showing how you can achieve this in Transformers:

import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# load the model into an ASR pipeline; chunk_length_s=30 enables chunked long-form transcription
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device=device,
)

# load a dummy audio sample from the LibriSpeech dataset on the Hugging Face Hub
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# pass a copy, since the pipeline pops entries from the input dict and we reuse `sample` below
prediction = pipe(sample.copy(), batch_size=8)["text"]

# we can also return timestamps for the predictions
prediction = pipe(sample, batch_size=8, return_timestamps=True)["chunks"]
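
With return_timestamps=True, each entry in the returned chunks should contain the transcribed text together with a (start, end) timestamp tuple in seconds, so the transcription can be aligned back to the original audio.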
