Batch transcription

#1
by daniel-v-e - opened

Good day! Is it possible to somehow obtain the .pt model, in the same format the Whisper models are stored in under ~/.cache/whisper/ after doing 'pip install whisper' and running e.g. model = whisper.load_model('medium')?

I ask because I'd like to perform batch transcription of longer recordings, and running the model through Hugging Face only allows up to max_new_tokens = 484 or thereabouts, which requires all the input audio to be split.

If there is an alternative way to perform batch transcription, that would also be great.

Thanks!

Hi @daniel-v-e ! Can't you use the pytorch_model.bin for that? I don't know, I have never tried. Maybe @sanchit-gandhi or @vb can help you with that.
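For reference, a fine-tuned Hugging Face checkpoint can be loaded directly with transformers, with no conversion to the OpenAI .pt format needed; the pytorch_model.bin in the repo is just the state dict that from_pretrained reads for you. A minimal sketch (the checkpoint name below is a placeholder for your own fine-tuned repo id or local directory):

from transformers import WhisperProcessor, WhisperForConditionalGeneration

# placeholder checkpoint -- swap in your fine-tuned repo id or a local path
ckpt = "openai/whisper-medium"

processor = WhisperProcessor.from_pretrained(ckpt)
model = WhisperForConditionalGeneration.from_pretrained(ckpt)
model.eval()  # inference mode; the weights come from pytorch_model.bin under the hood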

Hey @daniel-v-e ! Here's a code snippet you can use to run 'streamed inference' with batching for audio samples of arbitrary length:

from datasets import load_dataset
import torch
from transformers import pipeline

dataset = load_dataset("mozilla-foundation/common_voice_11_0", "es", split="test", streaming=True)
# only for debugging: restricts the stream to the first 16 rows -> remove for a full run
dataset = dataset.take(16)

# change to checkpoint and language of your choice
ckpt = "openai/whisper-tiny"
lang = "es"
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=ckpt,
    chunk_length_s=30,
    device=device,
)

pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")

def iterate_data(dataset):
    for i, item in enumerate(dataset):
        yield item["audio"]

# set the batch size according to your device's memory
BATCH_SIZE = 16

predictions = []

# run streamed inference
for out in pipe(iterate_data(dataset), batch_size=BATCH_SIZE):
    predictions.append(out["text"])

print(predictions)

Print Output:

[' Habritan aguas poco profundas y lo cosas.', ' Opera principalmente huelo de cargotajes y regionales de carga.', ' 3', ' le alicó los estudios primarias en plancle para continuar en luego en español', ' En los años que siguieron este trabajo es parte pero de un coducena de buenas jugadores.', ' propuso un nuevo marco para los territorios de alabagu y buscan a barra y bícay', ' ¿Cual es cierto? ¿Se está tratando de recuperar su proteón de asísina escanarias?', ' Estas críticas inciden en varios aspectos.', ' Fue se pultada en el cementario general de Santiago.', ' Si', ' Maite Perroni no ha assistido por estar grabando una telenovela.', ' Otras propulsieron que era una superpluma africada, la que causó la de la pronación del mante.', ' Es un cactus de fácil cultivo de crecimiento vigoroso y rápido.', ' Sus principales mercados son Estados Unidos y Talía, España y Japón.', ' El archetypo del enfoque artístico en situ es el arte urbano.', ' Fórdates y M de le ayudarán a Sabatín y a componer y para decir los ni tan isis.']

Thanks, but doesn't this still require a long audio file to be split into 30-second chunks before transcription? What I am looking for is a way to, for example, pass a 40-minute audio file to a fine-tuned Hugging Face Whisper model after following the steps in your blog post https://huggingface.co/blog/fine-tune-whisper (great blog by the way, super useful!). Surely it has to be possible, since running Whisper via Python / the CLI somehow splits the audio automatically?

Hey @daniel-v-e !

The audio samples are split into 30s chunks for two reasons:

  1. The Whisper model is defined such that the inputs are always padded/truncated to 30s. Consequently, the model always expects audio samples of the same input length (30s); see the short sketch after this list. This is explained in more depth in the blog post (https://huggingface.co/blog/fine-tune-whisper#load-whisperfeatureextractor).
  2. Due to the attention mechanism in the Transformer block, memory complexity scales quadratically with input length, so doubling the audio input length quadruples the memory required by the encoder. The memory required quickly blows up as we increase the length of our audio sequences, hence the need for a cut-off to prevent out-of-memory errors.
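As a rough illustration of point 1, here is a minimal sketch (using WhisperFeatureExtractor from transformers) showing that even a short input is padded out to the same fixed-length 30s log-Mel spectrogram:

import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# 5 seconds of silence at 16 kHz -- still padded out to the full 30s window
dummy_audio = np.zeros(5 * 16000, dtype=np.float32)
features = feature_extractor(dummy_audio, sampling_rate=16000, return_tensors="np")

print(features.input_features.shape)  # (1, 80, 3000): 80 Mel bins x 3000 frames = 30s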

Really, the only way of transcribing audio samples longer than 30s is to chunk them into 30s segments, transcribe each segment individually, and stitch the transcriptions together at the boundaries. This is the same approach used by the 'official' Whisper CLI: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/transcribe.py#L175

So the two approaches (the chunked pipeline and the official CLI) are equivalent.
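In practice, this means you can hand a long local file (e.g. a 40-minute recording) straight to the chunked pipeline above and let it do the splitting and stitching for you. A minimal sketch, where the checkpoint name and audio path are placeholders for your own fine-tuned model and file:

import torch
from transformers import pipeline

# placeholders -- swap in your fine-tuned checkpoint and your audio file
ckpt = "your-username/whisper-medium-finetuned"
audio_path = "recording_40min.wav"

device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=ckpt,
    chunk_length_s=30,  # split the long file into 30s windows internally
    device=device,
)

# the pipeline chunks the file, transcribes the chunks in batches,
# and stitches the chunk-level transcriptions back together
result = pipe(audio_path, batch_size=8)
print(result["text"])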

Hey @sanchit-gandhi ,

Thanks a lot for the clarification! That is actually perfect, since I can now split the audio into 30-second chunks exactly the way Whisper does it, using the code you linked to. Much appreciated!
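For anyone following along, here is a simplified sketch of that manual chunking, assuming librosa for loading and plain fixed 30-second windows (unlike the official transcribe.py, which slides the window based on the predicted timestamps); the checkpoint and file names are placeholders:

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# placeholders -- swap in your fine-tuned checkpoint and your audio file
ckpt = "your-username/whisper-medium-finetuned"
processor = WhisperProcessor.from_pretrained(ckpt)
model = WhisperForConditionalGeneration.from_pretrained(ckpt)

# Whisper expects 16 kHz audio
audio, sr = librosa.load("long_audio.wav", sr=16000)
chunk_samples = 30 * sr  # fixed 30-second windows

texts = []
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
    generated_ids = model.generate(inputs.input_features)
    texts.append(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

print(" ".join(texts))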

daniel-v-e changed discussion status to closed
