A fixed chunk size could drop some information
Audio doesn't always align with a fixed chunk size. If the chunk size is fixed and a semantic unit (e.g. a word or phrase) gets split across two chunks, that information can be lost.
Hey @jaxmetaverse - there's a stride (overlap) equal to chunk_length / 6 that we use between chunks. This stride ensures that we get consistent transcriptions across chunk boundaries. For more details, refer to the blog post: Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers
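
For reference, here's a minimal sketch (not from the thread above) of how the stride can be set explicitly via the pipeline's `stride_length_s` argument; if it isn't passed, it defaults to `chunk_length_s / 6` on each side of every chunk:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    device=device,
)

# Example audio clip from a small test dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

# chunk_length_s=15 splits the audio into 15 s windows; stride_length_s adds an
# overlap on each side of every chunk (here 2.5 s, the default chunk_length_s / 6),
# so words that land on a chunk boundary are still transcribed consistently.
result = pipe(sample, chunk_length_s=15, stride_length_s=2.5)
print(result["text"])
```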
Thanks! The result I get is that large-v2 is better than distil-large-v2 when testing on this video: https://www.youtube.com/watch?v=xguam0TKMw8.
Some details of my setup:
- distil-large-v2 with a 30 s chunk size
- running via whisper.cpp

I'll try to run more tests.
It's best to use `chunk_length_s=15` for distil-large-v2 with `return_timestamps=False`:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

# Load Distil-Whisper in half precision on GPU (float32 on CPU)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,  # 15 s chunks recommended for distil-large-v2
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a sample from a small test dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
And `chunk_length_s=30` for large-v2 with `return_timestamps=True`:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v2"

# Load Whisper large-v2 in half precision on GPU (float32 on CPU)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,  # 30 s chunks recommended for large-v2
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a sample from a small test dataset, returning timestamps
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample, return_timestamps=True)
print(result["text"])
```