Fixed chunk size could drop some information

#17
by jaxmetaverse - opened

Because audio doesn't always fit a fixed chunk size: if the chunk size is fixed and a semantic unit can't be split across two chunks, that information will be dropped.

Whisper Distillation org

Hey @jaxmetaverse - there's a stride (overlap) equal to chunk_length / 6 that we use between chunks. This stride ensures that we get consistent transcriptions across chunks. For more details, refer to the blog post: Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers
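For illustration, here's a minimal sketch of how that overlap is exposed in the 🤗 Transformers ASR pipeline via the `stride_length_s` argument (which defaults to `chunk_length_s / 6` on each side of a chunk); the audio file name is a placeholder:

from transformers import pipeline

# minimal sketch: CPU, default settings, placeholder audio path
pipe = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v2")

# chunk_length_s sets the window size; stride_length_s sets the overlap between
# consecutive chunks (defaults to chunk_length_s / 6 on each side), which is how
# context at a fixed chunk boundary is preserved
result = pipe(
    "audio.mp3",
    chunk_length_s=15,
    stride_length_s=2.5,  # ~15 / 6; can also be a [left, right] pair
)
print(result["text"])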

Thanks. The result I get is that large-v2 is better than distil-large-v2 when testing this video: "https://www.youtube.com/watch?v=xguam0TKMw8".
My setup is as follows:

  1. distil-large-v2 with a chunk size of 30s.
  2. using whisper.cpp.

I'll run more tests.

Whisper Distillation org
edited Nov 29, 2023

It's best to use chunk_length_s=15 for distil-large-v2 with return_timestamps=False:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

And chunk_length_s=30 for large-v2 with return_timestamps=True:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample, return_timestamps=True)
print(result["text"])
