A fixed chunk size could drop some information
Audio doesn't always align with a fixed chunk size. If the chunk size is fixed and a semantic unit (e.g. a word or phrase) gets split across two chunks, that information can be lost.
Hey @jaxmetaverse - there's a stride (overlap) equal to chunk_length / 6 that we use between chunks. This stride ensures that we get consistent transcriptions across chunk boundaries. For more details, refer to the blog post: Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers
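
For reference, here's a minimal sketch (not from the thread above) of how the stride can be set explicitly via the pipeline's `stride_length_s` argument; if it isn't passed, it defaults to `chunk_length_s / 6` on each side of every chunk:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    device=device,
)

# Example audio clip from a small test dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

# chunk_length_s=15 splits the audio into 15 s windows; stride_length_s adds an
# overlap on each side of every chunk (here 2.5 s, the default chunk_length_s / 6),
# so words that land on a chunk boundary are still transcribed consistently.
result = pipe(sample, chunk_length_s=15, stride_length_s=2.5)
print(result["text"])
```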
Thanks! The result I get is that large-v2 is better than distil-large-v2 when testing on this video: https://www.youtube.com/watch?v=xguam0TKMw8.
Some details of my setup:
- distil-large-v2 with a 30 s chunk size
- running via whisper.cpp

I'll try to run more tests.
It's best to use `chunk_length_s=15` for distil-large-v2 with `return_timestamps=False`:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

# Load Distil-Whisper in half precision on GPU (float32 on CPU)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,  # 15 s chunks recommended for distil-large-v2
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a sample from a small test dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
And `chunk_length_s=30` for large-v2 with `return_timestamps=True`:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v2"

# Load Whisper large-v2 in half precision on GPU (float32 on CPU)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,  # 30 s chunks recommended for large-v2
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a sample from a small test dataset, returning timestamps
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample, return_timestamps=True)
print(result["text"])
```