sanchit-gandhi committed
Commit 299b03b · 1 Parent(s): 79aace4
Update README.md
README.md CHANGED
@@ -351,8 +351,8 @@ This code snippet shows how to evaluate Whisper Tiny on [LibriSpeech test-clean]
 The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
 algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers
 [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline.
-predict
+method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
+can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`:

 ```python
 >>> import torch
@@ -363,7 +363,7 @@ predict utterance level timestamps by passing `return_timestamps=True`:

 >>> pipe = pipeline(
 >>> "automatic-speech-recognition",
->>> model="openai/whisper-
+>>> model="openai/whisper-large-v2",
 >>> chunk_length_s=30,
 >>> device=device,
 >>> )
@@ -371,15 +371,17 @@ predict utterance level timestamps by passing `return_timestamps=True`:
 >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 >>> sample = ds[0]["audio"]

->>> prediction = pipe(sample.copy())["text"]
+>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
 " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

 >>> # we can also return timestamps for the predictions
->>> prediction = pipe(sample, return_timestamps=True)["chunks"]
+>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
 [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
 'timestamp': (0.0, 5.44)}]
 ```

+Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
+
 ## Fine-Tuning

 The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,