sanchit-gandhi committed
Commit: 1f66457
Parent: 8be909d

Update README.md

Files changed (1):
  1. README.md (+6, -4)
README.md CHANGED
@@ -309,8 +309,8 @@ This code snippet shows how to evaluate Whisper Large on [LibriSpeech test-clean
 The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
 algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers
 [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. It can also be extended to
-predict utterance level timestamps by passing `return_timestamps=True`:
+method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
+can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`:
 
 ```python
 >>> import torch
@@ -329,15 +329,17 @@ predict utterance level timestamps by passing `return_timestamps=True`:
 >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 >>> sample = ds[0]["audio"]
 
->>> prediction = pipe(sample.copy())["text"]
+>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
 " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
 
 >>> # we can also return timestamps for the predictions
->>> prediction = pipe(sample, return_timestamps=True)["chunks"]
+>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
 [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]
 ```
 
+Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
+
 ## Fine-Tuning
 
 The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
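
For context, the pipeline instantiation sits between the two hunks and is not shown in the diff. A minimal runnable sketch of the updated snippet might look as follows; the checkpoint name `openai/whisper-large-v2` and the device-selection line are assumptions, since neither appears in the hunks above:

```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> # chunk_length_s=30 enables the chunking algorithm for long-form audio
>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="openai/whisper-large-v2",  # assumed checkpoint; not visible in the diff
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> # batch_size controls how many 30s chunks are transcribed in parallel
>>> prediction = pipe(sample.copy(), batch_size=8)["text"]

>>> # sequence level timestamps for each transcribed chunk
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
```

The snippet passes `sample.copy()` rather than `sample`, most likely because the pipeline consumes the audio dict in place, so a copy keeps the cached dataset entry intact for reuse.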