Files changed (1)

README.md  +31 -43
@@ -172,10 +172,11 @@ pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
```

- ### Short-Form Transcription
-
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
- class to transcribe short-form audio files (< 30-seconds) as follows:
+ class to transcribe audio files of arbitrary length. Transformers uses a chunked algorithm to transcribe
+ long-form audio files, which in practice is 9x faster than the sequential algorithm proposed by OpenAI
+ (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)). The batch size should
+ be set based on the specifications of your device:

```python
import torch

@@ -201,11 +202,14 @@ pipe = pipeline(
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
+ chunk_length_s=30,
+ batch_size=16,
+ return_timestamps=True,
torch_dtype=torch_dtype,
device=device,
)

- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)

@@ -218,59 +222,43 @@ To transcribe a local audio file, simply pass the path to your audio file when y
+ result = pipe("audio.mp3")
```

- ### Long-Form Transcription
-
- Through Transformers Whisper uses a chunked algorithm to transcribe long-form audio files (> 30-seconds). In practice, this chunked long-form algorithm
- is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
-
- To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:

```python
- import torch
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
- from datasets import load_dataset
-
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-
- model_id = "openai/whisper-large-v3"
-
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
- )
- model.to(device)
-
- processor = AutoProcessor.from_pretrained(model_id)
-
- pipe = pipeline(
- "automatic-speech-recognition",
- model=model,
- tokenizer=processor.tokenizer,
- feature_extractor=processor.feature_extractor,
- max_new_tokens=128,
- chunk_length_s=15,
- batch_size=16,
- torch_dtype=torch_dtype,
- device=device,
- )
-
- dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
- sample = dataset[0]["audio"]
-
- result = pipe(sample)
- print(result["text"])
```

- <!---
- **Tip:** The pipeline can also be used to transcribe an audio file from a remote URL, for example:

```python
- result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
```
- --->

- ### Speculative Decoding

+ Whisper predicts the language of the source audio automatically. If the source audio language is known *a priori*, it
+ can be passed as an argument to the pipeline:

```python
+ result = pipe(sample, generate_kwargs={"language": "english"})
+ ```

+ By default, Whisper performs the task of *speech transcription*, where the source audio language is the same as the target
+ text language. To perform *speech translation*, where the target text is in English, set the task to `"translate"`:

+ ```python
+ result = pipe(sample, generate_kwargs={"task": "translate"})
+ ```

+ Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument:

+ ```python
+ result = pipe(sample, return_timestamps=True)
+ print(result["chunks"])
+ ```

+ And for word-level timestamps:

+ ```python
+ result = pipe(sample, return_timestamps="word")
+ print(result["chunks"])
  ```
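
Each entry of `result["chunks"]` returned by the Transformers ASR pipeline pairs a piece of decoded text with a `(start, end)` timestamp tuple in seconds, so the output can be iterated directly. A minimal sketch, assuming the `result` from the word-level example above:

```python
# Each chunk pairs the decoded text with a (start, end) tuple of seconds;
# with return_timestamps="word" there is one chunk per word
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} -> {end}] {chunk['text']}")
```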

+ The above arguments can be used in isolation or in combination. For example, to perform speech transcription where the
+ source audio is in French and sentence-level timestamps are required, the following can be used:

```python
+ result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french"})
+ print(result["chunks"])
```

+ ## Speculative Decoding

Whisper `tiny` can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
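
A rough sketch of how Whisper `tiny` could be hooked into the pipeline above as the assistant for speculative decoding; the `openai/whisper-tiny` checkpoint and passing `assistant_model` through `generate_kwargs` are assumptions based on the Transformers assisted-generation API, not something this diff specifies:

```python
from transformers import AutoModelForSpeechSeq2Seq

# Assumed draft model: Whisper tiny, loaded in the same dtype as the main model above
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-tiny", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

# Hand the assistant to generate() through the pipeline call; assisted generation
# works with an effective batch size of 1, so drop batch_size=16 when using it
result = pipe(sample, generate_kwargs={"assistant_model": assistant_model})
print(result["text"])
```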