contribute-branch

#12
by HiveerLi
README.md CHANGED
@@ -23,24 +23,14 @@ It is a distilled version of the Whisper model that is **6 times faster**, 49% s
23
  **within 1% WER** on out-of-distribution evaluation sets. This is the repository for distil-large-v2,
24
  a distilled variant of [Whisper large-v2](https://huggingface.co/openai/whisper-large-v2).
25
 
26
- | Model | Params / M | Rel. Latency | Short-Form WER | Long-Form WER |
27
- |----------------------------------------------------------------------------|------------|----------------|------------------|-----------------|
28
- | [large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 | **8.4** | 11.0 |
29
- | [large-v2](https://huggingface.co/openai/whisper-large-v2) | 1550 | 1.0 | 9.1 | 11.7 |
30
- | | | | | |
31
- | [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3) | 756 | 6.3 | 9.7 | **10.8** |
32
- | [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2) | 756 | 5.8 | 10.1 | 11.6 |
33
- | [distil-medium.en](https://huggingface.co/distil-whisper/distil-medium.en) | 394 | **6.8** | 11.1 | 12.4 |
34
- | [distil-small.en](https://huggingface.co/distil-whisper/distil-small.en) | **166** | 5.6 | 12.1 | 12.8 |
35
-
36
- <div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400">
37
- <p><b>Update:</b> following the release of OpenAI's Whisper large-v3, an updated <a href="https://huggingface.co/distil-whisper/distil-large-v3">distil-large-v3</a> model was published. This <a href="https://huggingface.co/distil-whisper/distil-large-v3">distil-large-v3</a> model surpasses the performance of the distil-large-v2 model, with no architecture changes and better support for sequential long-form generation. Thus, it is recommended that the <a href="https://huggingface.co/distil-whisper/distil-large-v3">distil-large-v3</a> model is used in place of this distil-large-v2 model.</p>
38
- </div>
39
-
40
- **Note:** Distil-Whisper is currently only available for English speech recognition. We are working with the community
41
- to distill Whisper in other languages. If you are interested in distilling Whisper in your language, check out the
42
- provided [training code](https://github.com/huggingface/distil-whisper/tree/main/training). We will update the
43
- [Distil-Whisper repository](https://github.com/huggingface/distil-whisper/) with multilingual checkpoints when ready!
44
 
45
  ## Usage
46
 
@@ -56,7 +46,7 @@ pip install --upgrade transformers accelerate datasets[audio]
56
  ### Short-Form Transcription
57
 
58
  The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
59
- class to transcribe short-form audio files (< 30-seconds) as follows:
60
 
61
  ```python
62
  import torch
@@ -101,7 +91,7 @@ To transcribe a local audio file, simply pass the path to your audio file when y
101
 
102
  ### Long-Form Transcription
103
 
104
- Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (> 30-seconds). In practice, this chunked long-form algorithm
105
  is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
106
 
107
  To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15-seconds
@@ -154,9 +144,9 @@ result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/r
154
 
155
  ### Speculative Decoding
156
 
157
- Distil-Whisper can be used as an assistant model to Whisper for [speculative decoding](https://huggingface.co/blog/whisper-speculative-decoding).
158
- Speculative decoding mathematically ensures the exact same outputs as Whisper are obtained while being 2 times faster.
159
- This makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed.
160
 
161
  In the following code-snippet, we load the assistant Distil-Whisper model alongside the main Whisper pipeline. We then
162
  specify it as the "assistant model" for generation:
@@ -239,72 +229,21 @@ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dt
239
  + model = model.to_bettertransformer()
240
  ```
241
 
242
- ### Running Distil-Whisper in `openai-whisper`
243
-
244
- To use the model in the original Whisper format, first ensure you have the [`openai-whisper`](https://pypi.org/project/openai-whisper/) package installed:
245
-
246
- ```bash
247
- pip install --upgrade openai-whisper
248
- ```
249
-
250
- The following code-snippet demonstrates how to transcribe a sample file from the LibriSpeech dataset loaded using
251
- 🤗 Datasets:
252
-
253
- ```python
254
- import torch
255
- from datasets import load_dataset
256
- from huggingface_hub import hf_hub_download
257
- from whisper import load_model, transcribe
258
-
259
- distil_large_v2 = hf_hub_download(repo_id="distil-whisper/distil-large-v2", filename="original-model.bin")
260
- model = load_model(distil_large_v2)
261
-
262
- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
263
- sample = dataset[0]["audio"]["array"]
264
- sample = torch.from_numpy(sample).float()
265
 
266
- pred_out = transcribe(model, audio=sample)
267
- print(pred_out["text"])
268
- ```
269
 
270
- To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to transcribe:
271
 
272
- ```python
273
- pred_out = transcribe(model, audio="audio.mp3")
274
- ```
275
 
276
  ### Whisper.cpp
277
 
278
- Distil-Whisper can be run from the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) repository with the original
279
- sequential long-form transcription algorithm. In a [provisional benchmark](https://github.com/ggerganov/whisper.cpp/pull/1424#issuecomment-1793513399)
280
- on Mac M1, `distil-large-v2` is 2x faster than `large-v2`, while performing to within 0.1% WER over long-form audio.
281
-
282
- Note that future releases of Distil-Whisper will focus more on fast CPU inference! By distilling smaller encoders, we
283
- aim to achieve similar speed-ups to what we obtain on GPU.
284
-
285
- Steps for getting started:
286
- 1. Clone the Whisper.cpp repository:
287
- ```
288
- git clone https://github.com/ggerganov/whisper.cpp.git
289
- cd whisper.cpp
290
- ```
291
- 2. Download the ggml weights for `distil-large-v2` from the Hugging Face Hub:
292
-
293
- ```bash
294
- python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='distil-whisper/distil-large-v2', filename='ggml-large-32-2.en.bin', local_dir='./models')"
295
- ```
296
-
297
- Note that if you do not have the `huggingface_hub` package installed, you can also download the weights with `wget`:
298
-
299
- ```bash
300
- wget https://huggingface.co/distil-whisper/distil-large-v2/resolve/main/ggml-large-32-2.en.bin -P ./models
301
- ```
302
 
303
- 3. Run inference using the provided sample audio:
304
 
305
- ```bash
306
- make -j && ./main -m models/ggml-large-32-2.en.bin -f samples/jfk.wav
307
- ```
308
 
309
 
310
  ### Transformers.js
@@ -323,43 +262,6 @@ See the [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_
323
 
324
  *Note:* Due to the large model size, we recommend running this model server-side with [Node.js](https://huggingface.co/docs/transformers.js/guides/node-audio-processing) (instead of in-browser).
325
 
326
- ### Candle
327
-
328
- Through an integration with Hugging Face [Candle](https://github.com/huggingface/candle/tree/main) 🕯️, Distil-Whisper is
329
- now available in the Rust library 🦀
330
-
331
- Benefit from:
332
- * Optimised CPU backend with optional MKL support for x86 and Accelerate for Macs
333
- * CUDA backend for efficiently running on GPUs, multiple GPU distribution via NCCL
334
- * WASM support: run Distil-Whisper in a browser
335
-
336
- Steps for getting started:
337
- 1. Install [`candle-core`](https://github.com/huggingface/candle/tree/main/candle-core) as explained [here](https://huggingface.github.io/candle/guide/installation.html)
338
- 2. Clone the `candle` repository locally:
339
- ```
340
- git clone https://github.com/huggingface/candle.git
341
- ```
342
- 3. Enter the example directory for [Whisper](https://github.com/huggingface/candle/tree/main/candle-examples/examples/whisper):
343
- ```
344
- cd candle/candle-examples/examples/whisper
345
- ```
346
- 4. Run an example:
347
- ```
348
- cargo run --example whisper --release -- --model distil-large-v2
349
- ```
350
- 5. To specify your own audio file, add the `--input` flag:
351
- ```
352
- cargo run --example whisper --release -- --model distil-large-v2 --input audio.wav
353
- ```
354
-
355
- ### 8bit & 4bit Quantization
356
-
357
- Coming soon ...
358
-
359
- ### Whisper.cpp
360
-
361
- Coming soon ...
362
-
363
  ## Model Details
364
 
365
  Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector
@@ -516,12 +418,7 @@ where it performs to within 0.2% WER of Whisper.
516
 
517
  ## Reproducing Distil-Whisper
518
 
519
- Training and evaluation code to reproduce Distil-Whisper is available under the Distil-Whisper repository: https://github.com/huggingface/distil-whisper/tree/main/training
520
-
521
-
522
- ## License
523
-
524
- Distil-Whisper inherits the [MIT license](https://github.com/huggingface/distil-whisper/blob/main/LICENSE) from OpenAI's Whisper model.
525
 
526
  ## Citation
527
 
 
23
  **within 1% WER** on out-of-distribution evaluation sets. This is the repository for distil-large-v2,
24
  a distilled variant of [Whisper large-v2](https://huggingface.co/openai/whisper-large-v2).
25
 
26
+ | Model | Params / M | Rel. Latency | Short-Form WER | Long-Form WER |
27
+ |----------------------------------------------------------------------------|------------|--------------|----------------|---------------|
28
+ | [large-v2](https://huggingface.co/openai/whisper-large-v2) | 1550 | 1.0 | **9.1** | 11.7 |
29
+ | | | | | |
30
+ | [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2) | 756 | 5.8 | 10.1 | **11.6** |
31
+ | [distil-medium.en](https://huggingface.co/distil-whisper/distil-medium.en) | **394** | **6.8** | 11.1 | 12.4 |
32
+
33
+ **Note:** Distil-Whisper is currently only available for English speech recognition. Multilingual support will be provided in a follow-up release.
34
 
35
  ## Usage
36
 
 
46
  ### Short-Form Transcription
47
 
48
  The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
49
+ class to transcribe short-form audio files as follows:
50
 
51
  ```python
52
  import torch
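# The remainder of this snippet is elided by the diff context. What follows is
# a minimal sketch of the standard 🤗 Transformers ASR pipeline usage, not the
# verbatim README code; variable names are illustrative.
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Assemble the ASR pipeline from the model and processor components
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Pass a path (or URL) to an audio file to transcribe it
result = pipe("audio.mp3")
print(result["text"])
```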
 
91
 
92
  ### Long-Form Transcription
93
 
94
+ Distil-Whisper uses a chunked algorithm to transcribe long-form audio files. In practice, this chunked long-form algorithm
95
  is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
96
 
97
  To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15-seconds
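For reference, here is a minimal sketch of chunked inference, assuming the `model`/`processor` setup from the short-form snippet above; the `chunk_length_s` and `batch_size` values shown are illustrative:

```python
# Hedged sketch: reuses model, processor, torch_dtype and device from the
# short-form example above; parameter values are illustrative.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,  # split long audio into 15-second chunks
    batch_size=16,      # transcribe chunks in parallel batches
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("long_audio.mp3")
print(result["text"])
```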
 
144
 
145
  ### Speculative Decoding
146
 
147
+ Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
148
+ ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
149
+ replacement for existing Whisper pipelines, since the same outputs are guaranteed.
150
 
151
  In the following code-snippet, we load the assistant Distil-Whisper model alongside the main Whisper pipeline. We then
152
  specify it as the "assistant model" for generation:
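The snippet itself lies outside the diff context; below is a minimal sketch of assisted generation via the 🤗 Transformers `assistant_model` argument, with Distil-Whisper drafting for `openai/whisper-large-v2` (the exact README wording may differ):

```python
# Hedged sketch of speculative decoding: Distil-Whisper drafts tokens, Whisper
# large-v2 verifies them, so the outputs match Whisper's exactly.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The distilled model acts as the assistant (draft) model
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# The full Whisper large-v2 model verifies the drafted tokens
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("audio.mp3")["text"])
```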
 
229
  + model = model.to_bettertransformer()
230
  ```
231
 
232
+ ### 8bit & 4bit Quantization
233
 
234
+ Coming soon ...
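Until the official recipe lands, here is a minimal sketch of 8-bit loading with bitsandbytes through the standard `from_pretrained` flags; this anticipates the documented approach and is an assumption, not the official guide:

```python
# Hedged sketch only: int8 loading via bitsandbytes (pip install bitsandbytes
# accelerate). Swap load_in_8bit for load_in_4bit for 4-bit weights.
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2",
    device_map="auto",  # automatic device placement (requires accelerate)
    load_in_8bit=True,  # quantize linear layers to int8
)
```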
235
 
236
+ ### Candle
237
 
238
+ Coming soon ...
239
 
240
  ### Whisper.cpp
241
 
242
+ Coming soon ...
243
 
244
+ ### Running Distil-Whisper in `openai-whisper`
245
 
246
+ Coming soon ...
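In the interim, the snippet from the section removed above still applies, updated for the checkpoint rename recorded at the bottom of this diff (`original-model.bin` → `original-large-32-2-en.bin`); a sketch:

```python
# Sketch based on the removed section, with the filename updated to match the
# rename in this PR (original-model.bin -> original-large-32-2-en.bin).
import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from whisper import load_model, transcribe

model_path = hf_hub_download(
    repo_id="distil-whisper/distil-large-v2",
    filename="original-large-32-2-en.bin",
)
model = load_model(model_path)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = torch.from_numpy(dataset[0]["audio"]["array"]).float()

pred_out = transcribe(model, audio=sample)
print(pred_out["text"])
```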
247
 
248
 
249
  ### Transformers.js
 
262
 
263
  *Note:* Due to the large model size, we recommend running this model server-side with [Node.js](https://huggingface.co/docs/transformers.js/guides/node-audio-processing) (instead of in-browser).
264
 
265
  ## Model Details
266
 
267
  Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector
 
418
 
419
  ## Reproducing Distil-Whisper
420
 
421
+ Training and evaluation code to reproduce Distil-Whisper will be made available in the Distil-Whisper repository: https://github.com/huggingface/distil-whisper
 
422
 
423
  ## Citation
424
 
generation_config.json CHANGED
@@ -123,11 +123,10 @@
123
  "<|zh|>": 50260
124
  },
125
  "language": "<|en|>",
126
- "max_initial_timestamp_index": 50,
127
  "max_length": 448,
128
  "no_timestamps_token_id": 50363,
129
  "pad_token_id": 50257,
130
- "prev_sot_token_id": 50361,
131
  "return_timestamps": false,
132
  "suppress_tokens": [
133
  1,
 
123
  "<|zh|>": 50260
124
  },
125
  "language": "<|en|>",
126
+ "max_initial_timestamp_index": 1,
127
  "max_length": 448,
128
  "no_timestamps_token_id": 50363,
129
  "pad_token_id": 50257,
 
130
  "return_timestamps": false,
131
  "suppress_tokens": [
132
  1,
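To confirm the effect of this change, the exported value can be inspected directly; a small sketch using the 🤗 Transformers `GenerationConfig` API:

```python
# Sketch: load the generation config shipped with the model and check the
# updated field (expected to print 1 once this PR is merged).
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("distil-whisper/distil-large-v2")
print(gen_config.max_initial_timestamp_index)
```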
original-model.bin → original-large-32-2-en.bin RENAMED
File without changes
original-model.fp32.bin → original-large-32-2.fp32.bin RENAMED
File without changes