Automatic Speech Recognition
Transformers
Safetensors
Japanese
whisper
audio
hf-asr-leaderboard
Eval Results
Inference Endpoints
asahi417 committed
Commit 9a370b7
1 Parent(s): 31db69b

Update README.md

Files changed (1)
  1. README.md +598 -0
README.md CHANGED
@@ -14,3 +14,601 @@ widget:
14
  src: https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
15
  pipeline_tag: automatic-speech-recognition
16
  ---
17
+
18
+ # Kotoba-Whisper
19
+
20
+ # Distil-Whisper: distil-large-v3
21
+
22
+ Distil-Whisper was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).
23
+
24
+ This is the third and final installment of the Distil-Whisper English series. It is the knowledge-distilled version of
25
+ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3), the latest and most performant Whisper model
26
+ to date.
27
+
28
+ Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
29
+ **superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**.
30
+
31
+ The result is a distilled model that performs to within 1% WER of large-v3 on long-form audio using both the sequential
32
+ and chunked algorithms, and outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster
33
+ than previous Distil-Whisper models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.
34
+
35
+ | Model | Params / M | Rel. Latency | Short-Form | Sequential Long-Form | Chunked Long-Form |
36
+ |------------------------------------------------------------------------------|------------|--------------|------------|----------------------|-------------------|
37
+ | [large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 | 8.4 | 10.0 | 11.0 |
38
+ | **[distil-large-v3](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)** | **756** | **6.3** | **9.7** | **10.8** | **10.9** |
39
+ | [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2) | 756 | 5.8 | 10.1 | 15.6 | 11.6 |
40
+
41
+ Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
42
+ (Whisper.cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
43
+ You can expect significant performance gains by switching from previous Distil-Whisper checkpoints to distil-large-v3
44
+ when using these libraries. For convenience, the weights for the most popular libraries are already converted,
45
+ with instructions for getting started below.
46
+
47
+ ## Table of Contents
48
+
49
+ 1. [Transformers Usage](#transformers-usage)
50
+ * [Short-Form Transcription](#short-form-transcription)
51
+ * [Sequential Long-Form](#sequential-long-form)
52
+ * [Chunked Long-Form](#chunked-long-form)
53
+ * [Speculative Decoding](#speculative-decoding)
54
+ * [Additional Speed and Memory Improvements](#additional-speed--memory-improvements)
55
+ 2. [Library Integrations](#library-integrations)
56
+ * [Whisper.cpp](#whispercpp)
57
+ * [Faster Whisper](#faster-whisper)
58
+ 3. [Model Details](#model-details)
59
+
60
+
61
+ ## Transformers Usage
62
+
63
+ distil-large-v3 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
64
+ install the latest version of Transformers. For this example, we'll also install 🤗 Datasets to load a toy audio dataset
65
+ from the Hugging Face Hub:
66
+
67
+ ```bash
68
+ pip install --upgrade pip
69
+ pip install --upgrade transformers accelerate datasets[audio]
70
+ ```
71
+
72
+ ### Short-Form Transcription
73
+
74
+ The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
75
+ class to transcribe short-form audio files (< 30-seconds) as follows:
76
+
77
+ ```python
78
+ import torch
79
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
80
+ from datasets import load_dataset
81
+
82
+
83
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
84
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
85
+
86
+ model_id = "kotoba-tech/kotoba-whisper-v1.0"
87
+
88
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
89
+ model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
90
+ )
91
+ model.to(device)
92
+
93
+ processor = AutoProcessor.from_pretrained(model_id)
94
+
95
+ pipe = pipeline(
96
+ "automatic-speech-recognition",
97
+ model=model,
98
+ tokenizer=processor.tokenizer,
99
+ feature_extractor=processor.feature_extractor,
100
+ max_new_tokens=128,
101
+ torch_dtype=torch_dtype,
102
+ device=device,
103
+ )
104
+
105
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
106
+ sample = dataset[0]["audio"]
107
+
108
+ result = pipe(sample)
109
+ print(result["text"])
110
+ ```
111
+
112
+ To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
113
+ ```diff
114
+ - result = pipe(sample)
115
+ + result = pipe("audio.mp3")
116
+ ```
117
+
118
+ For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
119
+ ```python
120
+ result = pipe(sample, return_timestamps=True)
121
+ print(result["chunks"])
122
+ ```
123
+
124
+ <details>
125
+
126
+ <summary> For more control over the generation parameters, use the model + processor API directly: </summary>
127
+
128
+ Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
129
+ for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
130
+ for more details.
131
+
132
+ ```python
133
+ import torch
134
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
135
+ from datasets import Audio, load_dataset
136
+
137
+
138
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
139
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
140
+
141
+ model_id = "kotoba-tech/kotoba-whisper-v1.0"
142
+
143
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
144
+ model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
145
+ )
146
+ model.to(device)
147
+
148
+ processor = AutoProcessor.from_pretrained(model_id)
149
+
150
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
151
+ dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
152
+ sample = dataset[0]["audio"]
153
+
154
+ input_features = processor(
155
+ sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
156
+ ).input_features
157
+
158
+ input_features = input_features.to(device, dtype=torch_dtype)
159
+
160
+ gen_kwargs = {
161
+ "max_new_tokens": 128,
162
+ "num_beams": 1,
163
+ "return_timestamps": False,
164
+ }
165
+
166
+ pred_ids = model.generate(input_features, **gen_kwargs)
167
+ pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
168
+
169
+ print(pred_text)
170
+ ```
171
+
172
+ </details>
173
+
174
+ ### Sequential Long-Form
175
+
176
+ Unlike previous Distil-Whisper releases, distil-large-v3 is specifically designed to be compatible with OpenAI's sequential
177
+ long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds),
178
+ and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
179
+
180
+ The sequential long-form algorithm should be used in either of the following scenarios:
181
+ 1. Transcription accuracy is the most important factor, and latency is less of a consideration
182
+ 2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
183
+
184
+ If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm
185
+ described [below](#chunked-long-form). For a detailed explanation of the different algorithms, refer to Section 5 of
186
+ the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf).
187
+
188
+ The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
189
+ class can be used to transcribe long audio files with the sequential algorithm as follows:
190
+
191
+ ```python
192
+ import torch
193
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
194
+ from datasets import load_dataset
195
+
196
+
197
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
198
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
199
+
200
+ model_id = "kotoba-tech/kotoba-whisper-v1.0"
201
+
202
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
203
+ model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
204
+ )
205
+ model.to(device)
206
+
207
+ processor = AutoProcessor.from_pretrained(model_id)
208
+
209
+ pipe = pipeline(
210
+ "automatic-speech-recognition",
211
+ model=model,
212
+ tokenizer=processor.tokenizer,
213
+ feature_extractor=processor.feature_extractor,
214
+ max_new_tokens=128,
215
+ torch_dtype=torch_dtype,
216
+ device=device,
217
+ )
218
+
219
+ dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
220
+ sample = dataset[0]["audio"]
221
+
222
+ result = pipe(sample)
223
+ print(result["text"])
224
+ ```
225
+
226
+ <details>
227
+
228
+ <summary> For more control over the generation parameters, use the model + processor API directly: </summary>
229
+
230
+ ```python
231
+ import torch
232
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
233
+ from datasets import Audio, load_dataset
234
+
235
+
236
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
237
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
238
+
239
+ model_id = "kotoba-tech/kotoba-whisper-v1.0"
240
+
241
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
242
+ model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
243
+ )
244
+ model.to(device)
245
+
246
+ processor = AutoProcessor.from_pretrained(model_id)
247
+
248
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
249
+ dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
250
+ sample = dataset[0]["audio"]
251
+
252
+ inputs = processor(
253
+ sample["array"],
254
+ sampling_rate=sample["sampling_rate"],
255
+ return_tensors="pt",
256
+ truncation=False,
257
+ padding="longest",
258
+ return_attention_mask=True,
259
+ )
260
+ inputs = inputs.to(device, dtype=torch_dtype)
261
+
262
+ gen_kwargs = {
263
+ "max_new_tokens": 448,
264
+ "num_beams": 1,
265
+ "condition_on_prev_tokens": False,
266
+ "compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
267
+ "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
268
+ "logprob_threshold": -1.0,
269
+ "no_speech_threshold": 0.6,
270
+ "return_timestamps": True,
271
+ }
272
+
273
+ pred_ids = model.generate(**inputs, **gen_kwargs)
274
+ pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
275
+
276
+ print(pred_text)
277
+ ```
278
+
279
+ </details>
280
+
281
+ ### Chunked Long-Form
282
+
283
+ distil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when
284
+ a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
285
+ the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the
286
+ [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).
287
+
288
+ To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds
289
+ is optimal. To activate batching over long audio files, pass the argument `batch_size`:
290
+
291
+ ```python
292
+ import torch
293
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
294
+ from datasets import load_dataset
295
+
296
+
297
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
298
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
299
+
300
+ model_id = "kotoba-tech/kotoba-whisper-v1.0"
301
+
302
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
303
+ model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
304
+ )
305
+ model.to(device)
306
+
307
+ processor = AutoProcessor.from_pretrained(model_id)
308
+
309
+ pipe = pipeline(
310
+ "automatic-speech-recognition",
311
+ model=model,
312
+ tokenizer=processor.tokenizer,
313
+ feature_extractor=processor.feature_extractor,
314
+ max_new_tokens=128,
315
+ chunk_length_s=25,
316
+ batch_size=16,
317
+ torch_dtype=torch_dtype,
318
+ device=device,
319
+ )
320
+
321
+ dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
322
+ sample = dataset[0]["audio"]
323
+
324
+ result = pipe(sample)
325
+ print(result["text"])
326
+ ```
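
### Speculative Decoding

Because the distilled model uses the same tokenizer and vocabulary as Whisper, it can in principle serve as an assistant model for speculative decoding: the distilled model drafts candidate tokens that Whisper large-v3 then verifies, so the final transcription matches what large-v3 alone would produce, at lower latency. The snippet below is a minimal sketch of this setup, assuming kotoba-whisper-v1.0 is used as the assistant to `openai/whisper-large-v3`; it has not been benchmarked here.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# load the distilled model as the draft (assistant) model
assistant_model_id = "kotoba-tech/kotoba-whisper-v1.0"
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

# load the full Whisper large-v3 model as the main (verifier) model
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```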
327
+
328
+
329
+ ### Additional Speed & Memory Improvements
330
+
331
+ You can apply additional speed and memory improvements to Distil-Whisper to further reduce the inference time and VRAM
332
+ requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
333
+ more efficient flash attention version.
334
+
335
+ #### Flash Attention 2
336
+
337
+ We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
338
+ if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
339
+
340
+ ```bash
341
+ pip install flash-attn --no-build-isolation
342
+ ```
343
+
344
+ Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
345
+
346
+ ```diff
347
+ - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
348
+ + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2")
349
+ ```
350
+
351
+ #### Torch Scaled Dot-Product Attention (SDPA)
352
+
353
+ If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).
354
+ This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check
355
+ whether you have a compatible PyTorch version, run the following Python code snippet:
356
+
357
+ ```python
358
+ from transformers.utils import is_torch_sdpa_available
359
+
360
+ print(is_torch_sdpa_available())
361
+ ```
362
+
363
+ If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it
364
+ returns `False`, you need to upgrade your PyTorch version according to the [official instructions](https://pytorch.org/get-started/locally/).
365
+
366
+ Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
367
+ `attn_implementation="sdpa"` as follows:
368
+
369
+ ```diff
370
+ - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
371
+ + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
372
+ ```
373
+
374
+ ## Library Integrations
375
+
376
+ ### Whisper.cpp
377
+
378
+ Distil-Whisper can be run with the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) package with the original
379
+ sequential long-form transcription algorithm. In a provisional benchmark on Mac M1, distil-large-v3 is over 5x faster
380
+ than Whisper large-v3, while performing to within 0.8% WER over long-form audio.
381
+
382
+ Steps for getting started:
383
+
384
+ 1. Clone the Whisper.cpp repository:
385
+ ```bash
386
+ git clone https://github.com/ggerganov/whisper.cpp.git
387
+ cd whisper.cpp
388
+ ```
389
+ 2. Install the Hugging Face Hub Python package:
390
+ ```bash
391
+ pip install --upgrade huggingface_hub
392
+ ```
393
+ And download the GGML weights for distil-large-v3 using the following Python snippet:
394
+
395
+ ```python
396
+ from huggingface_hub import hf_hub_download
397
+
398
+ hf_hub_download(repo_id='kotoba-tech/kotoba-whisper-v1.0-ggml', filename='ggml-distil-large-v3.bin', local_dir='./models')
399
+ ```
400
+
401
+ Note that if you do not have a Python environment set up, you can also download the weights directly with `wget`:
402
+
403
+ ```bash
404
+ wget https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0-ggml/resolve/main/ggml-distil-large-v3.bin -P ./models
405
+ ```
406
+
407
+ 3. Run inference using the provided sample audio:
408
+
409
+ ```bash
410
+ make -j && ./main -m models/ggml-distil-large-v3.bin -f samples/jfk.wav
411
+ ```
412
+
413
+ ### Faster-Whisper
414
+
415
+ Faster-Whisper is a reimplementation of Whisper using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), a fast
416
+ inference engine for Transformer models.
417
+
418
+ First, install the Faster-Whisper package according to the [official instructions](https://github.com/SYSTRAN/faster-whisper#installation).
419
+ For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:
420
+
421
+ ```bash
422
+ pip install --upgrade pip
423
+ pip install --upgrade git+https://github.com/SYSTRAN/faster-whisper datasets[audio]
424
+ ```
425
+
426
+ The following code snippet loads the distil-large-v3 model and runs inference on an example file from the LibriSpeech ASR
427
+ dataset:
428
+
429
+ ```python
430
+ import torch
431
+ from faster_whisper import WhisperModel
432
+ from datasets import load_dataset
433
+
434
+ # define our torch configuration
435
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
436
+ compute_type = "float16" if torch.cuda.is_available() else "float32"
437
+
438
+ # load model on GPU if available, else cpu
439
+ model = WhisperModel("distil-large-v3", device=device, compute_type=compute_type)
440
+
441
+ # load toy dataset for example
442
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
443
+ sample = dataset[1]["audio"]["path"]
444
+
445
+ segments, info = model.transcribe(sample, beam_size=1)
446
+
447
+ for segment in segments:
448
+ print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
449
+ ```
450
+
451
+ To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to transcribe:
452
+
453
+ ```python
454
+ segments, info = model.transcribe("audio.mp3", beam_size=1)
455
+ ```
456
+
457
+
458
+ ## Model Details
459
+
460
+ Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector
461
+ inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all
462
+ previous tokens and the encoder hidden-states. Consequently, the encoder is only run forward once, whereas the decoder
463
+ is run as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of
464
+ total inference time. Thus, to optimise for latency, the focus is on minimising the inference time of the decoder.
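
The split between encoder and decoder depth can be checked directly from the model configuration. The snippet below is a small illustrative check, assuming the checkpoint exposes the standard `WhisperConfig` fields:

```python
from transformers import AutoConfig

# compare the depth of the distilled model against its teacher
student = AutoConfig.from_pretrained("kotoba-tech/kotoba-whisper-v1.0")
teacher = AutoConfig.from_pretrained("openai/whisper-large-v3")

print(f"student: {student.encoder_layers} encoder layers, {student.decoder_layers} decoder layers")
print(f"teacher: {teacher.encoder_layers} encoder layers, {teacher.decoder_layers} decoder layers")
```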
465
+
466
+ To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed.
467
+ The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training.
468
+ The student's decoder consists of a subset of the teacher decoder layers, which are initialised from maximally spaced layers.
469
+ The model is then trained on a weighted sum of the KL divergence and pseudo-label loss terms.
470
+
471
+ <p align="center">
472
+ <img src="https://huggingface.co/datasets/distil-whisper/figures/resolve/main/architecture.png?raw=true" width="600"/>
473
+ </p>
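
The following is an illustrative sketch of the two ingredients described above, not the actual training code: the student decoder layers are initialised from maximally spaced teacher layers, and training minimises a weighted sum of a KL-divergence term against the teacher distribution and a cross-entropy term against the pseudo-labels. The layer counts, loss weights and temperature shown here are placeholders.

```python
import numpy as np
import torch.nn.functional as F

# choose maximally spaced teacher decoder layers to initialise the student decoder
# (placeholder sizes: a 32-layer teacher decoder distilled into a 2-layer student)
teacher_layers, student_layers = 32, 2
layer_ids = np.linspace(0, teacher_layers - 1, student_layers).round().astype(int)
print(layer_ids)  # [ 0 31]

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha_ce=1.0, alpha_kl=1.0, temperature=2.0):
    """Weighted sum of pseudo-label cross-entropy and KL divergence to the teacher."""
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return alpha_ce * ce + alpha_kl * kl
```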
474
+
475
+ ## Evaluation
476
+
477
+ The following code snippet demonstrates how to evaluate the Distil-Whisper model on the LibriSpeech validation-clean
478
+ dataset with [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet), meaning no
479
+ audio data has to be downloaded to your local device.
480
+
481
+ First, we need to install the required packages, including 🤗 Datasets to stream and load the audio data, and 🤗 Evaluate to
482
+ perform the WER calculation:
483
+
484
+ ```bash
485
+ pip install --upgrade pip
486
+ pip install --upgrade transformers datasets[audio] evaluate jiwer
487
+ ```
488
+
489
+ Evaluation can then be run end-to-end with the following example:
490
+
491
+ ```python
492
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
493
+ from datasets import load_dataset
494
+ from evaluate import load
495
+ import torch
496
+ from tqdm import tqdm
497
+
498
+ # define our torch configuration
499
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
500
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
501
+
502
+ model_id = "kotoba-tech/kotoba-whisper-v1.0"
503
+
504
+ # load the model + processor
505
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, use_safetensors=True, low_cpu_mem_usage=True)
506
+ model = model.to(device)
507
+ processor = AutoProcessor.from_pretrained(model_id)
508
+
509
+ # load the dataset with streaming mode
510
+ dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
511
+
512
+ # define the evaluation metric
513
+ wer_metric = load("wer")
514
+
515
+ def inference(batch):
516
+ # 1. Pre-process the audio data to log-mel spectrogram inputs
517
+ audio = [sample["array"] for sample in batch["audio"]]
518
+ input_features = processor(audio, sampling_rate=batch["audio"][0]["sampling_rate"], return_tensors="pt").input_features
519
+ input_features = input_features.to(device, dtype=torch_dtype)
520
+
521
+ # 2. Auto-regressively generate the predicted token ids
522
+ pred_ids = model.generate(input_features, max_new_tokens=128)
523
+
524
+ # 3. Decode the token ids to the final transcription
525
+ batch["transcription"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
526
+ batch["reference"] = batch["text"]
527
+ return batch
528
+
529
+ # batch size 16 inference
530
+ dataset = dataset.map(function=inference, batched=True, batch_size=16)
531
+
532
+ all_transcriptions = []
533
+ all_references = []
534
+
535
+ # iterate over the dataset and run inference
536
+ for result in tqdm(dataset, desc="Evaluating..."):
537
+ all_transcriptions.append(result["transcription"])
538
+ all_references.append(result["reference"])
539
+
540
+ # normalize predictions and references
541
+ all_transcriptions = [processor.tokenizer.normalize(transcription) for transcription in all_transcriptions]
542
+ all_references = [processor.tokenizer.normalize(reference) for reference in all_references]
543
+
544
+ # compute the WER metric
545
+ wer = 100 * wer_metric.compute(predictions=all_transcriptions, references=all_references)
546
+ print(wer)
547
+
548
+ ```
549
+ **Print Output:**
550
+ ```
551
+ 2.428920763531516
552
+ ```
553
+
554
+
555
+ ## Data
556
+
557
+ Distil-Whisper is trained on 22,000 hours of audio data from nine open-source, permissively licensed speech datasets on the
558
+ Hugging Face Hub:
559
+
560
+ | Dataset | Size / h | Speakers | Domain | Licence |
561
+ |-----------------------------------------------------------------------------------------|----------|----------|-----------------------------|-----------------|
562
+ | [People's Speech](https://huggingface.co/datasets/MLCommons/peoples_speech) | 12,000 | unknown | Internet Archive | CC-BY-SA-4.0 |
563
+ | [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) | 3,000 | unknown | Narrated Wikipedia | CC0-1.0 |
564
+ | [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | 2,500 | unknown | Audiobook, podcast, YouTube | apache-2.0 |
565
+ | Fisher | 1,960 | 11,900 | Telephone conversations | LDC |
566
+ | [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | 960 | 2,480 | Audiobooks | CC-BY-4.0 |
567
+ | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 540 | 1,310 | European Parliament | CC0 |
568
+ | [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | 450 | 2,030 | TED talks | CC-BY-NC-ND 3.0 |
569
+ | SwitchBoard | 260 | 540 | Telephone conversations | LDC |
570
+ | [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | 100 | unknown | Meetings | CC-BY-4.0 |
571
+ ||||||
572
+ | **Total** | 21,770 | 18,260+ | | |
573
+
574
+ The combined dataset spans 10 distinct domains and over 50k speakers. The diversity of this dataset is crucial to ensuring
575
+ the distilled model is robust to audio distributions and noise.
576
+
577
+ The audio data is then pseudo-labelled using the Whisper large-v3 model: we use Whisper to generate predictions for all
578
+ the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
579
+ transcriptions are consistently formatted across datasets and provides sequence-level distillation signal during training.
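
Schematically, the pseudo-labelling step amounts to transcribing every training example with the teacher and storing the transcription as the target. The snippet below is an illustrative sketch on a toy dataset, not the labelling pipeline that was actually used:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# teacher model used to generate the pseudo-labels
teacher = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch_dtype,
    device=device,
)

# transcribe a toy dataset: each teacher transcription becomes a training target
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
pseudo_labels = [teacher(sample["audio"])["text"] for sample in dataset]
print(pseudo_labels[0])
```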
580
+
581
+ ## WER Filter
582
+
583
+ The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
584
+ accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
585
+ and the ground truth labels provided by each dataset. We then compute the WER between these labels. If the WER exceeds
586
+ a specified threshold, we discard the training example. Otherwise, we keep it for training.
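
A minimal sketch of this heuristic is shown below; the normaliser comes from the Whisper tokenizer, and the threshold value is a placeholder rather than the one used for training:

```python
from evaluate import load
from transformers import WhisperTokenizer

wer_metric = load("wer")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")

def keep_example(ground_truth: str, pseudo_label: str, threshold: float = 10.0) -> bool:
    """Keep a training example only if the WER (in %) between the normalised
    ground-truth transcript and the normalised pseudo-label does not exceed the threshold."""
    reference = tokenizer.normalize(ground_truth)
    prediction = tokenizer.normalize(pseudo_label)
    wer = 100 * wer_metric.compute(predictions=[prediction], references=[reference])
    return wer <= threshold

print(keep_example("the cat sat on the mat", "the cat sat on a mat"))
```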
587
+
588
+ Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demonstrates the effectiveness of this filter
589
+ for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
590
+ hallucinations to this filter.
591
+
592
+ ## Training
593
+
594
+ The model was trained for 80,000 optimisation steps (or 11 epochs) with batch size 256. The Tensorboard training logs can
595
+ be found under: https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0/tensorboard?params=scalars#frame
596
+
597
+ ## Results
598
+
599
+ The distilled model performs to within 1.5% WER of Whisper large-v3 on out-of-distribution (OOD) short-form audio, within
600
+ 1% WER on sequential long-form decoding, and outperforms large-v3 by 0.1% on chunked long-form. This performance gain is
601
+ attributed to lower hallucinations.
602
+
603
+ For a detailed per-dataset breakdown of the evaluation results, refer to Tables 16 and 17 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430).
604
+
605
+ Distil-Whisper is also evaluated on the [ESB benchmark](https://arxiv.org/abs/2210.13352) datasets as part of the [OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard),
606
+ where it performs to within 0.2% WER of Whisper.
607
+
608
+ ## Reproducing Kotoba-Whisper
609
+ Training and evaluation code to reproduce Kotoba-Whisper is available at the repository: [TBA](TBA).
610
+
611
+ ## Acknowledgements
612
+ * OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
613
+ * Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
614
+ * Hugging Face 🤗 for sharing the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).