---
license: mit
datasets:
- mozilla-foundation/common_voice_15_0
language:
- de
library_name: transformers
base_model: openai/whisper-large-v3
model-index:
- name: Distil-Whisper large-v3 De
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 15.0
      type: mozilla-foundation/common_voice_15_0
      args: 'Config: de'
    metrics:
    - type: wer
      value: 6.324
      name: Wer
---

# Distil-Whisper large-v3 German

This model is a knowledge-distilled version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3), trained on the German subset of the [Common Voice 15.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_15_0) dataset.
It was trained using the [Distil-Whisper training code](https://github.com/huggingface/distil-whisper/tree/main/training) on the knowledge-distillation objective, using the large-v3 model as the teacher.

It achieves the following WER results on the evaluation set:
- Normalised WER: 6.324
- Orthographic WER: 8.233

Full tensorboard logs can be found under the tab [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame),
and steps to reproduce [here](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd#training-procedure).
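
Since the checkpoint is saved in the standard Transformers format (see `library_name: transformers` above), it can be loaded with the `pipeline` API for German transcription. The snippet below is a minimal usage sketch, not taken from the training repository; the audio path `audio.mp3` is a placeholder for any German speech recording:

```python
import torch
from transformers import pipeline

# Repository id of this checkpoint
model_id = "sanchit-gandhi/distil-whisper-large-v3-de-kd"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

asr = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
)

# "audio.mp3" is a placeholder path to a German speech recording
result = asr("audio.mp3", generate_kwargs={"language": "german", "task": "transcribe"})
print(result["text"])
```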

## Model description

We copy the entire encoder module from Whisper and freeze it during training. We copy only two decoder layers, which are initialised from the first and last decoder layers of Whisper; all other decoder layers are discarded.
The model is trained on a knowledge-distillation objective. Specifically, it is trained to minimise the KL divergence between the distilled model and the Whisper model, as well as the cross-entropy loss on the labelled Common Voice audio data.
For more details, refer to the Distil-Whisper [repository](https://github.com/huggingface/distil-whisper/tree/main/training) and [paper](https://arxiv.org/abs/2311.00430).
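
As an illustration of this objective, the sketch below shows one common way of combining the two terms in PyTorch. It is not the exact code from the training script; the `alpha` weighting and `temperature` values are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.8, temperature=2.0):
    """Weighted sum of KL divergence to the teacher and cross-entropy on the labels.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len), with padded positions set to -100
    alpha and temperature are illustrative values, not the settings used for this model.
    """
    # Cross-entropy between the student predictions and the ground-truth transcriptions
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # KL divergence between the temperature-scaled student and teacher distributions
    kl_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    return alpha * kl_loss + (1.0 - alpha) * ce_loss
```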

## Training and evaluation data

The model was trained and evaluated on the German subset of the [Common Voice 15.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_15_0) dataset.
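
For reference, the German subset can be streamed directly from the Hub with the `datasets` library. This is a minimal sketch (the dataset is gated, so its terms must be accepted on the Hub first); the transcriptions live in the `sentence` column used by the training command below:

```python
from datasets import Audio, load_dataset

# Stream the German subset of Common Voice 15.0 rather than downloading it in full
common_voice = load_dataset(
    "mozilla-foundation/common_voice_15_0",
    "de",
    split="train",
    streaming=True,
)

# Decode audio at 16 kHz, the sampling rate expected by Whisper feature extractors
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

sample = next(iter(common_voice))
print(sample["sentence"])
```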

## Training procedure

To reproduce this training run, first clone and install Distil-Whisper according to the instructions [here](https://github.com/huggingface/distil-whisper/tree/main/training#requirements). 

Next, we can pick a name for our distilled model, e.g. `distil-whisper-large-v3-de-kd`. We can then run the following command to create a repository under this name:

```bash
huggingface-cli repo create distil-whisper-large-v3-de-kd
```

We can now see the model on the Hub, e.g. under https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd

Let's clone the repository so that we can place our training script and model weights inside:

```bash
git lfs install
git clone https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd
```

**Note:** Be sure to change the repo address to `https://huggingface.co/<your-user-name>/<your-repo-name>`

Next, copy the relevant training scripts from Distil-Whisper to the repository:

```bash
cd distil-whisper-large-v3-de-kd

cp ../distil-whisper/training/create_student_model.py .
cp ../distil-whisper/training/run_distillation.py .
```

The following command demonstrates how to initialise a student model from the Whisper [large-v3](https://huggingface.co/openai/whisper-large-v3) 
checkpoint, with all 32 encoder layers and 2 decoder layers. The 2 student decoder layers are copied from teacher layers 
1 and 32 respectively, as the maximally spaced layers:

```bash
#!/usr/bin/env bash

python create_student_model.py \
  --teacher_checkpoint "openai/whisper-large-v3" \
  --encoder_layers 32 \
  --decoder_layers 2 \
  --save_dir "./distil-large-v3-init"
```

The initialised model will be saved to the sub-directory `distil-large-v3-init` in our model repository, ready to be trained.
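
For intuition, the "maximally spaced" selection of teacher layers can be sketched as below; this is an illustrative reimplementation of the idea, not the code from `create_student_model.py`:

```python
import numpy as np

def maximally_spaced_layers(num_teacher_layers: int, num_student_layers: int) -> list[int]:
    """Return evenly spread (0-indexed) teacher layer indices to copy into the student."""
    return np.linspace(0, num_teacher_layers - 1, num_student_layers, dtype=int).tolist()

# For a 32-layer teacher and a 2-layer student this picks indices 0 and 31,
# i.e. teacher decoder layers 1 and 32 in 1-indexed terms.
print(maximally_spaced_layers(32, 2))  # [0, 31]
```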

We can then train the model for a total of 50k steps on the German subset of the Common Voice 15 dataset by executing the following command. Note that we train 
directly on the text labels provided in the Common Voice dataset, rather than first pseudo-labelling the dataset as was done in the original [Distil-Whisper paper](https://arxiv.org/abs/2311.00430):

```bash
#!/usr/bin/env bash

accelerate launch --mixed_precision=bf16 run_distillation.py \
  --model_name_or_path "./distil-large-v3-init" \
  --teacher_model_name_or_path "openai/whisper-large-v3" \
  --train_dataset_name "mozilla-foundation/common_voice_15_0" \
  --train_dataset_config_name "de" \
  --train_split_name "train" \
  --text_column_name "sentence" \
  --eval_dataset_name "mozilla-foundation/common_voice_15_0" \
  --eval_dataset_config_name "de" \
  --eval_split_name "validation" \
  --eval_text_column_name "sentence" \
  --eval_steps 5000 \
  --save_steps 5000 \
  --warmup_steps 500 \
  --learning_rate 1e-4 \
  --lr_scheduler_type "linear" \
  --logging_steps 25 \
  --save_total_limit 1 \
  --max_steps 50000 \
  --per_device_train_batch_size 64 \
  --per_device_eval_batch_size 64 \
  --dataloader_num_workers 16 \
  --preprocessing_num_workers 16 \
  --ddp_timeout 7200 \
  --dtype "bfloat16" \
  --output_dir "./" \
  --use_pseudo_labels "false" \
  --condition_on_prev_probability "0.0" \
  --do_train \
  --do_eval \
  --gradient_checkpointing \
  --overwrite_output_dir \
  --predict_with_generate \
  --freeze_encoder \
  --streaming \
  --push_to_hub
```

On a single 80GB A100 GPU, training will take approximately 3.5 days (or 85 hours), and reach a final WER of 6.3%. Tensorboard logs can be found under the tab [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame). 
Note that training for longer would likely have improved the final WER performance further, since the model had not fully converged after 50k train steps.
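
The 6.3% figure is the normalised WER quoted at the top of this card; the orthographic WER is computed on the raw text, without normalisation. The sketch below illustrates the difference using the `evaluate` library and the language-agnostic Whisper text normaliser; the prediction/reference strings are placeholders, and the exact normalisation used for the reported numbers may differ:

```python
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()  # strips casing and punctuation, language-agnostic

# Placeholder strings; in practice these come from model generations and the eval split
predictions = ["Das ist ein Beispiel."]
references = ["das ist ein Beispiel"]

# Orthographic WER: scored on the raw text (casing and punctuation count as errors)
orthographic_wer = wer_metric.compute(predictions=predictions, references=references)

# Normalised WER: scored after normalising both predictions and references
normalised_wer = wer_metric.compute(
    predictions=[normalizer(p) for p in predictions],
    references=[normalizer(r) for r in references],
)

print(f"Orthographic WER: {orthographic_wer:.3f}, Normalised WER: {normalised_wer:.3f}")
```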

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-04
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 50000
- mixed_precision_training: Native AMP

### Training results

Tensorboard logs can be found under the tab [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame). 


### Framework versions

- Transformers 4.36.0.dev0
- Pytorch 2.1.2+cu121
- Datasets 2.14.7.dev0
- Tokenizers 0.14.1