---
license: mit
datasets:
- mozilla-foundation/common_voice_15_0
language:
- de
library_name: transformers
base_model: openai/whisper-large-v3
model-index:
- name: Distil-Whisper large-v3 De
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 15.0
      type: mozilla-foundation/common_voice_15_0
      args: 'Config: de'
    metrics:
    - type: wer
      value: 6.324
      name: Wer
---
# Distil-Whisper large-v3 German
This model is a knowledge-distilled version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the German subset of the [Common Voice 15.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_15_0) dataset.
It was trained using the [Distil-Whisper training code](https://github.com/huggingface/distil-whisper/tree/main/training) on the knowledge-distillation objective, using the large-v3 model as the teacher.
It achieves the following WER results on the evaluation set:
- Normalised WER: 6.324
- Orthographic WER: 8.233
Full TensorBoard logs are available under the [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame) tab,
and the steps to reproduce the run are described under [Training procedure](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd#training-procedure).
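The difference between the two WER figures above comes from text normalisation before scoring. The following is a minimal sketch of that idea, not the evaluation code used for this model: the `normalise` function is a toy stand-in (lowercasing and punctuation stripping are assumptions), and the sentences are made-up examples.

```python
import re


def normalise(text: str) -> str:
    """Toy normaliser (an assumption): lowercase and strip punctuation."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical reference transcript and model output.
reference = "Guten Morgen, wie geht es Ihnen?"
hypothesis = "guten morgen wie geht es ihnen"

orthographic_wer = word_error_rate(reference, hypothesis)
normalised_wer = word_error_rate(normalise(reference), normalise(hypothesis))

print(f"Orthographic WER: {orthographic_wer:.1f}")  # 50.0
print(f"Normalised WER:   {normalised_wer:.1f}")    # 0.0
```

Here the hypothesis differs from the reference only in casing and punctuation, so three of six words count as errors orthographically and none after normalisation.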
## Model description
We copy the entire encoder module and freeze it during training. We copy only two decoder layers, which are initialised from the first and last decoder layers from Whisper. All other decoder layers from Whisper are discarded.
The model is trained on a knowledge distillation objective. Specifically, it is trained to minimise the KL divergence between the distilled model and the Whisper model, as well as the cross-entropy loss on the labelled Common Voice audio data.
For more details, refer to the Distil-Whisper [repository](https://github.com/huggingface/distil-whisper/tree/main/training) and [paper](https://arxiv.org/abs/2311.00430).
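The objective described above can be illustrated for a single decoding position with plain Python. This is a sketch under stated assumptions rather than the training code: the KL direction (teacher relative to student), the loss weights `kl_weight` and `ce_weight`, and the toy logits are all chosen for illustration.

```python
import math


def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): how far the student is from the teacher."""
    return sum(p * math.log(p / q) for p, q in zip(teacher_probs, student_probs) if p > 0)


def cross_entropy(student_probs, label_id):
    """Negative log-likelihood of the ground-truth token under the student."""
    return -math.log(student_probs[label_id])


def distillation_loss(teacher_logits, student_logits, label_id, kl_weight=1.0, ce_weight=1.0):
    """Weighted sum of the KL and cross-entropy terms (weights are hypothetical)."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    kl = kl_divergence(teacher_probs, student_probs)
    ce = cross_entropy(student_probs, label_id)
    return kl_weight * kl + ce_weight * ce, kl, ce


# Toy three-token vocabulary; the labelled token has index 0.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, 0.0]
loss, kl, ce = distillation_loss(teacher_logits, student_logits, label_id=0)
print(f"KL: {kl:.4f}  CE: {ce:.4f}  total: {loss:.4f}")
```

When the student reproduces the teacher's distribution exactly, the KL term vanishes and only the cross-entropy on the Common Voice label remains.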
## Training and evaluation data
The model was trained and evaluated on the German subset of the [Common Voice 15.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_15_0) dataset.
## Training procedure
To reproduce this training run, first clone and install Distil-Whisper according to the instructions [here](https://github.com/huggingface/distil-whisper/tree/main/training#requirements).
Next, we can pick a name for our distilled model, e.g. `distil-whisper-large-v3-de-kd`. We can then run the following command to create a repository under this name:
```bash
huggingface-cli repo create distil-whisper-large-v3-de-kd
```
We can now see the model on the Hub, e.g. under https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd
Let's clone the repository so that we can place our training script and model weights inside:
```bash
git lfs install
git clone https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd
```
**Note:** Be sure to change the repo address to `https://huggingface.co/<your-user-name>/<your-repo-name>`
Next, copy the relevant training scripts from Distil-Whisper to the repository:
```bash
cd distil-whisper-large-v3-de-kd
cp ../distil-whisper/training/create_student_model.py .
cp ../distil-whisper/training/run_distillation.py .
```
The following command demonstrates how to initialise a student model from the Whisper [large-v3](https://huggingface.co/openai/whisper-large-v3)
checkpoint, with all 32 encoder layers and 2 decoder layers. The 2 student decoder layers are copied from teacher layers
1 and 32 respectively, as the maximally spaced layers:
```bash
#!/usr/bin/env bash
python create_student_model.py \
  --teacher_checkpoint "openai/whisper-large-v3" \
  --encoder_layers 32 \
  --decoder_layers 2 \
  --save_dir "./distil-large-v3-init"
```
The initialised model will be saved to the sub-directory `distil-large-v3-init` in our model repository, ready to be trained.
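To make the "maximally spaced" layer choice concrete, here is a small sketch of how evenly spaced teacher layers could be picked for a given student depth. It is an illustration only and does not reproduce `create_student_model.py`: the function name, the 1-based indexing (chosen to match the layer numbers above), and the handling of a single-layer student are assumptions.

```python
def maximally_spaced_layers(num_teacher_layers: int, num_student_layers: int) -> list[int]:
    """Return 1-based teacher layer indices spread evenly from first to last."""
    if not 1 <= num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    if num_student_layers == 1:
        # Assumption for this sketch: a single-layer student keeps the last teacher layer.
        return [num_teacher_layers]
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(i * step) + 1 for i in range(num_student_layers)]


# Decoder: 2 student layers out of 32 teacher layers.
print(maximally_spaced_layers(32, 2))  # [1, 32]

# Encoder: all 32 layers are kept, so the mapping is the identity.
print(maximally_spaced_layers(32, 32) == list(range(1, 33)))  # True
```

With two student layers the rule reduces to the first and last teacher layers, which is the configuration used for this model.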
We can then train the model for a total of 50k steps on the German subset of the Common Voice 15 dataset by executing the following command. Note that we train
directly on the text labels provided in the Common Voice dataset, rather than first pseudo-labelling the dataset as was done in the original [Distil-Whisper paper](https://arxiv.org/abs/2311.00430):
```bash
#!/usr/bin/env bash
accelerate launch --mixed_precision=bf16 run_distillation.py \
  --model_name_or_path "./distil-large-v3-init" \
  --teacher_model_name_or_path "openai/whisper-large-v3" \
  --train_dataset_name "mozilla-foundation/common_voice_15_0" \
  --train_dataset_config_name "de" \
  --train_split_name "train" \
  --text_column_name "sentence" \
  --eval_dataset_name "mozilla-foundation/common_voice_15_0" \
  --eval_dataset_config_name "de" \
  --eval_split_name "validation" \
  --eval_text_column_name "sentence" \
  --eval_steps 5000 \
  --save_steps 5000 \
  --warmup_steps 500 \
  --learning_rate 1e-4 \
  --lr_scheduler_type "linear" \
  --logging_steps 25 \
  --save_total_limit 1 \
  --max_steps 50000 \
  --per_device_train_batch_size 64 \
  --per_device_eval_batch_size 64 \
  --dataloader_num_workers 16 \
  --preprocessing_num_workers 16 \
  --ddp_timeout 7200 \
  --dtype "bfloat16" \
  --output_dir "./" \
  --use_pseudo_labels "false" \
  --condition_on_prev_probability "0.0" \
  --do_train \
  --do_eval \
  --gradient_checkpointing \
  --overwrite_output_dir \
  --predict_with_generate \
  --freeze_encoder \
  --streaming \
  --push_to_hub
```
On a single 80GB A100 GPU, training takes approximately 3.5 days (about 85 hours) and reaches a final normalised WER of 6.3%. TensorBoard logs can be found under the tab [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame).
Note that training for longer would likely have improved the final WER performance further, since the model had not fully converged after 50k train steps.
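As a rough sanity check on the training budget, the figures above can be combined into a sample count and an average step time. The sketch below uses only values from the command and the reported wall-clock time; the gradient accumulation factor of 1 is an assumption, and the step time is an average that also absorbs evaluation and checkpointing.

```python
# Values taken from the training command and the reported wall-clock time.
max_steps = 50_000
per_device_train_batch_size = 64
num_gpus = 1
training_hours = 85

# Assumption: no gradient accumulation, so one optimiser step consumes one batch.
gradient_accumulation_steps = 1

samples_seen = max_steps * per_device_train_batch_size * num_gpus * gradient_accumulation_steps
seconds_per_step = training_hours * 3600 / max_steps

print(f"samples seen: {samples_seen:,}")             # samples seen: 3,200,000
print(f"seconds per step: {seconds_per_step:.2f}")   # seconds per step: 6.12
```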
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-04
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 50000
- mixed_precision_training: Native AMP
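The learning-rate hyperparameters above describe a linear warmup followed by a linear decay. The sketch below shows the standard shape of such a schedule using the listed values; it is written from scratch for illustration, and the function name and the assumption that the rate decays to exactly zero at the final step are not taken from the training script.

```python
def linear_schedule_lr(step: int, peak_lr: float = 1e-4,
                       warmup_steps: int = 500, total_steps: int = 50_000) -> float:
    """Learning rate at a given optimiser step for linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from the peak to 0 at the final training step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 25_250, 50_000):
    print(f"step {step:>6}: lr = {linear_schedule_lr(step):.2e}")
```

The peak of 1e-4 is reached at step 500, and the rate is back to half of the peak at step 25,250, the midpoint of the decay phase.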
### Training results
Tensorboard logs can be found under the tab [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame).
### Framework versions
- Transformers 4.36.0.dev0
- Pytorch 2.1.2+cu121
- Datasets 2.14.7.dev0
- Tokenizers 0.14.1