---
license: mit
datasets:
- mozilla-foundation/common_voice_15_0
language:
- de
library_name: transformers
base_model: openai/whisper-large-v3
model-index:
- name: Distil-Whisper large-v3 De
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 15.0
      type: mozilla-foundation/common_voice_15_0
      args: 'Config: de'
    metrics:
    - type: wer
      value: 6.324
      name: Wer
---

# Distil-Whisper large-v3 German

This model is a knowledge-distilled version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the German subset of the [Common Voice 15.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_15_0) dataset. It was trained using the [Distil-Whisper training code](https://github.com/huggingface/distil-whisper/tree/main/training) on the knowledge-distillation objective, using the large-v3 model as the teacher.

It achieves the following WER results on the evaluation set:
- Normalised WER: 6.324
- Orthographic WER: 8.233

Full tensorboard logs can be found under the tab [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame), and steps to reproduce [here](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd#training-procedure).

## Model description

We copy the entire encoder module and freeze it during training. We copy only two decoder layers, which are initialised from the first and last decoder layers of Whisper. All other decoder layers from Whisper are discarded.

The model is trained on a knowledge distillation objective. Specifically, it is trained to minimise the KL divergence between the distilled model and the Whisper model, as well as the cross-entropy loss on the labelled Common Voice audio data. For more details, refer to the Distil-Whisper [repository](https://github.com/huggingface/distil-whisper/tree/main/training) and [paper](https://arxiv.org/abs/2311.00430).
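
The objective described above can be illustrated for a single token position. The following is a minimal sketch in plain Python, not the Distil-Whisper training code: the function names, the KL direction (teacher to student) and the default weights `kl_weight` and `ce_weight` are assumptions made for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_logits, label,
                      kl_weight=1.0, ce_weight=1.0):
    """Weighted KL + cross-entropy loss for one token position.

    kl_weight and ce_weight are hypothetical defaults, not values
    taken from this model card.
    """
    student_logp = log_softmax(student_logits)
    teacher_logp = log_softmax(teacher_logits)
    # KL(teacher || student): assumed direction, common in distillation
    kl = sum(math.exp(t) * (t - s) for t, s in zip(teacher_logp, student_logp))
    # Cross-entropy of the student against the ground-truth token id
    ce = -student_logp[label]
    return kl_weight * kl + ce_weight * ce, kl, ce


if __name__ == "__main__":
    total, kl, ce = distillation_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], label=0)
    print(f"total={total:.4f} kl={kl:.4f} ce={ce:.4f}")
```

The KL term vanishes when the student reproduces the teacher's distribution exactly, so the cross-entropy term is what ties the student to the Common Voice labels.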

## Training and evaluation data

The model was trained and evaluated on the German subset of the [Common Voice 15.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_15_0) dataset.

## Training procedure

To reproduce this training run, first clone and install Distil-Whisper according to the instructions [here](https://github.com/huggingface/distil-whisper/tree/main/training#requirements).

Next, we can pick a name for our distilled model, e.g. `distil-whisper-large-v3-de-kd`. We can then run the following command to create a repository under this name:

```bash
huggingface-cli repo create distil-whisper-large-v3-de-kd
```

We can now see the model on the Hub, e.g. under https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd

Let's clone the repository so that we can place our training script and model weights inside:

```bash
git lfs install
git clone https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd
```

**Note:** Be sure to change the repo address to `https://huggingface.co//`

Next, copy the relevant training scripts from Distil-Whisper to the repository:

```bash
cd distil-whisper-large-v3-de-kd

cp ../distil-whisper/training/create_student_model.py .
cp ../distil-whisper/training/run_distillation.py .
```

The following command demonstrates how to initialise a student model from the Whisper [large-v3](https://huggingface.co/openai/whisper-large-v3) checkpoint, with all 32 encoder layers and 2 decoder layers. The 2 student decoder layers are copied from teacher layers 1 and 32 respectively, as the maximally spaced layers:

```bash
#!/usr/bin/env bash

python create_student_model.py \
  --teacher_checkpoint "openai/whisper-large-v3" \
  --encoder_layers 32 \
  --decoder_layers 2 \
  --save_dir "./distil-large-v3-init"
```

The initialised model will be saved to the sub-directory `distil-large-v3-init` in our model repository, ready to be trained.
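
The "maximally spaced" layer choice can be made concrete with a small sketch. This is an illustration of the idea only, not the code in `create_student_model.py`: the helper name `maximally_spaced_layers` is hypothetical, and the choice to keep the last teacher layer when only one student layer is requested is an assumption.

```python
def maximally_spaced_layers(teacher_layers, student_layers):
    """Return 0-indexed teacher layer indices to copy into the student.

    Hypothetical helper: spreads the student layers evenly between the
    first and last teacher layers.
    """
    if not 1 <= student_layers <= teacher_layers:
        raise ValueError("student_layers must be between 1 and teacher_layers")
    if student_layers == 1:
        # Assumption: a single-layer student keeps the last teacher layer
        return [teacher_layers - 1]
    step = (teacher_layers - 1) / (student_layers - 1)
    return [round(i * step) for i in range(student_layers)]


if __name__ == "__main__":
    indices = maximally_spaced_layers(teacher_layers=32, student_layers=2)
    # Convert to the 1-indexed layer numbers used in the text
    print([i + 1 for i in indices])
```

For the configuration used here (32 teacher decoder layers, 2 student decoder layers), the helper selects the first and last teacher layers, matching the description above.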

We can then train the model for a total of 50k steps on the German subset of the Common Voice 15 dataset by executing the following command. Note that we train directly on the text labels provided in the Common Voice dataset, rather than first pseudo-labelling the dataset as was done in the original [Distil-Whisper paper](https://arxiv.org/abs/2311.00430):

```bash
#!/usr/bin/env bash

accelerate launch --mixed_precision=bf16 run_distillation.py \
  --model_name_or_path "./distil-large-v3-init" \
  --teacher_model_name_or_path "openai/whisper-large-v3" \
  --train_dataset_name "mozilla-foundation/common_voice_15_0" \
  --train_dataset_config_name "de" \
  --train_split_name "train" \
  --text_column_name "sentence" \
  --eval_dataset_name "mozilla-foundation/common_voice_15_0" \
  --eval_dataset_config_name "de" \
  --eval_split_name "validation" \
  --eval_text_column_name "sentence" \
  --eval_steps 5000 \
  --save_steps 5000 \
  --warmup_steps 500 \
  --learning_rate 1e-4 \
  --lr_scheduler_type "linear" \
  --logging_steps 25 \
  --save_total_limit 1 \
  --max_steps 50000 \
  --per_device_train_batch_size 64 \
  --per_device_eval_batch_size 64 \
  --dataloader_num_workers 16 \
  --preprocessing_num_workers 16 \
  --ddp_timeout 7200 \
  --dtype "bfloat16" \
  --output_dir "./" \
  --use_pseudo_labels "false" \
  --condition_on_prev_probability "0.0" \
  --do_train \
  --do_eval \
  --gradient_checkpointing \
  --overwrite_output_dir \
  --predict_with_generate \
  --freeze_encoder \
  --streaming \
  --push_to_hub
```

On a single 80GB A100 GPU, training will take approximately 3.5 days (or 85 hours), and reach a final WER of 6.3%. Tensorboard logs can be found under the tab [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame). Note that training for longer would likely have improved the final WER performance further, since the model had not fully converged after 50k train steps.
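
To see what the `--learning_rate`, `--warmup_steps`, `--max_steps` and `--lr_scheduler_type "linear"` flags imply together, the sketch below computes the learning rate at a given step. It assumes the `linear` scheduler is the usual linear warmup followed by linear decay to zero; the function name `linear_schedule_lr` is hypothetical and is not part of the training script.

```python
def linear_schedule_lr(step, peak_lr=1e-4, warmup_steps=500, max_steps=50_000):
    """Learning rate at `step` for linear warmup then linear decay.

    Defaults mirror the flags in the command above; the schedule shape
    itself is an assumption about the "linear" scheduler.
    """
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup phase
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at warmup_steps to 0 at max_steps
    remaining = max(0.0, (max_steps - step) / (max_steps - warmup_steps))
    return peak_lr * remaining


if __name__ == "__main__":
    for step in (0, 250, 500, 25_000, 50_000):
        print(step, linear_schedule_lr(step))
```

Under this assumption the learning rate peaks at step 500 and has decayed to zero by step 50,000, so extending training would require raising `--max_steps` rather than resuming with the same schedule.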

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-04
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 50000
- mixed_precision_training: Native AMP

### Training results

Tensorboard logs can be found under the tab [Training Metrics](https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd/tensorboard?params=scalars#frame).

### Framework versions

- Transformers 4.36.0.dev0
- Pytorch 2.1.2+cu121
- Datasets 2.14.7.dev0
- Tokenizers 0.14.1