by sanchit-gandhi HF staff - opened

Hey @Aspik101 ! Super cool to see that you distilled Whisper large-v3 on Polish! Out of curiosity, did you use pseudo-label targets? Or the text labels from the Common Voice dataset? I tried an experiment distilling large-v3 for German directly on the text labels provided in the Common Voice dataset (not pseudo-labels):

=> this gave a model with 6.3% WER on the CV German dev set, which was ~1.5% lower WER than using shrink and fine-tune. Therefore, this approach forms a nice intermediate between fine-tuning and full distillation: we use the KD objective, but the text labels from the CV dataset. It’s quite quick to get working, since we can skip the lengthy pseudo-labelling step, but quite clearly outperforms a simple shrink and fine-tune, so forms a nice baseline on the way to full distillation (using pseudo-labels).
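To make the difference between the two setups concrete, here is a minimal pure-Python sketch of a KD-style objective: a cross-entropy term on the target tokens plus a KL term pulling the student towards the teacher. The function names and the loss weights (`ce_weight`, `kl_weight`) are illustrative assumptions, not the values or code from the actual training script. The only thing that changes between "KD on text labels" and "full KD" is where `label_ids` comes from (dataset transcriptions vs. teacher pseudo-labels).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label_ids, ce_weight=1.0, kl_weight=1.0):
    """Mean per-token loss: ce_weight * CE(student, label) + kl_weight * KL(teacher || student).

    The weights are hypothetical defaults chosen for illustration only.
    """
    total = 0.0
    for s_logits, t_logits, label in zip(student_logits, teacher_logits, label_ids):
        s_probs = softmax(s_logits)
        t_probs = softmax(t_logits)
        # Cross-entropy against the target token (text label or pseudo-label)
        ce = -math.log(s_probs[label])
        # KL divergence from the teacher distribution to the student distribution
        kl = sum(t * math.log(t / s) for t, s in zip(t_probs, s_probs) if t > 0)
        total += ce_weight * ce + kl_weight * kl
    return total / len(label_ids)


# Toy example: two decoding steps over a three-token vocabulary
student = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
teacher = [[2.5, 0.3, 0.1], [0.1, 2.0, 0.2]]
labels = [0, 1]  # token ids from the dataset transcription (or from pseudo-labels)
loss = kd_loss(student, teacher, labels)
```

When the student matches the teacher exactly, the KL term vanishes and only the cross-entropy on the labels remains.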

I left some detailed instructions for reproducing the run here:
You can simply swap the language ("de") for your language of choice.

Of course, if you've done full KD and trained on the pseudo-label targets, you'll probably have achieved similar (if not better) performance to using the text label targets with KD! Interested in hearing what objective you used, and also what datasets you trained on!
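For comparing the two setups on the same dev set, WER is just the word-level edit distance divided by the number of reference words. A rough sketch, assuming plain whitespace tokenisation and no text normalisation (a real evaluation would normalise both strings first); the `wer` helper is hypothetical:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
score = wer("a b c d", "a x c d")
```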

If you're interested in discussing more about training tips, definitely join our Whisper Distillation Slack channels. Details can be found in this LinkedIn post:

Hi sanchit.
I used pseudo-label targets. The model was trained on the Common Voice 13, FLEURS, and VoxPopuli datasets. I mention it in the post.
However, I will also check your approach ;)

While we're talking, did you manage to get the timestamp for the calls as well? I can't do this in my model...

Very cool! Thanks for sharing details @Aspik101 !

For timestamps, we need to ensure three things:

  1. Pseudo-label the transcription with timestamps: set --return_timestamps=True in the pseudo-labelling step here:
  2. Train on the transcriptions with timestamps: set --timestamp_probability=0.5 in the training step here: (note that it defaults to 0.2, which should be sufficient for training the model on the timestamp task)
  3. Inference with timestamps enabled: see how we set return_timestamps=True in the call to the pipeline:
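import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Use the GPU with half precision when available, otherwise fall back to CPU / fp32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Aspik101/distil-whisper-large-v3-pl"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Standard ASR pipeline built from the model and its processor
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Stream a single validation sample so the full dataset is not downloaded
dataset = load_dataset("mozilla-foundation/common_voice_16_0", "pl", split="validation", streaming=True)
sample = next(iter(dataset))["audio"]

# return_timestamps=True adds segment-level timestamps under the "chunks" key
result = pipe(sample, return_timestamps=True)
print(result["chunks"])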

Print Output:

[{'timestamp': (0.0, 3.98), 'text': ' odpowiedział zwak i wzruszył ramionami.'}]
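If it helps, the chunk list above can be turned into readable lines with a small helper. This is only a sketch: `format_chunks` is a hypothetical helper, and the `None` handling assumes the end time of the last segment can be missing when the audio is cut off mid-segment.

```python
def format_chunks(chunks):
    """Turn pipeline-style chunks into '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # Assumption: the end timestamp may be None for a truncated final segment
        end_str = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} -> {end_str}] {chunk['text'].strip()}")
    return lines


chunks = [{'timestamp': (0.0, 3.98), 'text': ' odpowiedział zwak i wzruszył ramionami.'}]
for line in format_chunks(chunks):
    print(line)
# prints: [0.00 -> 3.98] odpowiedział zwak i wzruszył ramionami.
```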

=> Looks like you followed the pseudo-labelling and training steps as required! You could try increasing the timestamp probability to give more training examples with timestamps.
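For intuition on what the timestamp probability controls, here is a rough text-level sketch: each training target keeps its timestamp tokens with probability `timestamp_probability` and has them stripped otherwise. The `prepare_target` helper and the regex are illustrative assumptions; the real training script may do this differently (for example on token ids rather than strings).

```python
import random
import re

# Whisper-style timestamp tokens look like <|0.00|>, <|3.98|>, ...
TIMESTAMP_TOKEN = re.compile(r"<\|\d+\.\d+\|>")


def prepare_target(pseudo_label, timestamp_probability=0.2, rng=random):
    """Keep the timestamp tokens with the given probability, otherwise strip them."""
    if rng.random() < timestamp_probability:
        return pseudo_label
    return TIMESTAMP_TOKEN.sub("", pseudo_label)


label = "<|0.00|> odpowiedział zwak i wzruszył ramionami.<|3.98|>"

# Raising the probability from 0.2 to 0.5 means roughly half of the
# training targets carry timestamps instead of roughly a fifth
rng = random.Random(0)
kept = sum(
    prepare_target(label, timestamp_probability=0.5, rng=rng) == label
    for _ in range(10_000)
)
```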

@sanchit-gandhi Thanks a lot!

