Model Discussion

#2
by sanchit-gandhi - opened

Hey @Aspik101 ! Super cool to see that you distilled Whisper large-v3 on Polish! Out of curiosity, did you use pseudo-label targets? Or the text labels from the Common Voice dataset? I tried an experiment distilling large-v3 for German directly on the text labels provided in the Common Voice dataset (not pseudo-labels): https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd

=> this gave a model with 6.3% WER on the CV German dev set, which was ~1.5% lower WER than using shrink and fine-tune. Therefore, this approach forms a nice intermediate between fine-tuning and full distillation: we use the KD objective, but the text labels from the CV dataset. It’s quite quick to get working, since we can skip the lengthy pseudo-labelling step, but quite clearly outperforms a simple shrink and fine-tune, so forms a nice baseline on the way to full distillation (using pseudo-labels).
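
For readers unfamiliar with the KD objective mentioned above, here is a minimal sketch of the idea: a weighted sum of a cross-entropy term against the text labels and a KL-divergence term that pulls the student's token distribution towards the teacher's. It uses plain Python rather than PyTorch so that it is self-contained. The function names and the default weights (`ce_weight`, `kl_weight`) are assumptions for illustration, not the values or code used in the run above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kd_loss(student_logits, teacher_logits, labels, ce_weight=0.8, kl_weight=1.0):
    """Hypothetical KD loss: weighted CE on text labels plus KL(teacher || student).

    student_logits / teacher_logits: one list of vocabulary logits per token.
    labels: one target token id per token.
    The default weights are illustrative assumptions.
    """
    ce_total, kl_total = 0.0, 0.0
    for s_row, t_row, label in zip(student_logits, teacher_logits, labels):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Cross-entropy of the student against the ground-truth text label
        ce_total += -s_logp[label]
        # KL divergence from the teacher distribution to the student distribution
        kl_total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    n = len(labels)
    return ce_weight * ce_total / n + kl_weight * kl_total / n
```

When the student matches the teacher exactly, the KL term vanishes and only the weighted cross-entropy remains. With pseudo-labels, `labels` would come from the teacher's transcriptions instead of the dataset text.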

I left some detailed instructions for reproducing the run here: https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd#training-procedure
You can simply swap the language ("de") for your language of choice.

Of course, if you've done full KD and trained on the pseudo-label targets, you'll probably have achieved similar (if not better) performance to using the text label targets with KD! Interested in hearing what objective you used, and also what datasets you trained on!

If you're interested in discussing more about training tips, definitely join our Whisper Distillation Slack channels. Details can be found in this LinkedIn post:
https://www.linkedin.com/posts/sanchit-gandhi_distil-whisper-training-code-now-available-activity-7131004471806980096-FYmQ?utm_source=share&utm_medium=member_desktop

Hi sanchit.
I used pseudo-label targets. The model was trained on the Common Voice 13, FLEURS, and VoxPopuli datasets. I mention it in the post https://www.linkedin.com/feed/update/urn:li:activity:7135694818164289536.
However, I will also check your approach ;)

While we're talking, did you manage to get timestamps for the calls as well? I can't get this working with my model...

Very cool! Thanks for sharing details @Aspik101 !

For timestamps, we need to ensure three things:

  1. Pseudo-label the transcription with timestamps: set --return_timestamps=True in the pseudo-labelling step here: https://github.com/huggingface/distil-whisper/tree/main/training#1-pseudo-labelling
  2. Train on the transcriptions with timestamps: set --timestamp_probability=0.5 in the training step here: https://github.com/huggingface/distil-whisper/tree/main/training#3-training (note that it defaults to 0.2, which should be sufficient for training the model on the timestamp task)
  3. Run inference with timestamps enabled: see how we set return_timestamps=True in the call to the pipeline:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Aspik101/distil-whisper-large-v3-pl"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("mozilla-foundation/common_voice_16_0", "pl", split="validation", streaming=True)
sample = next(iter(dataset))["audio"]

result = pipe(sample, return_timestamps=True)
print(result["chunks"])
```

Print Output:

[{'timestamp': (0.0, 3.98), 'text': ' odpowiedział zwak i wzruszył ramionami.'}]

=> looks like you followed the pseudo-labelling and training steps as required! You could try increasing the timestamp probability to give more training examples with timestamps
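
To illustrate what the timestamp probability controls, here is a rough sketch of the idea: for each training example, the label keeps its timestamp tokens with probability p and has them stripped otherwise, so a higher p means more examples that teach the timestamp task. This is only an illustration under stated assumptions, not the code from the training script; the helper name `choose_label` and the regex for `<|0.00|>`-style tokens are hypothetical.

```python
import random
import re

# Assumed textual form of Whisper timestamp tokens, e.g. "<|0.00|>" or "<|3.98|>"
TIMESTAMP_TOKEN = re.compile(r"<\|\d+\.\d+\|>")


def choose_label(text_with_timestamps, timestamp_probability, rng):
    """Hypothetical helper: keep timestamps with the given probability, else strip them."""
    if rng.random() < timestamp_probability:
        return text_with_timestamps
    return TIMESTAMP_TOKEN.sub("", text_with_timestamps)


rng = random.Random(0)
label = "<|0.00|> odpowiedział zwak i wzruszył ramionami.<|3.98|>"
# Count how many of 1000 draws keep their timestamps at p = 0.5
kept = sum(choose_label(label, 0.5, rng) == label for _ in range(1000))
print(f"{kept} of 1000 labels kept their timestamps")
```

Raising the probability from 0.2 to 0.5 would, under this scheme, expose the model to timestamped targets in roughly half of the training examples instead of a fifth.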

@sanchit-gandhi Thanks a lot!

Hi @Aspik101 thanks for sharing your work on this. It would be great to see even more experimentation in this space, so thank you for kickstarting it.

Did you find that this model generalizes well for your use case? I tried a couple of my recordings, and it seems like the standard OpenAI Whisper medium model produces more accurate (and more extensive) results for a similar model size and processing time. Is this model something you plan to use in production, or was it mostly an experiment?

Hi @ephemer, the model was trained mainly on the Common Voice dataset, which contains almost exclusively short recordings. In my opinion, this model needs to be trained on longer recordings. According to my tests, this model runs faster than Whisper medium; you can check here: https://lnkd.in/dZhDc9Ra
However, if you want to use it in your own process, I think you would need to fine-tune this model on data that is more diverse and more difficult than Common Voice.

@Aspik101 thanks for the links and especially for the Colab with the comparisons, that's really useful. Yes we're missing good long-form datasets for better transcriptions generally for multilingual work.

Training on longer audio utterances (i.e. ones packed to 30 seconds) will improve the model's ability to operate in long-form transcription mode (i.e. inference on audio longer than 30 seconds), c.f. this info. As @Aspik101 mentioned, training on more diverse data will improve the model's ability to generalise to different distributions of audio data (we used 22k hours from 10 distributions for training distil-large-v3). There are plenty of tips in the Distil-Whisper repo for achieving this! https://github.com/huggingface/distil-whisper
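
As a rough illustration of what "packing to 30 seconds" could look like, the sketch below greedily groups consecutive utterances so that each group stays within a 30-second budget. It only plans which utterances go together; concatenating the audio arrays and transcripts is left out. The function name `pack_utterances` and the greedy strategy are assumptions for illustration and may differ from what the Distil-Whisper repo does.

```python
def pack_utterances(durations, max_seconds=30.0):
    """Greedily group consecutive utterance indices so each group fits in max_seconds.

    durations: utterance lengths in seconds, in dataset order.
    Returns a list of index lists; an utterance longer than max_seconds
    ends up alone in its own group.
    """
    packs, current, current_len = [], [], 0.0
    for idx, dur in enumerate(durations):
        # Start a new pack if adding this utterance would exceed the budget
        if current and current_len + dur > max_seconds:
            packs.append(current)
            current, current_len = [], 0.0
        current.append(idx)
        current_len += dur
    if current:
        packs.append(current)
    return packs


# Utterances 0 and 1 fit together (25 s); the rest each start a new pack
print(pack_utterances([10.0, 15.0, 8.0, 30.0, 4.0]))
```

In practice one would probably also want to avoid packing across speaker or recording boundaries, but that depends on the metadata available in the dataset.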
