
Distil-Whisper: Distil-Large-v3.5

Distil-Whisper is the knowledge-distilled version of OpenAI's Whisper-Large-v3, described in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. As the newest addition to the Distil-Whisper English family, Distil-Large-v3.5 maintains the high efficiency of its predecessors while delivering better performance.

Compared to earlier models, Distil-Large-v3.5 has been trained on over 4× more diverse public data (98k hours) and uses a "patient" teacher with an extended training schedule and aggressive data augmentation (SpecAugment) during distillation. This results in enhanced robustness and accuracy compared to previous Distil-Whisper models, making it suitable as a drop-in replacement.

| Model             | Params / M | Rel. RTFx | Short-Form OOD WER | Long-Form OOD WER |
|-------------------|-----------:|----------:|-------------------:|------------------:|
| large-v3-turbo    | 809        | 1.0       | 7.30               | 10.25             |
| distil-large-v3   | 756        | 1.44      | 7.53               | 11.6              |
| distil-large-v3.5 | 756        | 1.46      | 7.08               | 11.39             |

Why consider Distil-Large-v3.5 when Whisper-Large-v3-Turbo already exists?

  1. It offers a different balance between accuracy and efficiency: it remains ~1.5x faster than Whisper-Large-v3-Turbo while performing slightly better on short-form transcription and falling only ~1% behind on long-form transcription.
  2. It works well as a draft model for speculative decoding with Whisper-Large-v3. Because the encoder was kept frozen during distillation, only two extra decoder layers need to be loaded and the encoder is forwarded just once. This achieves ~2x faster inference than Whisper-Large-v3 alone while guaranteeing identical outputs.
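As a sanity check, the "slightly better on short-form" and "~1% behind on long-form" claims can be verified directly from the table above. The snippet below is plain arithmetic on the published numbers, not a new benchmark:

```javascript
// WER values copied from the table above (lower is better).
const wer = {
  'large-v3-turbo':    { short: 7.30, long: 10.25 },
  'distil-large-v3.5': { short: 7.08, long: 11.39 },
};

// Short-form: distil-large-v3.5 is slightly ahead of large-v3-turbo.
const shortDelta = wer['large-v3-turbo'].short - wer['distil-large-v3.5'].short;
console.log(shortDelta.toFixed(2)); // 0.22 absolute WER better

// Long-form: distil-large-v3.5 trails large-v3-turbo by ~1 absolute WER.
const longDelta = wer['distil-large-v3.5'].long - wer['large-v3-turbo'].long;
console.log(longDelta.toFixed(2)); // 1.14
```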

This model is a 🤗 collaborative effort between Bofeng Huang, Eustache Le Bihan, Steven Zheng, Vaibhav Srivastav, and Joshua Lochner.

Usage (Transformers.js)

If you haven't already, you can install the Transformers.js JavaScript library from NPM using:

```shell
npm i @huggingface/transformers
```

You can then transcribe audio as follows:

```javascript
import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'distil-whisper/distil-large-v3.5-ONNX');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." }
```
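
When the pipeline is called with `{ return_timestamps: true }`, the result additionally contains a `chunks` array of timestamped segments. The sketch below assumes the chunk shape `{ text, timestamp: [start, end] }` used by Transformers.js and formats it into readable lines; the sample data is illustrative, not actual model output:

```javascript
// Minimal sketch: format timestamped chunks into readable lines.
// Assumes the { text, timestamp: [start, end] } chunk shape returned
// when calling the pipeline with { return_timestamps: true }.
function formatChunks(chunks) {
  return chunks
    .map(({ timestamp: [start, end], text }) =>
      `[${start.toFixed(2)} -> ${end.toFixed(2)}]${text}`)
    .join('\n');
}

// Illustrative sample data, not real model output.
const sampleChunks = [
  { timestamp: [0.0, 5.5], text: ' And so, my fellow Americans,' },
  { timestamp: [5.5, 11.0], text: ' ask not what your country can do for you.' },
];

console.log(formatChunks(sampleChunks));
// [0.00 -> 5.50] And so, my fellow Americans,
// [5.50 -> 11.00] ask not what your country can do for you.
```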