Whisper Small Model Card

Whisper Small is a pre-trained model for automatic speech recognition (ASR) and speech translation. It is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision. The model has 244 million parameters and is multilingual.

Performance

Whisper Small achieves high accuracy and generalizes well to many datasets and domains without fine-tuning.

Usage

To transcribe audio samples, the model has to be used alongside a WhisperProcessor. The WhisperProcessor is used to pre-process the audio inputs (converting them to log-Mel spectrograms for the model) and post-process the model outputs (converting them from tokens to text).
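The pre-process, generate, post-process flow described above can be sketched as follows. This is a minimal sketch, not the author's exact pipeline: it assumes the `transformers` and `torch` packages are installed, uses the base `openai/whisper-small` checkpoint as a stand-in model ID, and feeds a synthetic tone from the hypothetical helper `placeholder_audio` purely so the example has an input.

```python
import math

SAMPLING_RATE = 16_000  # Whisper's feature extractor expects 16 kHz mono audio


def placeholder_audio(seconds=1.0, freq_hz=440.0, rate=SAMPLING_RATE):
    """Hypothetical helper: a mono test tone as a list of floats in [-1, 1]."""
    n = int(seconds * rate)
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


def transcribe(audio, model_id="openai/whisper-small"):
    """Transcribe one mono waveform sampled at 16 kHz.

    model_id is an assumption; substitute the checkpoint you actually use.
    """
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    # Pre-process: raw waveform -> log-Mel spectrogram features.
    inputs = processor(audio, sampling_rate=SAMPLING_RATE, return_tensors="pt")

    # Generate token IDs with the encoder-decoder model.
    predicted_ids = model.generate(inputs.input_features)

    # Post-process: token IDs -> text.
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]


if __name__ == "__main__":
    # Downloads the checkpoint on first use.
    print(transcribe(placeholder_audio()))
```

In practice the waveform would come from a real recording, resampled to 16 kHz before being passed to the processor.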

Model Details

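Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. The Small checkpoint was trained on 680k hours of labelled speech data; the larger figure of 1 million hours of weakly labeled audio plus 4 million hours of pseudolabeled audio collected using Whisper large-v2 applies to the later large-v3 checkpoint, not to this model.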

The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts a transcription in the same language as the audio. For speech translation, the model predicts a transcription in a different language from the audio.
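A multilingual checkpoint is told which of the two tasks to perform through special tokens at the start of the decoder sequence. The sketch below builds that prefix as plain strings, assuming the token format used by the multilingual Whisper tokenizer (`<|startoftranscript|>`, a language token, a task token, and optionally `<|notimestamps|>`). The function name `decoder_prompt` is hypothetical and is not part of any library.

```python
VALID_TASKS = ("transcribe", "translate")


def decoder_prompt(language: str, task: str, timestamps: bool = False) -> list[str]:
    """Build the special-token prefix that selects language and task.

    language: a short code such as "en" or "fr" (the language of the audio).
    task: "transcribe" for same-language output, "translate" for translation.
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the model to emit text only, without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


if __name__ == "__main__":
    print(decoder_prompt("fr", "transcribe"))
    print(decoder_prompt("fr", "translate"))
```

With the `transformers` library these tokens are normally set for you by the processor or by generation arguments, so building them by hand is only useful for understanding what the model conditions on.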

Uses

Transcription
Translation

Training hyperparameters

learning_rate: 1e-5
train_batch_size: 8
eval_batch_size: 8
lr_scheduler_warmup_steps: 500
max_steps: 4000
metric_for_best_model: wer
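
The `wer` value above refers to word error rate, the metric used to select the best checkpoint. As an illustration only, a minimal pure-Python sketch of the metric follows; it assumes whitespace tokenization and no text normalization, which may differ from the evaluation code actually used for this model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.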
