Edit model card

This model was forked from the original OpenAI whisper model.

Whisper

Model

Whisper is a multi-lingual speech-to-text model. It takes in raw audio recordings from many languages and outputs transcriptions in the language of origin or translated to english. The model first converts speech to spectrograms, then uses an auto-regressive transformer to decode the speech to text. Here is an overview of the architecture:

model_architecure

For more information on the technical implementations, consult the paper.

Training Data

The model was trained on 680 000 hours of audio and associated transcripts trained from the internet. The majority of the audio is in english (~65%) while the remainder is in other languages. A total of 98 different languages were used in the dataset.

image

Model Variations

OpenAI has released 9 different versions of the model, trained either on english-only audio or on multilingual data.

Size Parameters English-only model Multilingual model Required VRAM Relative speed
tiny 39 M tiny.en tiny ~1 GB ~32x
base 74 M base.en base ~1 GB ~16x
small 244 M small.en small ~2 GB ~6x
medium 769 M medium.en medium ~5 GB ~2x
large 1550 M N/A large ~10 GB 1x

Limitations and bias

In the paper, they find a direct corelation between performance on a given language and the amount of data available in the dataset. As such, languages that are under-represented in the scraped dataset perform less well in whisper. Because english is much more prevalent than other languages, the model will likely perform better in english. This is shown in the following figure, where a lower word error rate (WER) indicates a better performance:

model_performance

Downloads last month
0
Inference Examples
Unable to determine this model's library. Check the docs .

Space using jerpint/whisper 1