This model was forked from the original OpenAI whisper model.

Whisper

Model

Whisper is a multi-lingual speech-to-text model. It takes in raw audio recordings from many languages and outputs transcriptions in the language of origin or translated to english. The model first converts speech to spectrograms, then uses an auto-regressive transformer to decode the speech to text. Here is an overview of the architecture:

For more information on the technical implementations, consult the paper.

Training Data

The model was trained on 680 000 hours of audio and associated transcripts trained from the internet. The majority of the audio is in english (~65%) while the remainder is in other languages. A total of 98 different languages were used in the dataset.

Model Variations

OpenAI has released 9 different versions of the model, trained either on english-only audio or on multilingual data.

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~32x
base	74 M	`base.en`	`base`	~1 GB	~16x
small	244 M	`small.en`	`small`	~2 GB	~6x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x

Limitations and bias

In the paper, they find a direct corelation between performance on a given language and the amount of data available in the dataset. As such, languages that are under-represented in the scraped dataset perform less well in whisper. Because english is much more prevalent than other languages, the model will likely perform better in english. This is shown in the following figure, where a lower word error rate (WER) indicates a better performance:

jerpint
/

whisper

Whisper

Model

Training Data

Model Variations

Limitations and bias

Space using jerpint/whisper 1