Multilingual distribution of Whisper training data

#110
by rdzotz

Hey, has anyone found a reliable reference for the language distribution of the 90+ other languages Whisper was trained on? I want to understand the relative distribution of tokens across that set.

I'm not sure this data was ever shared broken down by individual language (though it would be interesting to know).

But the paper gives some good clues. Page 2 says: "Of those 680,000 hours of audio, 117,000 hours cover 96 other languages." ("Other" here means other than English.)

And on page 14, when talking about possible future work, there's this: "Since our pre-training dataset is currently very English-heavy due to biases of our data collection pipeline, which sourced primarily from English-centric parts of the internet, most languages have less than 1000 hours of training data."

Not a proper breakdown, but at least we know that most languages have fewer than 1,000 hours of training data, and that the 96 non-English languages combined add up to 117,000 hours.
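For what it's worth, here's a small back-of-envelope sketch using only the aggregate figures quoted above; the per-language numbers are not public, so this just shows what the totals imply (the variable names and percentages are my own, not from the paper):

```python
# Aggregate figures quoted from the Whisper paper (Radford et al.);
# the per-language breakdown was never published, so this is only a sanity check.
total_hours = 680_000          # total audio in the training set
other_hours = 117_000          # hours covering the 96 non-English languages
num_other_languages = 96

english_hours = total_hours - other_hours          # 563,000 hours implied for English
mean_other = other_hours / num_other_languages     # ~1,219 hours per non-English language

print(f"English (implied):        {english_hours:,} h (~{english_hours / total_hours:.0%} of the data)")
print(f"Other languages combined: {other_hours:,} h (~{other_hours / total_hours:.0%})")
print(f"Mean per non-English language: {mean_other:,.0f} h")
```

The mean (~1,200 hours) sits above the paper's "most languages have less than 1000 hours" statement, which suggests a skewed distribution: a few higher-resource languages take a large share of the 117k hours while the long tail gets far less.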
