Māori language governance & removal

#12
by mmitchell - opened

Were these models trained with te reo Māori?
If so, I propose:
(1) Working with Māori to clarify how their data can/cannot be used
(2) Re-training and re-releasing this model (and the other versions of it) with the Māori language removed from the training set

For reference on concerns with Whisper, see https://blog.papareo.nz/whisper-is-another-case-study-in-colonisation/ , which notes "The Whisper model was trained with 1381 hours of te reo Māori" without consent.

This request is in (my likely botched interpretation of) the spirit of 'kaitiakitanga' (https://papareo.nz/#kaitiakitanga) for using te reo Māori in technology.

Hey @mmitchell ! This is an English-only version of the Whisper model (tiny.en). The multilingual version of the model can be found at openai/whisper-tiny. The model card describes the various checkpoints and whether they're English-only / multilingual: openai/whisper-tiny.en#model-details

The multilingual checkpoints are indeed trained on ~1.4k hours of Māori data (see page 27 of the Whisper paper). The data was obtained in the same way as the other 679k hours of speech data (by scraping the web). Unfortunately, we don't have access to the training dataset, so re-training the model is not possible.

Hi Sanchit! Indeed, this particular (sub-)model is English-only; apologies, I should have clarified. This (sub-)model is the most popular Whisper download, and hence the one most likely to get the attention needed for the multiple relevant models being shared. Let me know if it would be helpful to link to all of them here. I opted not to post the same message on each relevant (sub-)model, and the spam filters would in any case block me from doing so, but if it would help, I can do it over several days.

More important on my end is your answer at the end: If this is a trained model that you are sharing, then who is the "we" here you refer to, where is the training dataset, and who does have access to it?

Thanks!

Hey @mmitchell ! That makes sense, thanks for clarifying! No need to copy across to other checkpoints - I just wanted to flag this to make sure that it was indeed the multilingual checkpoints we were referring to 🤗

OpenAI curated the dataset by web scraping the internet for audio data. Unfortunately, these are all the details we know about it from the Whisper paper. They subsequently trained 11 speech recognition models on this data. They released all of the models, but not the dataset. The dataset remains private and AFAIK is not going to be made open-source. So OpenAI are the only ones with access to it and the only ones who can re-train the model.

There might be another way to stop the model from producing Māori predictions. The format of Whisper transcriptions is as follows:

```
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> Hey this is some transcribed text! <|endoftext|>
```

The second token (`<|en|>`) is the language id token. This token informs Whisper which language to make its predictions in. If the second predicted token is `<|en|>`, Whisper will transcribe in English. Likewise, if the second predicted token is `<|es|>`, the model will transcribe in Spanish, and so on...

If you want to stop the model predicting Māori transcriptions, you could explore forcing the log probability for the Māori language id token to -inf. This way, the model will never predict Māori as the transcription language, and thus will not generate Māori transcriptions.

You can achieve this by passing the Māori language id as `bad_words_ids` to the generate function: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.bad_words_ids

The `.generate` method will then take care of setting the log prob for Māori to -inf as required.
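To make the mechanism concrete, here is a minimal self-contained sketch of what that suppression does at the logit level. The token ids below are illustrative placeholders, not real Whisper vocabulary ids (in practice you would look up the real id, e.g. via something like `tokenizer.convert_tokens_to_ids("<|mi|>")`, and pass it as `bad_words_ids=[[mi_id]]` to `generate`):

```python
import numpy as np

# Illustrative (made-up) ids for a few language tokens; the real ids
# come from the Whisper tokenizer vocabulary.
LANG_TOKEN_IDS = {"<|en|>": 0, "<|es|>": 1, "<|mi|>": 2}

def suppress_tokens(logits, bad_token_ids):
    """Set the logits of banned tokens to -inf so neither greedy
    decoding nor sampling can ever select them."""
    masked = np.array(logits, dtype=float)
    masked[bad_token_ids] = -np.inf
    return masked

# Toy logits where the Maori token (<|mi|>) would otherwise win argmax.
logits = [1.0, 0.5, 3.0]
masked = suppress_tokens(logits, [LANG_TOKEN_IDS["<|mi|>"]])
predicted = int(np.argmax(masked))  # now selects <|en|> (id 0) instead
```

This is the same idea `generate` applies internally when given `bad_words_ids`: the banned token's probability becomes zero, so the decoder is forced to pick another language token.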

Aloha @sanchit-gandhi. I do appreciate your responses and consideration for our concerns.

Do you know much about how FLEURS was obtained? This is a key dataset used in Whisper with similar issues around the lack of transparency on how the data was obtained.

It's not that we want access to the data used, we want to know what data was used, where it came from, and how it was obtained.

Hey @mahelona ! There's a very transparent paper on the FLEURS dataset that you can check out: https://arxiv.org/abs/2205.12446

In summary, the data is derived from the FLoRes-101 dataset, a machine translation corpus with 3001 sentence translations from English to 101 other languages. Native speakers are recorded narrating the sentence transcriptions in their native language. The recorded audio data is paired with the sentence transcriptions to yield multilingual speech recognition over all 101 languages. This corpus was used for Whisper evaluation (but not training). It's available on the Hugging Face Hub here: https://huggingface.co/datasets/google/fleurs
