Which language was used as the target language?

#1
by dash8x - opened

Whisper doesn't support Dhivehi out of the box, so did you make a new vocabulary for Dhivehi, or did you train using some other target language? I'd appreciate it if you could give some details on the training setup.
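For context, the workaround I've seen documented elsewhere (e.g. in the Hugging Face audio course) is to reuse the language token of a closely related supported language, such as Sinhala. A minimal sketch of that approach, using the `openai/whisper-small` checkpoint purely as an example:

```python
from transformers import WhisperProcessor

# Sketch of one possible workaround (not necessarily what this model did):
# Whisper's tokenizer has no Dhivehi ("dv") language token, so the training
# labels are built under the token of a related supported language instead -
# here Sinhala - and the model learns to emit Dhivehi for that token.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="sinhalese", task="transcribe"
)
```

Is that roughly what was done here, or did you extend the vocabulary?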

@sanchit-gandhi Thank you for explaining the training process. Seems like you're using transformers.models.whisper.english_normalizer to normalize the labels and predictions for WER calculation. This results in a misleading WER value for Dhivehi, as the English Normalizer removes the vowel diacritics from the text and adds spaces between the letters inside words, making it essentially unreadable. (I'm a native Dhivehi speaker).

Please refer to the image below to see what the Whisper normalizer does to Dhivehi text. To read Dhivehi, you need both the consonants and the vowel diacritics.

[Image: a sample of Dhivehi text before and after normalization - the vowel diacritics are stripped and the remaining consonants are separated by spaces]
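For a text-only reproduction, here's a minimal sketch using the BasicTextNormalizer from transformers.models.whisper.english_normalizer with its default settings (the output shown in the comment is approximate):

```python
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

text = "ދިވެހި"  # "Dhivehi" in Thaana script: consonants plus vowel diacritics
print(normalizer(text))
# Thaana vowel signs are Unicode nonspacing marks (category Mn), which the
# normalizer replaces with spaces, leaving roughly "ދ ވ ހ" - bare consonants
# separated by spaces, which is unreadable as Dhivehi.
```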

Hey @dash8x - we're using the BasicTextNormalizer to normalise the references and predictions to compute the normalised WER, see https://huggingface.co/learn/audio-course/chapter5/fine-tuning#evaluation-metrics. This is the 'official' multilingual normaliser released with the Whisper package. We use this normaliser to give one-to-one comparable results to the Whisper paper.

It has been highlighted on the original Whisper repository that this multilingual normaliser is too stringent, removing diacritics amongst other characters: https://github.com/openai/whisper/discussions/858. The Whisper authors haven't replied there with any update, but we strive to give results that are one-to-one comparable with those quoted in the paper, so we'll maintain equivalence unless they update it. Feel free to raise this issue directly with the Whisper authors if it's something you want to see changed on the official model.

You're also free to ignore the normalised WER calculation and only look at the orthographic WER. Arguments for using the orthographic WER can be found in this paper: https://arxiv.org/abs/2210.13352. I personally give as much importance to the orthographic WER as to the normalised one - if you find the normalised transcriptions lose too much information, you can use the orthographic WER exclusively as a benchmark for fine-tuning performance.
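If you do go that route, here's a minimal sketch of reporting both metrics side by side, following the compute-metrics pattern from the audio course chapter linked above (the Dhivehi strings are placeholders, not real model outputs):

```python
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()

# Placeholder reference/prediction pair, purely for illustration
references = ["ދިވެހި ބަސް"]
predictions = ["ދިވެހި"]

# Orthographic WER: computed on the raw text, vowel diacritics intact
wer_ortho = 100 * wer_metric.compute(predictions=predictions, references=references)

# Normalised WER: both sides passed through the multilingual normaliser,
# which strips the Thaana vowel signs
wer_norm = 100 * wer_metric.compute(
    predictions=[normalizer(p) for p in predictions],
    references=[normalizer(r) for r in references],
)

print(f"orthographic WER: {wer_ortho:.1f}%, normalised WER: {wer_norm:.1f}%")
```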
