Will a spell corrector on Whisper output be worth it?

#62
by MightyStudent - opened

I have trained Whisper using PEFT on a small Egyptian Arabic dataset (a dialect of Arabic). The results were fine: I managed to achieve a WER of 40% (vanilla Whisper did 61%). Most of the mistakes were a misspelling of a character or two.

To my understanding, there are a few ways to improve performance further:

  • Simply add more data (I can't find more online, and annotation is not feasible)
  • Train more (the model has already reached its fitting point)
  • Train a spell checker on Egyptian Arabic text scraped from the internet (there is a lot of it) and post-process the Whisper output with it

I'm currently working on the spell-checker approach (rough sketch below), but I'll need more time. I would like some expert opinions on my approach to the matter. Thank you!
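Here is a rough sketch of the kind of seq2seq corrector I have in mind, fine-tuned on {incorrect, correct} text pairs (e.g. built by synthetically corrupting scraped Egyptian Arabic text, or by aligning Whisper outputs with references). The checkpoint, the toy pair, and the hyperparameters are just placeholders, not a final recipe:

```python
# Hypothetical sketch: fine-tune a small multilingual seq2seq model as a spell
# corrector on {incorrect, correct} pairs. Checkpoint, data, and hyperparameters
# below are placeholders -- swap in whatever fits your setup.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

checkpoint = "google/mt5-small"  # any seq2seq model that covers Arabic
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy example pair; in practice this would be built from corrupted scraped text
# or from (Whisper hypothesis, reference transcript) pairs.
pairs = {"incorrect": ["مرحبه بيك"], "correct": ["مرحبا بيك"]}
dataset = Dataset.from_dict(pairs)

def preprocess(batch):
    model_inputs = tokenizer(batch["incorrect"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["correct"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="spell-corrector", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

At inference time, the corrector would simply be run on each Whisper transcription before scoring.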

@sanchit-gandhi

Thanks for posting @MightyStudent ! For training a spell checker, do you have a sufficient corpus of {incorrect, correct} pairs of text? Are you just planning on training a vanilla encoder-decoder model for this? Note that your inference time will increase with the size of the spell-checker model you use, whereas training with more {audio, text} data and reducing the WER of the Whisper model won't have any effect on inference time.

If you don't care about inference speed, you could also run your model with beam search at inference time (slower, but more accurate). Also make sure that you're running inference without quantisation enabled for peak performance: https://github.com/huggingface/peft/discussions/477#discussion-5213394
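For example, something along these lines; note that the checkpoint and the dummy audio are placeholders, and you'd load your own fine-tuned model / PEFT adapter instead:

```python
# Minimal sketch of beam-search decoding with Whisper in half precision
# (no 8-bit quantisation). Checkpoint and audio are placeholders.
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small", torch_dtype=torch.float16
).to("cuda")  # assumes a GPU is available

# Placeholder: 1 second of silence at 16 kHz -- replace with your real audio array.
audio = np.zeros(16_000, dtype=np.float32)

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.float16)

predicted_ids = model.generate(
    input_features,
    num_beams=5,          # beam search: slower, but typically more accurate
    language="ar",        # on older transformers versions, set forced_decoder_ids
    task="transcribe",    # via processor.get_decoder_prompt_ids instead
)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```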
