Introduction

Whisper is focused on Modern Standard Arabic (MSA) (الفصحي), not the Egyptian Arabic we use day to day (العامية المصرية). This is a major issue, since a machine learning model of any kind needs to be trained on data that resembles the problem at hand.

Data

Annotated Egyptian Arabic voice recordings are extremely scarce, as the majority of the available data is MSA. To work around this we combined typical approaches, such as collecting publicly available online data, with less typical ones, such as web scraping, using YouTube data, and creating our own dataset with synchronized annotations. Unfortunately, we can't provide the data used in our training.

Fine-tuned using PEFT

We adopted PEFT (parameter-efficient fine-tuning) to fine-tune Whisper on our curated dataset of Egyptian Arabic voice recordings. The results were surprisingly good, considering that PEFT trained only a set of extra parameters, equal to 1% of the size of the original model, attached to the main pre-trained model. Training was quick, the adapters were small in size, and inference (the time when we operate our model) was relatively quick too.
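To give a feel for how adapter training can end up at roughly 1% of the base model, here is a minimal arithmetic sketch. It assumes a LoRA-style low-rank adapter, which is one common PEFT method; the text above does not say which method was used. The model width, number of adapted matrices, base parameter count, and rank are all hypothetical values chosen for illustration, not our actual configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one low-rank adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * d_in + d_out * rank


def trainable_fraction(base_params, adapted_shapes, rank):
    """Extra trainable parameters as a fraction of the frozen base model size."""
    extra = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in adapted_shapes)
    return extra / base_params


# Hypothetical numbers for illustration only (assumptions, not the real setup):
D_MODEL = 1280                 # assumed hidden size of the base model
NUM_ADAPTED_MATRICES = 192     # assumed number of weight matrices that get an adapter
BASE_PARAMS = 1_550_000_000    # assumed size of the frozen base model
RANK = 32                      # assumed adapter rank

shapes = [(D_MODEL, D_MODEL)] * NUM_ADAPTED_MATRICES
fraction = trainable_fraction(BASE_PARAMS, shapes, RANK)
print(f"extra trainable parameters: {fraction:.2%} of the base model")
```

Under these assumptions the adapters add about 15.7 million parameters, which is why the saved adapter files stay small compared with the full model.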

Generation techniques and results

We then experimented with the generation techniques of the transformer architecture that Whisper is built on, particularly beam search and greedy search, and achieved a small bump in performance. We also implemented a sampling technique, which makes our model nondeterministic, meaning that the same input can yield different outputs. In our experiments this resulted in better performance overall, as tested over multiple runs. The gain from the generation techniques brought our Word Error Rate (WER) down to 30% and our Character Error Rate (CER) down to 17%, as tested on the test subset of the Egyptian Arabic dataset. The running time is around 10% of the audio duration, meaning that a 100-second audio clip takes about 10 seconds to convert to text.
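The difference between greedy search and sampling can be illustrated on a single decoding step. The sketch below is a toy example in plain Python, not our inference code: the vocabulary, scores, and temperature are made up, and beam search is left out for brevity.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert scores to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy search: always take the highest-scoring token (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Sampling: draw a token at random in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy vocabulary and scores for one decoding step (made up for illustration).
vocab = ["a", "b", "c"]
logits = [2.0, 1.5, 0.3]

rng = random.Random(0)  # fixed seed only so this demo is repeatable
greedy_token = vocab[greedy_pick(logits)]
sampled_tokens = [vocab[sample_pick(logits, 1.0, rng)] for _ in range(10)]

print("greedy: ", greedy_token)
print("sampled:", sampled_tokens)
```

Greedy search returns the same token every time, while sampling can return any token with nonzero probability. That is why sampled transcriptions may differ between runs and why results are best compared over several runs.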

Word error rate as tested on Egyptian Arabic dataset
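For readers unfamiliar with the metrics, WER and CER are both edit distances normalized by the length of the reference transcript, counted over words and characters respectively. The following is a minimal sketch of that standard definition, not the evaluation script we used. The function names and the example sentences are illustrative, and no text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative example: one substituted word and one dropped word.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

CER is usually lower than WER on the same output because a single wrong character counts as a whole wrong word in WER.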