--- title: Urdu ASR SOTA emoji: 👨‍🎤 colorFrom: green colorTo: blue sdk: gradio app_file: Gradio/app.py pinned: true license: apache-2.0 --- # Urdu Automatic Speech Recognition State of the Art Solution ![cover](Images/cover.jpg) Automatic Speech Recognition using Facebook's wav2vec2-xls-r-300m model and mozilla-foundation common_voice_8_0 Urdu Dataset. ## Model Finetunning This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the [common_voice dataset](https://commonvoice.mozilla.org/en/datasets). It achieves the following results on the evaluation set: - Loss: 0.9889 - Wer: 0.5607 - Cer: 0.2370 ## Quick Prediction Install all dependecies using `requirment.txt` file and then run bellow command to predict the text: ```python import torch from datasets import load_dataset, Audio from transformers import pipeline model = "Model" data = load_dataset("Data", "ur", split="test", delimiter="\t") def path_adjust(batch): batch["path"] = "Data/ur/clips/" + str(batch["path"]) return batch data = data.map(path_adjust) sample_iter = iter(data.cast_column("path", Audio(sampling_rate=16_000))) sample = next(sample_iter) asr = pipeline("automatic-speech-recognition", model=model) prediction = asr( sample["path"]["array"], chunk_length_s=5, stride_length_s=1) prediction # => {'text': 'اب یہ ونگین لمحاتانکھار دلمیں میںفوث کریلیا اجائ'} ``` ## Evaluation Commands To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`, you can copy and past the command to the terminal. ```bash python3 eval.py --model_id Model --dataset Data --config ur --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs ``` **OR** Run the simple shell script ```bash bash run_eval.sh ``` ## Language Model [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram) - Get suitable Urdu text data for a language model - Build an n-gram with KenLM - Combine the n-gram with a fine-tuned Wav2Vec2 checkpoint Install kenlm and pyctcdecode before running the notebook. ```bash pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode ``` ## Eval Results | Without LM | With LM | | ---------- | ------- | | 56.21 | 46.37 | ## Directory Structure ``` | .- README.md | .- Data/ | .- Model/ | .- Images/ | .- Sample/ | .- Gradio/ | .- Eval Results/ | .- With LM/ | .- Without LM/ | ... .- notebook.ipynb | .- run_eval.sh | .- eval.py ``` ## Gradio App ## SOTA - [x] Add Language Model - [x] Webapp/API - [] Denoise Audio - [] Text Processing - [] Spelling Mistakes - [x] Hyperparameters optimization - [] Training on 300 Epochs & 64 Batch Size - [] Improved Language Model - [] Contribute to Urdu ASR Audio Dataset ## Robust Speech Recognition Challenge 2022 This project was the results of HuggingFace [Robust Speech Recognition Challenge](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614). I was one of the winner with four state of the art ASR model. Check out my SOTA checkpoints. - **[Urdu](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)** - **[Arabic](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-300-arabic)** - **[Punjabi](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-53-punjabi)** - **[Irish](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-1b-Irish)** ![winner](Images/winner.png) ## References - [Common Voice Dataset](https://commonvoice.mozilla.org/en/datasets) - [Sequence Modeling With CTC](https://distill.pub/2017/ctc/) - [Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) - [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram) - [HF Model](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)