Urdu-ASR-SOTA / README.md
kingabzpro's picture
Update README.md
2665c6b
---
title: Urdu ASR SOTA
emoji: 👨‍🎤
colorFrom: green
colorTo: blue
sdk: gradio
app_file: Gradio/app.py
pinned: true
license: apache-2.0
---
# Urdu Automatic Speech Recognition State of the Art Solution
![cover](Images/cover.jpg)
Automatic Speech Recognition using Facebook's wav2vec2-xls-r-300m model and mozilla-foundation common_voice_8_0 Urdu Dataset.
## Model Finetunning
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the [common_voice dataset](https://commonvoice.mozilla.org/en/datasets).
It achieves the following results on the evaluation set:
- Loss: 0.9889
- Wer: 0.5607
- Cer: 0.2370
## Quick Prediction
Install all dependecies using `requirment.txt` file and then run bellow command to predict the text:
```python
import torch
from datasets import load_dataset, Audio
from transformers import pipeline
model = "Model"
data = load_dataset("Data", "ur", split="test", delimiter="\t")
def path_adjust(batch):
batch["path"] = "Data/ur/clips/" + str(batch["path"])
return batch
data = data.map(path_adjust)
sample_iter = iter(data.cast_column("path", Audio(sampling_rate=16_000)))
sample = next(sample_iter)
asr = pipeline("automatic-speech-recognition", model=model)
prediction = asr(
sample["path"]["array"], chunk_length_s=5, stride_length_s=1)
prediction
# => {'text': 'اب یہ ونگین لمحاتانکھار دلمیں میںفوث کریلیا اجائ'}
```
## Evaluation Commands
To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`, you can copy and past the command to the terminal.
```bash
python3 eval.py --model_id Model --dataset Data --config ur --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
```
**OR**
Run the simple shell script
```bash
bash run_eval.sh
```
## Language Model
[Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)
- Get suitable Urdu text data for a language model
- Build an n-gram with KenLM
- Combine the n-gram with a fine-tuned Wav2Vec2 checkpoint
Install kenlm and pyctcdecode before running the notebook.
```bash
pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
```
## Eval Results
| Without LM | With LM |
| ---------- | ------- |
| 56.21 | 46.37 |
## Directory Structure
```
<root directory>
|
.- README.md
|
.- Data/
|
.- Model/
|
.- Images/
|
.- Sample/
|
.- Gradio/
|
.- Eval Results/
|
.- With LM/
|
.- Without LM/
| ...
.- notebook.ipynb
|
.- run_eval.sh
|
.- eval.py
```
## Gradio App
## SOTA
- [x] Add Language Model
- [x] Webapp/API
- [] Denoise Audio
- [] Text Processing
- [] Spelling Mistakes
- [x] Hyperparameters optimization
- [] Training on 300 Epochs & 64 Batch Size
- [] Improved Language Model
- [] Contribute to Urdu ASR Audio Dataset
## Robust Speech Recognition Challenge 2022
This project was the results of HuggingFace [Robust Speech Recognition Challenge](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614). I was one of the winner with four state of the art ASR model. Check out my SOTA checkpoints.
- **[Urdu](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)**
- **[Arabic](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-300-arabic)**
- **[Punjabi](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-53-punjabi)**
- **[Irish](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-1b-Irish)**
![winner](Images/winner.png)
## References
- [Common Voice Dataset](https://commonvoice.mozilla.org/en/datasets)
- [Sequence Modeling With CTC](https://distill.pub/2017/ctc/)
- [Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)
- [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)
- [HF Model](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)