---
title: Urdu ASR SOTA
emoji: 👨‍🎤
colorFrom: green
colorTo: blue
sdk: gradio
app_file: Gradio/app.py
pinned: true
license: apache-2.0
---

# Urdu Automatic Speech Recognition State of the Art Solution

![cover](Images/cover.jpg)
Automatic speech recognition using Facebook's wav2vec2-xls-r-300m model and the mozilla-foundation common_voice_8_0 Urdu dataset.

## Model Fine-tuning

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the [common_voice dataset](https://commonvoice.mozilla.org/en/datasets).

It achieves the following results on the evaluation set:

- Loss: 0.9889
- Wer: 0.5607
- Cer: 0.2370
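For reference, WER and CER are edit-distance metrics computed over words and characters respectively. A minimal pure-Python sketch of WER (the evaluation itself uses a library implementation; this is only to illustrate the metric):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]          # dp[i-1][0]
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]       # dp[i-1][j]
            dp[j] = min(dp[j] + 1,           # deletion
                        dp[j - 1] + 1,       # insertion
                        prev + (r != h))     # substitution or match
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

CER is the same computation applied to character sequences instead of word lists.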

## Quick Prediction

Install all dependencies using the `requirment.txt` file, then run the code below to predict the text:

```python
from datasets import load_dataset, Audio
from transformers import pipeline

model = "Model"  # local fine-tuned checkpoint directory

# Load the local Common Voice Urdu test split and point each row at its clip.
data = load_dataset("Data", "ur", split="test", delimiter="\t")

def path_adjust(batch):
    batch["path"] = "Data/ur/clips/" + str(batch["path"])
    return batch

data = data.map(path_adjust)

# Decode the audio at 16 kHz, the sampling rate the model was trained on.
sample_iter = iter(data.cast_column("path", Audio(sampling_rate=16_000)))
sample = next(sample_iter)

asr = pipeline("automatic-speech-recognition", model=model)
prediction = asr(sample["path"]["array"], chunk_length_s=5, stride_length_s=1)
print(prediction)
# => {'text': 'اب یہ ونگین لمحاتانکھار دلمیں میںفوث کریلیا اجائ'}
```

## Evaluation Commands

To evaluate on `mozilla-foundation/common_voice_8_0` with the `test` split, copy and paste the command below into your terminal:

```bash
python3 eval.py --model_id Model --dataset Data --config ur --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
```

**OR**

Run the shell script:

```bash
bash run_eval.sh
```

## Language Model

[Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)

- Get suitable Urdu text data for a language model
- Build an n-gram with KenLM
- Combine the n-gram with a fine-tuned Wav2Vec2 checkpoint

Install `kenlm` and `pyctcdecode` before running the notebook:

```bash
pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
```
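For context on what the LM improves: without it, CTC output is decoded greedily by taking the argmax token per frame, collapsing repeats, and dropping the blank token; the n-gram decoder rescores beams against this baseline. A minimal sketch of greedy CTC decoding (the tiny vocabulary and frame IDs here are illustrative, not the model's actual vocab):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated per-frame predictions, then remove CTC blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

# Illustrative 5-symbol vocabulary; the real wav2vec2 vocab covers the Urdu script.
vocab = {0: "<blank>", 1: "a", 2: "b", 3: " ", 4: "c"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 4, 4]   # per-frame argmax token IDs
print(ctc_greedy_decode(frames, vocab))       # → "ab c"
```

The blank token is what lets CTC emit genuine double letters: "bb" is encoded as `b <blank> b`, which survives the repeat-collapsing step.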

## Eval Results

| WER (%) without LM | WER (%) with LM |
| ------------------ | --------------- |
| 56.21              | 46.37           |
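From the two word error rates above, the n-gram LM yields roughly a 17.5% relative WER reduction:

```python
wer_no_lm, wer_with_lm = 56.21, 46.37  # WER (%) from the table above
relative_reduction = (wer_no_lm - wer_with_lm) / wer_no_lm
print(f"{relative_reduction:.1%}")  # → 17.5%
```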

## Directory Structure

```
<root directory>
    |
    .- README.md
    |
    .- Data/
    |
    .- Model/
    |
    .- Images/
    |
    .- Sample/
    |
    .- Gradio/
    |
    .- Eval Results/
    |     |
    |     .- With LM/
    |     |
    |     .- Without LM/
    |
    .- notebook.ipynb
    |
    .- run_eval.sh
    |
    .- eval.py
```

## Gradio App

The demo is a Gradio app; `Gradio/app.py` is the `app_file` declared in the front matter above, so this repository runs directly as a Hugging Face Space.

## SOTA

- [x] Add language model
- [x] Webapp/API
- [ ] Denoise audio
- [ ] Text processing
- [ ] Spelling mistakes
- [x] Hyperparameter optimization
- [ ] Training for 300 epochs with a batch size of 64
- [ ] Improved language model
- [ ] Contribute to the Urdu ASR audio dataset

## Robust Speech Recognition Challenge 2022

This project was the result of the HuggingFace [Robust Speech Recognition Challenge](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614). I was one of the winners, with four state-of-the-art ASR models. Check out my SOTA checkpoints:

- **[Urdu](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)**
- **[Arabic](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-300-arabic)**
- **[Punjabi](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-53-punjabi)**
- **[Irish](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-1b-Irish)**

![winner](Images/winner.png)

## References

- [Common Voice Dataset](https://commonvoice.mozilla.org/en/datasets)
- [Sequence Modeling With CTC](https://distill.pub/2017/ctc/)
- [Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)
- [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)
- [HF Model](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)