Abid committed on
Commit
31a2efa
1 Parent(s): a61ebcb
Gradio/app.py ADDED
@@ -0,0 +1,83 @@
+ import os
+ from datasets import load_dataset, Audio
+ from transformers import pipeline
+ import gradio as gr
+
+ ############### HF ###########################
+
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+ hf_writer = gr.HuggingFaceDatasetSaver(HF_TOKEN, "Urdu-ASR-flags")
+
+ ############## DVC ################################
+
+ PROD_MODEL_PATH = "Model"
+
+ if os.path.isdir(".dvc"):
+     print("Running DVC")
+     os.system("dvc config cache.type copy")
+     os.system("dvc config core.no_scm true")
+     if os.system(f"dvc pull {PROD_MODEL_PATH}") != 0:
+         exit("dvc pull failed")
+     os.system("rm -r .dvc")
+ # .apt/usr/lib/dvc
+
+ ############## Inference ##############################
+
+
+ def asr(audio):
+     # Build the ASR pipeline from the model directory pulled via DVC.
+     asr = pipeline("automatic-speech-recognition", model=PROD_MODEL_PATH)
+     prediction = asr(audio, chunk_length_s=5, stride_length_s=1)
+     return prediction["text"]
+
+
+ ################### Gradio Web APP ################################
+
+ title = "Urdu Automatic Speech Recognition"
+
+ description = """
+ <p>
+ <center>
+ Urdu Automatic Speech Recognition using Facebook's wav2vec2-xls-r-300m model fine-tuned on the mozilla-foundation common_voice_8_0 Urdu dataset. To test the model and code, please check out the link below.
+ <img src="https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu/resolve/main/Image/cover.jpg" alt="logo" width="250"/>
+ </center>
+ </p>
+ """
+ article = "<p style='text-align: center'><a href='https://dagshub.com/OperationSavta/SavtaDepth' target='_blank'>SavtaDepth Project from OperationSavta</a></p><p style='text-align: center'><a href='https://colab.research.google.com/drive/1XU4DgQ217_hUMU1dllppeQNw3pTRlHy1?usp=sharing' target='_blank'>Google Colab Demo</a></p></center></p>"
+
+ examples = [["Sample/sample1.mp3"], ["Sample/sample2.mp3"], ["Sample/sample3.mp3"]]
+
+
+ Input = gr.inputs.Audio(
+     source="microphone",
+     type="filepath",
+     optional=True,
+     label="Please Record Your Voice",
+ )
+ Output = gr.outputs.Textbox(label="Urdu Script")
+
+
+ def main():
+     iface = gr.Interface(
+         asr,
+         Input,
+         Output,
+         title=title,
+         flagging_options=["incorrect", "worst", "ambiguous"],
+         allow_flagging="manual",
+         flagging_callback=hf_writer,
+         # description=description,
+         article=article,
+         examples=examples,
+         theme="peach",
+     )
+
+     iface.launch(enable_queue=True)
+
+
+ # enable_queue=True,auth=("admin", "pass1234")
+
+ if __name__ == "__main__":
+     main()
+
Images/cover.jpg ADDED
Images/winner.png ADDED
README.md CHANGED
@@ -1,27 +1,38 @@
- # Urdu-ASR-SOTA

- Automatic Speech Recognition using Facebook wav2vec2-xls-r-300m model and mozilla-foundation common_voice_8_0 Urdu Dataset.

- ## wav2vec2-large-xls-r-300m-Urdu
- This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the common_voice dataset.

  It achieves the following results on the evaluation set:
  - Loss: 0.9889
  - Wer: 0.5607
  - Cer: 0.2370

- #### Evaluation Commands
- To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`

- ```bash
- python3 ./eval.py --model_id ./Model --dataset ./Data --config ur --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
- ```

  ```python
  import torch
  from datasets import load_dataset, Audio
  from transformers import pipeline
- import torchaudio.functional as F
  model = "Model"
  data = load_dataset("Data", "ur", split="test", delimiter="\t")
  def path_adjust(batch):
@@ -38,9 +49,101 @@ prediction
  # => {'text': 'اب یہ ونگین لمحاتانکھار دلمیں میںفوث کریلیا اجائ'}
  ```

- ### Eval results on Common Voice 8 "test" (WER):

- | Without LM | With LM (run `./eval.py`) |
- |---|---|
- | 56.21 | 46.37 |

+ ---
+ title: Urdu ASR SOTA
+ emoji: 👨‍🎤
+ colorFrom: pink
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 2.8.11
+ app_file: App/app.py
+ pinned: false
+ license: apache-2.0
+ ---

+ # Urdu Automatic Speech Recognition State of the Art Solution

+ ![cover](Images/cover.jpg)
+ Automatic Speech Recognition using Facebook's wav2vec2-xls-r-300m model and the mozilla-foundation common_voice_8_0 Urdu dataset.
+
+ ## Model Fine-tuning
+
+ This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the [common_voice dataset](https://commonvoice.mozilla.org/en/datasets).

  It achieves the following results on the evaluation set:
+
  - Loss: 0.9889
  - Wer: 0.5607
  - Cer: 0.2370

+ ## Quick Prediction

+ Install all dependencies using the `requirment.txt` file and then run the command below to predict the text:

  ```python
  import torch
  from datasets import load_dataset, Audio
  from transformers import pipeline
  model = "Model"
  data = load_dataset("Data", "ur", split="test", delimiter="\t")
  def path_adjust(batch):
  # => {'text': 'اب یہ ونگین لمحاتانکھار دلمیں میںفوث کریلیا اجائ'}
  ```
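+
+ A shorter sketch for transcribing a single audio file with the same model (using one of the bundled samples; the file choice is illustrative):
+
+ ```python
+ from transformers import pipeline
+
+ # Load the fine-tuned checkpoint from the local Model/ directory.
+ asr = pipeline("automatic-speech-recognition", model="Model")
+
+ # Chunked inference with a 1 s stride, as used by eval.py and the Gradio app.
+ prediction = asr("Sample/sample1.mp3", chunk_length_s=5, stride_length_s=1)
+ print(prediction["text"])
+ ```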

+ ## Evaluation Commands
+
+ To evaluate on `mozilla-foundation/common_voice_8_0` with the `test` split, copy and paste the command below into the terminal.
+
+ ```bash
+ python3 eval.py --model_id Model --dataset Data --config ur --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
+ ```
+
+ **OR**
+ Run the simple shell script:
+
+ ```bash
+ bash run_eval.sh
+ ```
+
+ ## Language Model
+
+ [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)
+
+ - Get suitable Urdu text data for a language model
+ - Build an n-gram with KenLM
+ - Combine the n-gram with a fine-tuned Wav2Vec2 checkpoint (a sketch of this step is shown below)
+
+ Install `kenlm` and `pyctcdecode` before running the notebook.
+
+ ```bash
+ pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
+ ```
+
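+ A sketch of the combination step, following the blog post above (the `5gram.arpa` filename is a placeholder for the KenLM model you build from the Urdu text corpus):
+
+ ```python
+ from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM
+ from pyctcdecode import build_ctcdecoder
+
+ # Load the processor of the fine-tuned checkpoint from the Model/ directory.
+ processor = AutoProcessor.from_pretrained("Model")
+
+ # Sort the tokenizer vocabulary by token id so the labels line up with the CTC logits.
+ vocab = processor.tokenizer.get_vocab()
+ labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]
+
+ # Build a beam-search decoder that scores hypotheses with the n-gram.
+ decoder = build_ctcdecoder(labels=labels, kenlm_model_path="5gram.arpa")
+
+ # Wrap everything into a processor that decodes with the language model.
+ processor_with_lm = Wav2Vec2ProcessorWithLM(
+     feature_extractor=processor.feature_extractor,
+     tokenizer=processor.tokenizer,
+     decoder=decoder,
+ )
+ processor_with_lm.save_pretrained("Model")
+ ```
+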
+ ## Eval Results
+
+ | Without LM (WER) | With LM (WER) |
+ | ---------------- | ------------- |
+ | 56.21            | 46.37         |
+
+ ## Directory Structure
+
+ ```
+ <root directory>
+ |
+ .- README.md
+ |
+ .- Data/
+ |
+ .- Model/
+ |
+ .- Images/
+ |
+ .- Sample/
+ |
+ .- Gradio/
+ |
+ .- Eval Results/
+ |
+ .- With LM/
+ |
+ .- Without LM/
+ | ...
+ .- notebook.ipynb
+ |
+ .- run_eval.sh
+ |
+ .- eval.py
+
+ ```
+
+ ## Gradio App
+
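+ The demo added in this commit lives in `Gradio/app.py`. Assuming the Python dependencies are installed and the DVC-tracked `Model/` directory has been pulled, it can be launched locally with:
+
+ ```bash
+ python Gradio/app.py
+ ```
+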
+ ## SOTA
+
+ - [x] Add Language Model
+ - [x] Webapp/API
+ - [ ] Denoise Audio
+ - [ ] Text Processing
+ - [ ] Spelling Mistakes
+ - [x] Hyperparameter optimization
+ - [ ] Training on 300 Epochs & 64 Batch Size
+ - [ ] Improved Language Model
+ - [ ] Contribute to Urdu ASR Audio Dataset
+
+ ## Robust Speech Recognition Challenge 2022
+
+ This project was the result of the HuggingFace [Robust Speech Recognition Challenge](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614). I was one of the winners, with four state-of-the-art ASR models. Check out my SOTA checkpoints:
+
+ - **[Urdu](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)**
+ - **[Arabic](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-300-arabic)**
+ - **[Punjabi](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-53-punjabi)**
+ - **[Irish](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-1b-Irish)**
+
+ ![winner](Images/winner.png)

+ ## References

+ - [Common Voice Dataset](https://commonvoice.mozilla.org/en/datasets)
+ - [Sequence Modeling With CTC](https://distill.pub/2017/ctc/)
+ - [Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)
+ - [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)
+ - [HF Model](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)
Sample/sample1.mp3 ADDED
Binary file (13 kB)
Sample/sample2.mp3 ADDED
Binary file (16.5 kB)
Sample/sample3.mp3 ADDED
Binary file (26 kB)