Abid committed on
Commit
31a2efa
1 Parent(s): a61ebcb
Gradio/app.py ADDED
@@ -0,0 +1,83 @@
+ import os
+ from datasets import load_dataset, Audio
+ from transformers import pipeline
+ import gradio as gr
+
+ ############### HF ###########################
+
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+ hf_writer = gr.HuggingFaceDatasetSaver(HF_TOKEN, "Urdu-ASR-flags")
+
+ ############## DVC ################################
+
+ PROD_MODEL_PATH = "Model"
+
+ if os.path.isdir(".dvc"):
+     print("Running DVC")
+     os.system("dvc config cache.type copy")
+     os.system("dvc config core.no_scm true")
+     if os.system(f"dvc pull {PROD_MODEL_PATH}") != 0:
+         exit("dvc pull failed")
+     os.system("rm -r .dvc")
+ # .apt/usr/lib/dvc
+
+ ############## Inference ##############################
+
+
+ def asr(audio):
+     # Build the ASR pipeline from the model directory pulled via DVC.
+     asr = pipeline("automatic-speech-recognition", model=PROD_MODEL_PATH)
+     prediction = asr(audio, chunk_length_s=5, stride_length_s=1)
+     return prediction["text"]
+
+
+ ################### Gradio Web APP ################################
+
+ title = "Urdu Automatic Speech Recognition"
+
+ description = """
+ <p>
+ <center>
+ Urdu Automatic Speech Recognition using Facebook's wav2vec2-xls-r-300m model fine-tuned on the mozilla-foundation common_voice_8_0 Urdu dataset. To test the model and code, please check out the link below.
+ <img src="https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu/resolve/main/Image/cover.jpg" alt="logo" width="250"/>
+ </center>
+ </p>
+ """
+ article = "<p style='text-align: center'><a href='https://dagshub.com/OperationSavta/SavtaDepth' target='_blank'>SavtaDepth Project from OperationSavta</a></p><p style='text-align: center'><a href='https://colab.research.google.com/drive/1XU4DgQ217_hUMU1dllppeQNw3pTRlHy1?usp=sharing' target='_blank'>Google Colab Demo</a></p></center></p>"
+
+ examples = [["Sample/sample1.mp3"], ["Sample/sample2.mp3"], ["Sample/sample3.mp3"]]
+
+
+ Input = gr.inputs.Audio(
+     source="microphone",
+     type="filepath",
+     optional=True,
+     label="Please Record Your Voice",
+ )
+ Output = gr.outputs.Textbox(label="Urdu Script")
+
+
+ def main():
+     iface = gr.Interface(
+         asr,
+         Input,
+         Output,
+         title=title,
+         flagging_options=["incorrect", "worst", "ambiguous"],
+         allow_flagging="manual",
+         flagging_callback=hf_writer,
+         # description=description,
+         article=article,
+         examples=examples,
+         theme="peach",
+     )
+
+     iface.launch(enable_queue=True)
+
+
+ # enable_queue=True,auth=("admin", "pass1234")
+
+ if __name__ == "__main__":
+     main()
+
Images/cover.jpg ADDED
Images/winner.png ADDED
README.md CHANGED
@@ -1,27 +1,38 @@
- # Urdu-ASR-SOTA

- Automatic Speech Recognition using Facebook wav2vec2-xls-r-300m model and mozilla-foundation common_voice_8_0 Urdu Dataset.

- ## wav2vec2-large-xls-r-300m-Urdu
- This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the common_voice dataset.

  It achieves the following results on the evaluation set:
  - Loss: 0.9889
  - Wer: 0.5607
  - Cer: 0.2370

- #### Evaluation Commands
- To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`

- ```bash
- python3 ./eval.py --model_id ./Model --dataset ./Data --config ur --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
- ```

  ```python
  import torch
  from datasets import load_dataset, Audio
  from transformers import pipeline
- import torchaudio.functional as F
  model = "Model"
  data = load_dataset("Data", "ur", split="test", delimiter="\t")
  def path_adjust(batch):
@@ -38,9 +49,101 @@ prediction
  # => {'text': 'اب یہ ونگین لمحاتانکھار دلمیں میںفوث کریلیا اجائ'}
  ```

- ### Eval results on Common Voice 8 "test" (WER):

- | Without LM | With LM (run `./eval.py`) |
- |---|---|
- | 56.21 | 46.37 |

+ ---
+ title: Urdu ASR SOTA
+ emoji: 👨‍🎤
+ colorFrom: pink
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 2.8.11
+ app_file: App/app.py
+ pinned: false
+ license: apache-2.0
+ ---

+ # Urdu Automatic Speech Recognition State of the Art Solution

+ ![cover](Images/cover.jpg)
+ Automatic Speech Recognition using Facebook's wav2vec2-xls-r-300m model and the mozilla-foundation common_voice_8_0 Urdu dataset.
+
+ ## Model Fine-tuning
+
+ This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the [common_voice dataset](https://commonvoice.mozilla.org/en/datasets).

  It achieves the following results on the evaluation set:
+
  - Loss: 0.9889
  - Wer: 0.5607
  - Cer: 0.2370

+ ## Quick Prediction

+ Install all dependencies using the `requirment.txt` file and then run the command below to predict the text:

  ```python
  import torch
  from datasets import load_dataset, Audio
  from transformers import pipeline
  model = "Model"
  data = load_dataset("Data", "ur", split="test", delimiter="\t")
  def path_adjust(batch):
  # => {'text': 'اب یہ ونگین لمحاتانکھار دلمیں میںفوث کریلیا اجائ'}
  ```
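+
+ A shorter sketch for transcribing a single audio file with the same model (using one of the bundled samples; the file choice is illustrative):
+
+ ```python
+ from transformers import pipeline
+
+ # Load the fine-tuned checkpoint from the local Model/ directory.
+ asr = pipeline("automatic-speech-recognition", model="Model")
+
+ # Chunked inference with a 1 s stride, as used by eval.py and the Gradio app.
+ prediction = asr("Sample/sample1.mp3", chunk_length_s=5, stride_length_s=1)
+ print(prediction["text"])
+ ```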

+ ## Evaluation Commands
+
+ To evaluate on `mozilla-foundation/common_voice_8_0` with the `test` split, copy and paste the command below into the terminal.
+
+ ```bash
+ python3 eval.py --model_id Model --dataset Data --config ur --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
+ ```
+
+ **OR**
+ Run the simple shell script:
+
+ ```bash
+ bash run_eval.sh
+ ```
+
+ ## Language Model
+
+ [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)
+
+ - Get suitable Urdu text data for a language model
+ - Build an n-gram with KenLM
+ - Combine the n-gram with a fine-tuned Wav2Vec2 checkpoint (a sketch of this step is shown below)
+
+ Install `kenlm` and `pyctcdecode` before running the notebook.
+
+ ```bash
+ pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
+ ```
+
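+ A sketch of the combination step, following the blog post above (the `5gram.arpa` filename is a placeholder for the KenLM model you build from the Urdu text corpus):
+
+ ```python
+ from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM
+ from pyctcdecode import build_ctcdecoder
+
+ # Load the processor of the fine-tuned checkpoint from the Model/ directory.
+ processor = AutoProcessor.from_pretrained("Model")
+
+ # Sort the tokenizer vocabulary by token id so the labels line up with the CTC logits.
+ vocab = processor.tokenizer.get_vocab()
+ labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]
+
+ # Build a beam-search decoder that scores hypotheses with the n-gram.
+ decoder = build_ctcdecoder(labels=labels, kenlm_model_path="5gram.arpa")
+
+ # Wrap everything into a processor that decodes with the language model.
+ processor_with_lm = Wav2Vec2ProcessorWithLM(
+     feature_extractor=processor.feature_extractor,
+     tokenizer=processor.tokenizer,
+     decoder=decoder,
+ )
+ processor_with_lm.save_pretrained("Model")
+ ```
+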
+ ## Eval Results
+
+ | Without LM (WER) | With LM (WER) |
+ | ---------------- | ------------- |
+ | 56.21            | 46.37         |
+
+ ## Directory Structure
+
+ ```
+ <root directory>
+ |
+ .- README.md
+ |
+ .- Data/
+ |
+ .- Model/
+ |
+ .- Images/
+ |
+ .- Sample/
+ |
+ .- Gradio/
+ |
+ .- Eval Results/
+ |
+ .- With LM/
+ |
+ .- Without LM/
+ | ...
+ .- notebook.ipynb
+ |
+ .- run_eval.sh
+ |
+ .- eval.py
+
+ ```
+
+ ## Gradio App
+
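+ The demo added in this commit lives in `Gradio/app.py`. Assuming the Python dependencies are installed and the DVC-tracked `Model/` directory has been pulled, it can be launched locally with:
+
+ ```bash
+ python Gradio/app.py
+ ```
+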
+ ## SOTA
+
+ - [x] Add Language Model
+ - [x] Webapp/API
+ - [ ] Denoise Audio
+ - [ ] Text Processing
+ - [ ] Spelling Mistakes
+ - [x] Hyperparameter optimization
+ - [ ] Training on 300 Epochs & 64 Batch Size
+ - [ ] Improved Language Model
+ - [ ] Contribute to Urdu ASR Audio Dataset
+
+ ## Robust Speech Recognition Challenge 2022
+
+ This project was the result of the HuggingFace [Robust Speech Recognition Challenge](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614). I was one of the winners, with four state-of-the-art ASR models. Check out my SOTA checkpoints:
+
+ - **[Urdu](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)**
+ - **[Arabic](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-300-arabic)**
+ - **[Punjabi](https://huggingface.co/kingabzpro/wav2vec2-large-xlsr-53-punjabi)**
+ - **[Irish](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-1b-Irish)**
+
+ ![winner](Images/winner.png)

+ ## References

+ - [Common Voice Dataset](https://commonvoice.mozilla.org/en/datasets)
+ - [Sequence Modeling With CTC](https://distill.pub/2017/ctc/)
+ - [Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)
+ - [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram)
+ - [HF Model](https://huggingface.co/kingabzpro/wav2vec2-large-xls-r-300m-Urdu)
Sample/sample1.mp3 ADDED
Binary file (13 kB)
Sample/sample2.mp3 ADDED
Binary file (16.5 kB)
Sample/sample3.mp3 ADDED
Binary file (26 kB)