---
license: cc-by-nc-4.0
language:
- bn
library_name: nemo
pipeline_tag: automatic-speech-recognition
---
## Hishab BN FastConformer
__Hishab BN FastConformer__ is a [FastConformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#fast-conformer)-based model trained on the ~18K-hour [MegaBNSpeech]() corpus.

Paper: [https://aclanthology.org/2023.banglalp-1.16/](https://aclanthology.org/2023.banglalp-1.16/)

## Usage
This model can be used to transcribe Bangla audio, or as a pre-trained checkpoint for fine-tuning on custom datasets with the [NeMo](https://github.com/NVIDIA/NeMo) framework.

### Installation
To install [NeMo](https://github.com/NVIDIA/NeMo), follow the NeMo installation documentation.

### Inference
[Download test_bn_fastconformer.wav](https://huggingface.co/hishab/hishab_bn_fastconformer/blob/main/test_bn_fastconformer.wav)
```py
# pip install -q 'nemo_toolkit[asr]'

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("hishab/hishab_bn_fastconformer")

audio_file = "test_bn_fastconformer.wav"
transcriptions = asr_model.transcribe([audio_file])
print(transcriptions)
# ['আজ সরকারি ছুটির দিন দেশের সব শিক্ষা প্রতিষ্ঠান সহ সরকারি আধা সরকারি স্বায়ত্তশাসিত প্রতিষ্ঠান ও ভবনে জাতীয় পতাকা অর্ধনমিত ও কালো পতাকা উত্তোলন করা হয়েছে']
```
Colab notebook for inference: [Bangla FastConformer Infer.ipynb](https://colab.research.google.com/drive/1J3bxXlLBgSf1zOKVKbRYu1VrbEJFLlUc?usp=sharing)
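For fine-tuning on a custom dataset, NeMo training is typically driven by a YAML/Hydra config whose dataset sections point at JSON-lines manifests. A minimal sketch, assuming NeMo's common ASR `train_ds`/`validation_ds` layout (exact keys may differ across NeMo versions; the manifest paths are placeholders):

```yaml
# Hypothetical fine-tuning data config sketch (NeMo-style manifest datasets).
# Each manifest line looks like:
#   {"audio_filepath": "/path/to/clip.wav", "duration": 3.2, "text": "transcript"}
model:
  train_ds:
    manifest_filepath: train_manifest.json  # placeholder path
    sample_rate: 16000
    batch_size: 16
    shuffle: true
  validation_ds:
    manifest_filepath: dev_manifest.json  # placeholder path
    sample_rate: 16000
    batch_size: 16
```

The loaded checkpoint from `ASRModel.from_pretrained` can then be trained against such a config following the NeMo fine-tuning tutorials.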

## Training Datasets

| Channels Category | Hours     |
| ----------------- | --------- |
| News              | 17,640.00 |
| Talkshow          | 688.82    |
| Vlog              | 0.02      |
| Crime Show        | 4.08      |
| Total             | 18,332.92 |


## Training Details

The training set comprises 17,640 hours of news channel content, 688.82 hours of talk shows, 0.02 hours of vlogs, and 4.08 hours of crime shows, for a total of roughly 18,333 hours.

## Evaluation


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64df9253cccd823564c3303b/WvMlp95z2-GXT6AYfwW8Y.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64df9253cccd823564c3303b/O2RA9TAedIv1OTqgdIap5.png)
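The benchmark results above are reported as word error rate (WER), the standard ASR metric: word-level edit distance between reference and hypothesis, divided by the reference length. A minimal pure-Python sketch of the metric (real evaluations typically also normalize text first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words -> WER 0.25
print(wer("আজ সরকারি ছুটির দিন", "আজ সরকার ছুটির দিন"))  # 0.25
```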

## Citation
```bibtex
@inproceedings{nandi-etal-2023-pseudo,
    title = "Pseudo-Labeling for Domain-Agnostic {B}angla Automatic Speech Recognition",
    author = "Nandi, Rabindra Nath  and
      Menon, Mehadi  and
      Muntasir, Tareq  and
      Sarker, Sagor  and
      Muhtaseem, Quazi Sarwar  and
      Islam, Md. Tariqul  and
      Chowdhury, Shammur  and
      Alam, Firoj",
    editor = "Alam, Firoj  and
      Kar, Sudipta  and
      Chowdhury, Shammur Absar  and
      Sadeque, Farig  and
      Amin, Ruhul",
    booktitle = "Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.banglalp-1.16",
    doi = "10.18653/v1/2023.banglalp-1.16",
    pages = "152--162",
    abstract = "One of the major challenges for developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then exploited the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR with publicly available datasets and compared it with other available models. To investigate the efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data among others. Our results demonstrate the efficacy of the model trained on psuedo-label data for the designed test-set along with publicly-available Bangla datasets. The experimental resources will be publicly available.https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR",
}
```