Push model using huggingface_hub.
- README.md +78 -163
- indicconformer_stt_sd_hybrid_rnnt_large.nemo +1 -1
README.md
CHANGED
@@ -1,196 +1,111 @@
|
|
1 |
---
|
2 |
-
|
3 |
-
library_name: nemo
|
4 |
-
tags:
|
5 |
-
- pytorch
|
6 |
-
- NeMo
|
7 |
---
|
8 |
|
9 |
-
|
10 |
-
|
11 |
-
<style>
|
12 |
-
img {
|
13 |
-
display: inline;
|
14 |
-
}
|
15 |
-
</style>
|
16 |
-
|
17 |
-
[![Model architecture](https://img.shields.io/badge/Model_Arch-PUT-YOUR-ARCHITECTURE-HERE-lightgrey#model-badge)](#model-architecture)
|
18 |
-
| [![Model size](https://img.shields.io/badge/Params-PUT-YOUR-MODEL-SIZE-HERE-lightgrey#model-badge)](#model-architecture)
|
19 |
-
| [![Language](https://img.shields.io/badge/Language-PUT-YOUR-LANGUAGE-HERE-lightgrey#model-badge)](#datasets)
|
20 |
-
|
21 |
-
**Put a short model description here.**
|
22 |
-
|
23 |
-
See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/index.html) for complete architecture details.
|
24 |
-
|
25 |
-
|
26 |
-
## NVIDIA NeMo: Training
|
27 |
-
|
28 |
-
To train, fine-tune, or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
|
29 |
-
```
|
30 |
-
pip install nemo_toolkit['all']
|
31 |
-
```
|
32 |
-
|
33 |
-
## How to Use this Model
|
34 |
|
35 |
-
|
36 |
|
37 |
-
|
38 |
|
39 |
-
|
40 |
-
|
41 |
-
```python
|
42 |
-
from nemo.core import ModelPT
|
43 |
-
model = ModelPT.from_pretrained("ai4bharat/indicconformer_stt_sd_hybrid_rnnt_large")
|
44 |
-
```
|
45 |
-
|
46 |
-
### NOTE
|
47 |
-
|
48 |
-
Add some information about how to use the model here. An example is provided for ASR inference below.
|
49 |
-
|
50 |
-
### Transcribing using Python
|
51 |
-
First, let's get a sample
|
52 |
-
```
|
53 |
-
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
|
54 |
-
```
|
55 |
-
Then simply do:
|
56 |
```
|
57 |
-
|
58 |
```
|
59 |
|
60 |
-
|
61 |
-
|
62 |
-
```
|
63 |
-
python
|
64 |
```
|
65 |
|
66 |
-
|
67 |
-
|
68 |
-
|
69 |
-
|
70 |
-
|
71 |
-
|
72 |
-
**Add some information about what are the outputs of this model**
|
73 |
-
|
74 |
-
## Model Architecture
|
75 |
-
|
76 |
-
**Add information here discussing architectural details of the model or any comments to users about the model.**
|
77 |
-
|
78 |
-
## Training
|
79 |
-
|
80 |
-
**Add information here about how the model was trained. It should be as detailed as possible, potentially including the link to the script used to train as well as the base config used to train the model. If extraneous scripts are used to prepare the components of the model, please include them here.**
|
81 |
-
|
82 |
-
### NOTE
|
83 |
-
|
84 |
-
An example is provided below for ASR
|
85 |
-
|
86 |
-
The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).
|
87 |
-
|
88 |
-
The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
|
89 |
-
|
90 |
-
|
91 |
-
### Datasets
|
92 |
-
|
93 |
-
**Try to provide as detailed a list of datasets as possible. If possible, provide links to the datasets on HF by adding it to the manifest section at the top of the README (marked by ---).**
|
94 |
-
|
95 |
-
### NOTE
|
96 |
-
|
97 |
-
An example for the manifest section is provided below for ASR datasets
|
98 |
-
|
99 |
-
datasets:
|
100 |
-
- librispeech_asr
|
101 |
-
- fisher_corpus
|
102 |
-
- Switchboard-1
|
103 |
-
- WSJ-0
|
104 |
-
- WSJ-1
|
105 |
-
- National-Singapore-Corpus-Part-1
|
106 |
-
- National-Singapore-Corpus-Part-6
|
107 |
-
- vctk
|
108 |
-
- voxpopuli
|
109 |
-
- europarl
|
110 |
-
- multilingual_librispeech
|
111 |
-
- mozilla-foundation/common_voice_8_0
|
112 |
-
- MLCommons/peoples_speech
|
113 |
-
|
114 |
-
The corresponding text in this section for those datasets is stated below -
|
115 |
|
116 |
-
|
117 |
|
118 |
-
|
119 |
|
120 |
-
-
|
121 |
-
- Fisher Corpus
|
122 |
-
- Switchboard-1 Dataset
|
123 |
-
- WSJ-0 and WSJ-1
|
124 |
-
- National Speech Corpus (Part 1, Part 6)
|
125 |
-
- VCTK
|
126 |
-
- VoxPopuli (EN)
|
127 |
-
- Europarl-ASR (EN)
|
128 |
-
- Multilingual Librispeech (MLS EN) - 2,000 hour subset
|
129 |
-
- Mozilla Common Voice (v7.0)
|
130 |
-
- People's Speech - 12,000 hour subset
|
131 |
|
|
|
132 |
|
133 |
-
|
134 |
|
135 |
-
|
136 |
|
137 |
-
|
|
|
138 |
|
139 |
-
|
140 |
-
|
141 |
-
model-index:
|
142 |
-
- name: PUT_MODEL_NAME
|
143 |
-
results:
|
144 |
-
- task:
|
145 |
-
name: Automatic Speech Recognition
|
146 |
-
type: automatic-speech-recognition
|
147 |
-
dataset:
|
148 |
-
name: AMI (Meetings test)
|
149 |
-
type: edinburghcstr/ami
|
150 |
-
config: ihm
|
151 |
-
split: test
|
152 |
-
args:
|
153 |
-
language: en
|
154 |
-
metrics:
|
155 |
-
- name: Test WER
|
156 |
-
type: wer
|
157 |
-
value: 17.10
|
158 |
-
- task:
|
159 |
-
name: Automatic Speech Recognition
|
160 |
-
type: automatic-speech-recognition
|
161 |
-
dataset:
|
162 |
-
name: Earnings-22
|
163 |
-
type: revdotcom/earnings22
|
164 |
-
split: test
|
165 |
-
args:
|
166 |
-
language: en
|
167 |
-
metrics:
|
168 |
-
- name: Test WER
|
169 |
-
type: wer
|
170 |
-
value: 14.11
|
171 |
|
172 |
-
|
173 |
|
174 |
-
|
175 |
|
176 |
-
|
177 |
|
178 |
-
|
179 |
|
180 |
|
181 |
-
|
182 |
|
183 |
-
|
184 |
|
|
|
185 |
Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
|
186 |
|
187 |
|
188 |
-
##
|
189 |
|
190 |
-
|
191 |
|
192 |
-
|
193 |
|
194 |
-
|
195 |
-
|
196 |
-
|
1 |
---
|
2 |
+
{}
|
3 |
---
|
4 |
|
5 |
+
## IndicConformer
|
6 |
|
7 |
+
IndicConformer is a Hybrid RNNT Conformer model built for Sindhi.
|
8 |
|
9 |
+
## AI4Bharat NeMo:
|
10 |
|
11 |
+
To load, train, fine-tune, or experiment with the model, you will need to install [AI4Bharat NeMo](https://github.com/AI4Bharat/NeMo). We recommend installing it with the command shown below:
|
12 |
```
|
13 |
+
git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
|
14 |
```
|
15 |
|
16 |
+
## Usage
|
17 |
+
|
18 |
+
```bash
|
19 |
+
$ python inference.py --help
|
20 |
+
usage: inference.py [-h] -c CHECKPOINT -f AUDIO_FILEPATH -d (cpu,cuda) -l LANGUAGE_CODE
|
21 |
+
|
22 |
+
options:
|
23 |
+
-h, --help show this help message and exit
|
24 |
+
-c CHECKPOINT, --checkpoint CHECKPOINT
|
25 |
+
Path to .nemo file
|
26 |
+
-f AUDIO_FILEPATH, --audio_filepath AUDIO_FILEPATH
|
27 |
+
Audio filepath
|
28 |
+
-d (cpu,cuda), --device (cpu,cuda)
|
29 |
+
Device (cpu/gpu)
|
30 |
+
-l LANGUAGE_CODE, --language_code LANGUAGE_CODE
|
31 |
+
Language Code (eg. hi)
|
32 |
```
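The help text above implies four required flags. As a rough illustration only, the sketch below shows how a parser with that interface could be declared using Python's standard `argparse`. The function name `build_parser` is hypothetical, and this is not the actual source of `inference.py`, which is not shown in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed in the help text.

    This is an assumption-based reconstruction, not the real inference.py.
    """
    parser = argparse.ArgumentParser(prog="inference.py")
    parser.add_argument("-c", "--checkpoint", required=True,
                        help="Path to .nemo file")
    parser.add_argument("-f", "--audio_filepath", required=True,
                        help="Audio filepath")
    parser.add_argument("-d", "--device", required=True,
                        choices=("cpu", "cuda"), help="Device (cpu/gpu)")
    parser.add_argument("-l", "--language_code", required=True,
                        help="Language Code (eg. hi)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run in isolation.
args = build_parser().parse_args(
    ["-c", "ai4b_indicConformer_hi.nemo", "-f", "hindi-16khz.wav",
     "-d", "cuda", "-l", "hi"]
)
print(args.checkpoint, args.device, args.language_code)
```

Restricting `--device` with `choices` makes the parser reject any value other than `cpu` or `cuda` before a model is loaded.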
|
33 |
|
34 |
+
## Example command
|
35 |
+
```
|
36 |
+
python inference.py -c ai4b_indicConformer_hi.nemo -f hindi-16khz.wav -d cuda -l hi
|
37 |
+
```
|
38 |
+
Expected output -
|
39 |
|
40 |
+
```
|
41 |
+
Loading model..
|
42 |
+
...
|
43 |
+
Transcibing..
|
44 |
+
----------
|
45 |
+
Transcript:
|
46 |
+
Took ** seconds.
|
47 |
+
----------
|
48 |
+
```
|
49 |
|
50 |
+
### Input
|
51 |
|
52 |
+
This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
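Before running inference, it can help to confirm that a file matches the input format stated above. The following is a minimal sketch using Python's standard `wave` module. The helper name `is_16khz_mono_wav` is hypothetical, and the check assumes an uncompressed PCM WAV file, which is the only kind the `wave` module reads.

```python
import wave


def is_16khz_mono_wav(path: str) -> bool:
    """Return True if the WAV file at `path` is mono and sampled at 16,000 Hz."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000
```

A file that fails this check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.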
|
53 |
|
54 |
+
### Output
|
55 |
|
56 |
+
This model provides transcribed speech as a string for a given audio sample.
|
57 |
|
58 |
+
## Model Architecture
|
59 |
|
60 |
+
This model is a Conformer-Large model, consisting of 120M parameters, as the encoder, with a hybrid CTC-RNNT decoder. The model has 17 conformer blocks with
|
61 |
+
512 as the model dimension.
|
62 |
|
63 |
+
## Training
|
64 |
|
65 |
+
<ADD INFORMATION ABOUT HOW THE MODEL WAS TRAINED - HOW MANY EPOCHS, AMOUNT OF COMPUTE ETC>
|
66 |
|
67 |
+
### Datasets
|
68 |
|
69 |
+
<LIST THE NAME AND SPLITS OF DATASETS USED TO TRAIN THIS MODEL (ALONG WITH LANGUAGE AND ANY ADDITIONAL INFORMATION)>
|
70 |
|
71 |
+
## Performance
|
72 |
|
73 |
+
<LIST THE SCORES OF THE MODEL -
|
74 |
+
OR
|
75 |
+
USE THE Hugging Face Evaluate LIBRARY TO UPLOAD METRICS>
|
76 |
|
77 |
+
## Limitations
|
78 |
|
79 |
+
<DECLARE ANY POTENTIAL LIMITATIONS OF THE MODEL>
|
80 |
|
81 |
+
E.g.:
|
82 |
Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
|
83 |
|
84 |
|
85 |
+
## References
|
86 |
|
87 |
+
<ADD ANY REFERENCES HERE AS NEEDED>
|
88 |
|
89 |
+
[1] [AI4Bharat NeMo Toolkit](https://github.com/AI4Bharat/NeMo)
|
90 |
|
91 |
+
language:
|
92 |
+
- Sindhi
|
93 |
+
license: mit
|
94 |
+
library_name: nemo
|
95 |
+
datasets:
|
96 |
+
- IndicVoices
|
97 |
+
- Vistaar
|
98 |
+
- Mahadhwani
|
99 |
+
thumbnail: null
|
100 |
+
tags:
|
101 |
+
- automatic-speech-recognition
|
102 |
+
- speech
|
103 |
+
- audio
|
104 |
+
- RNNT
|
105 |
+
- HybridConformer
|
106 |
+
- Transformer
|
107 |
+
- NeMo
|
108 |
+
- pytorch
|
109 |
+
model-index:
|
110 |
+
- name: indicconformer_stt_sd_hybrid_rnnt_large
|
111 |
+
results: []
|
indicconformer_stt_sd_hybrid_rnnt_large.nemo
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
size 523192320
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:b1448c79cc5b67497a50d402b27c6cb29ed31c471142b9cc95c1a93c72aa2356
|
3 |
size 523192320
|
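The `.nemo` entry above is a Git LFS pointer: it records only the spec version, a SHA-256 object ID, and the file size in bytes. As a hedged sketch, the helpers below show one way to check a downloaded checkpoint against such a pointer. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the code assumes the `sha256` algorithm shown in the pointer.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Compare a local file's size and SHA-256 digest with a parsed pointer."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so a large checkpoint is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]
```

Passing the three pointer lines shown above to `parse_lfs_pointer` and then calling `matches_pointer` on the downloaded `.nemo` file would flag a truncated or corrupted download.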