| [![Model size](https://img.shields.io/badge/Params-120M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-ru-lightgrey#model-badge)](#datasets)

This model transcribes speech into lowercase Cyrillic characters, including spaces, and is trained on around 1636 hours of Russian speech data.
It is a "large" variant of Conformer-Transducer, with around 120 million parameters.
See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.

## Usage

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.

```shell
pip install nemo_toolkit['all']
```

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_ru_conformer_transducer_large")
```

### Transcribing using Python

Simply do:

```python
asr_model.transcribe(['<your_audio>.wav'])
```

### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_ru_conformer_transducer_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
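
If your recordings are in a different format or sample rate, a minimal conversion sketch is below. It assumes `librosa` and `soundfile` are available (both are pulled in by the NeMo installation above); the file names are placeholders.

```python
# Sketch: convert an arbitrary audio file to the 16 kHz mono WAV the model
# expects. File names here are hypothetical placeholders.
import librosa
import soundfile as sf

# librosa resamples to 16 kHz and downmixes to mono on load.
audio, sr = librosa.load("recording.mp3", sr=16000, mono=True)
sf.write("recording_16k_mono.wav", audio, sr)
```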

### Output

This model provides transcribed speech as a string for a given audio sample.
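
Putting the pieces together, a short end-to-end sketch is shown below. The exact return type of `transcribe()` has varied across NeMo versions (a list of strings, or a tuple/list of hypothesis objects), so the print statement is deliberately generic; `sample.wav` is a placeholder.

```python
# End-to-end sketch: load the checkpoint and transcribe one 16 kHz mono file.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "nvidia/stt_ru_conformer_transducer_large"
)
result = asr_model.transcribe(["sample.wav"])  # placeholder file name
print(result)  # transcription(s) for the input file(s)
```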

## Model Architecture

The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition which uses Transducer loss/decoding. You may find more information on the details of this model here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html).

## Training

The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_transducer_bpe.yaml).
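
For fine-tuning on your own data, a hedged sketch of the high-level NeMo API is below; the manifest path and hyperparameters are placeholders, and the dataloader keys follow the standard NeMo ASR config layout rather than anything stated in this card.

```python
# Hedged fine-tuning sketch; manifest path and hyperparameters are
# placeholders, not values from the card.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "nvidia/stt_ru_conformer_transducer_large"
)

# Standard NeMo dataloader config keys (manifest_filepath, sample_rate, ...).
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # placeholder
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_training_data(train_data_config=train_cfg)

trainer = pl.Trainer(max_epochs=50, accelerator="gpu", devices=1)
trainer.fit(asr_model)
```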

The vocabulary we use contains 33 characters:

```python
[' ', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я']
```

Rare symbols with diacritics were replaced during preprocessing.
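
The card does not spell out the replacement rules; purely as an illustration of this kind of normalization, the hypothetical sketch below maps text onto the 33-character vocabulary.

```python
# Hypothetical illustration only - not the card's actual preprocessing.
# Maps text onto the 33-character vocabulary, e.g. 'ё' -> 'е' and
# stress-accented vowels -> plain vowels.
import unicodedata

VOCAB = set(" абвгдежзийклмнопрстуфхцчшщъыьэюя")

def normalize(text: str) -> str:
    text = text.lower().replace("ё", "е")
    # Decompose accented characters and drop the combining marks.
    text = "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )
    # Keep only characters from the model vocabulary.
    return "".join(ch for ch in text if ch in VOCAB)

print(normalize("Привет, мир!"))  # -> "привет мир"
```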

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
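
The linked script wraps SentencePiece; purely as a conceptual sketch of what it produces, the snippet below trains a BPE tokenizer directly with the `sentencepiece` library. The file names and vocabulary size are assumptions, not values from the card.

```python
# Conceptual sketch only: the linked NeMo script builds a SentencePiece
# tokenizer from the training transcripts. File names and vocab_size are
# assumptions, not values from the card.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",  # one transcript per line (placeholder)
    model_prefix="tokenizer",       # writes tokenizer.model / tokenizer.vocab
    vocab_size=1024,                # assumed
    model_type="bpe",
    character_coverage=1.0,
)
```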

### Datasets

All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising around 1636 hours of Russian speech:

- Mozilla Common Voice 10.0 (Russian) - train subset [28 hours]
- Golos - crowd [1070 hours] and farfield [111 hours] subsets
- Russian LibriSpeech (RuLS) [92 hours]
- SOVA - RuAudiobooksDevices [260 hours] and RuDevices [75 hours] subsets