[Model architecture](#model-architecture) | [Datasets](#datasets)

This model transcribes speech into the lowercase Cyrillic alphabet (including space), and was trained on around 1636 hours of Russian speech data.
It is a "large" variant of Conformer-Transducer, with around 120 million parameters.
See the [model architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.

## Usage

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you have installed the latest version of PyTorch.

```bash
pip install nemo_toolkit['all']
```

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_ru_conformer_transducer_large")
```
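
If you have already downloaded the `.nemo` checkpoint file for this model, you can also load it locally via `restore_from`, a documented NeMo alternative to `from_pretrained`; the file path below is a placeholder, not a name from this card:

```python
import nemo.collections.asr as nemo_asr

# Load from a local checkpoint file instead of downloading it;
# the path is a placeholder for wherever you saved the .nemo file.
asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("path/to/stt_ru_conformer_transducer_large.nemo")
```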

### Transcribing using Python

Simply do:

```python
asr_model.transcribe(['<your_audio>.wav'])
```

### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="nvidia/stt_ru_conformer_transducer_large" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
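
Equivalently, you can batch-transcribe a directory from Python. This is a small sketch rather than anything from the card, and it assumes a NeMo version whose `transcribe()` accepts a `batch_size` argument:

```python
from pathlib import Path

# Collect the WAV files to transcribe; the directory is a placeholder.
files = sorted(str(p) for p in Path("<DIRECTORY CONTAINING AUDIO FILES>").glob("*.wav"))

# Transcribe in batches; larger batch_size trades GPU memory for speed.
transcripts = asr_model.transcribe(files, batch_size=4)
```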

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
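
If your recordings are not already 16 kHz mono WAV, one way to convert them is sketched below; it assumes `librosa` and `soundfile` are available (both are dependencies of NeMo's ASR collection), and the file names are placeholders:

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono in one step.
audio, sr = librosa.load("original_recording.mp3", sr=16000, mono=True)

# Write a 16 kHz mono WAV file the model can consume.
sf.write("input_16k_mono.wav", audio, sr)
```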

### Output

This model provides transcribed speech as a string for a given audio sample.
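
For example, transcribing the converted file from the previous section and printing the returned string (note that the exact return type of `transcribe()` for Transducer models has varied across NeMo releases, so inspect the result if your version differs):

```python
# Transcribe one 16 kHz mono WAV file and print the text.
transcriptions = asr_model.transcribe(["input_16k_mono.wav"])
print(transcriptions[0])  # e.g. a lowercase Cyrillic string
```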

## Model Architecture

Conformer-Transducer is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition which uses Transducer loss/decoding. You may find more information on the details of this model here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html).

## Training

The NeMo toolkit [3] was used for training the models for several hundred epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_transducer_bpe.yaml).

The vocabulary we use contains 33 characters:

```python
[' ', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я']
```

Rare symbols with diacritics were replaced during preprocessing.
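
The card does not spell out the replacement rules. As an illustration only (this is a hypothetical sketch, not the exact NeMo preprocessing): since 'ё' is absent from the vocabulary above, a normalization step along these lines would map text onto the 33 supported characters:

```python
# Hypothetical normalization sketch, not the exact preprocessing used:
# lowercase, map the out-of-vocabulary 'ё' to 'е', and drop any other
# character that is not one of the 33 supported symbols.
VOCAB = set(" абвгдежзийклмнопрстуфхцчшщъыьэюя")

def normalize(text: str) -> str:
    text = text.lower().replace("ё", "е")
    return "".join(ch for ch in text if ch in VOCAB)

print(normalize("Щёлкни дважды!"))  # -> "щелкни дважды"
```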

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

### Datasets

All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising more than a thousand hours of Russian speech:

- Mozilla Common Voice 10.0 (Russian) - train subset [28 hours]
- Golos - crowd [1070 hours] and farfield [111 hours] subsets
- Russian LibriSpeech (RuLS) [92 hours]
- SOVA - RuAudiobooksDevices [260 hours] and RuDevices [75 hours] subsets
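
These subsets add up to 28 + 1070 + 111 + 92 + 260 + 75 = 1636 hours, matching the approximate training-set size stated at the top of this card.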