[Model architecture](#model-architecture) | [Datasets](#datasets)

This model transcribes speech into the lowercase Cyrillic alphabet (including space), and was trained on around 1636 hours of Russian speech data.
It is a "large" variant of Conformer-Transducer, with around 120 million parameters.
See the [model architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.

## Usage

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you have installed the latest version of PyTorch.

```bash
pip install nemo_toolkit['all']
```

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_ru_conformer_transducer_large")
```
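
If you have already downloaded the `.nemo` checkpoint file for this model, you can also load it locally via `restore_from`, a documented NeMo alternative to `from_pretrained`; the file path below is a placeholder, not a name from this card:

```python
import nemo.collections.asr as nemo_asr

# Load from a local checkpoint file instead of downloading it;
# the path is a placeholder for wherever you saved the .nemo file.
asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("path/to/stt_ru_conformer_transducer_large.nemo")
```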

### Transcribing using Python

Simply do:

```python
asr_model.transcribe(['<your_audio>.wav'])
```

### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="nvidia/stt_ru_conformer_transducer_large" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
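
Equivalently, you can batch-transcribe a directory from Python. This is a small sketch rather than anything from the card, and it assumes a NeMo version whose `transcribe()` accepts a `batch_size` argument:

```python
from pathlib import Path

# Collect the WAV files to transcribe; the directory is a placeholder.
files = sorted(str(p) for p in Path("<DIRECTORY CONTAINING AUDIO FILES>").glob("*.wav"))

# Transcribe in batches; larger batch_size trades GPU memory for speed.
transcripts = asr_model.transcribe(files, batch_size=4)
```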

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
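
If your recordings are not already 16 kHz mono WAV, one way to convert them is sketched below; it assumes `librosa` and `soundfile` are available (both are dependencies of NeMo's ASR collection), and the file names are placeholders:

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono in one step.
audio, sr = librosa.load("original_recording.mp3", sr=16000, mono=True)

# Write a 16 kHz mono WAV file the model can consume.
sf.write("input_16k_mono.wav", audio, sr)
```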

### Output

This model provides transcribed speech as a string for a given audio sample.
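
For example, transcribing the converted file from the previous section and printing the returned string (note that the exact return type of `transcribe()` for Transducer models has varied across NeMo releases, so inspect the result if your version differs):

```python
# Transcribe one 16 kHz mono WAV file and print the text.
transcriptions = asr_model.transcribe(["input_16k_mono.wav"])
print(transcriptions[0])  # e.g. a lowercase Cyrillic string
```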

## Model Architecture

Conformer-Transducer is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition which uses Transducer loss/decoding. You may find more information on the details of this model here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html).

## Training

The NeMo toolkit [3] was used for training the models for several hundred epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_transducer_bpe.yaml).

The vocabulary we use contains 33 characters:

```python
[' ', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я']
```

Rare symbols with diacritics were replaced during preprocessing.
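
The card does not spell out the replacement rules. As an illustration only (this is a hypothetical sketch, not the exact NeMo preprocessing): since 'ё' is absent from the vocabulary above, a normalization step along these lines would map text onto the 33 supported characters:

```python
# Hypothetical normalization sketch, not the exact preprocessing used:
# lowercase, map the out-of-vocabulary 'ё' to 'е', and drop any other
# character that is not one of the 33 supported symbols.
VOCAB = set(" абвгдежзийклмнопрстуфхцчшщъыьэюя")

def normalize(text: str) -> str:
    text = text.lower().replace("ё", "е")
    return "".join(ch for ch in text if ch in VOCAB)

print(normalize("Щёлкни дважды!"))  # -> "щелкни дважды"
```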

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

### Datasets

All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising more than a thousand hours of Russian speech:

- Mozilla Common Voice 10.0 (Russian) - train subset [28 hours]
- Golos - crowd [1070 hours] and farfield [111 hours] subsets
- Russian LibriSpeech (RuLS) [92 hours]
- SOVA - RuAudiobooksDevices [260 hours] and RuDevices [75 hours] subsets
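
These subsets add up to 28 + 1070 + 111 + 92 + 260 + 75 = 1636 hours, matching the approximate training-set size stated at the top of this card.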