Gorgarik committed on
Commit e6cb78a
1 Parent(s): 62b023a

Update README.md

Files changed (1): README.md (+51 -13)
README.md CHANGED

---
language:
- en
library_name: nemo
datasets:
- mozilla-foundation/common_voice_10_0

- name: Test WER
  type: wer
  value: 3.8

---

# NVIDIA Conformer-Transducer Large (be-Bel)

| [![Language](https://img.shields.io/badge/Language-be--Belarusian-lightgrey#model-badge)](#datasets)

This model transcribes speech in the lowercase Belarusian alphabet, including spaces and apostrophes.
It is a "large" variant of the Conformer-Transducer model, with around 120M parameters.
See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.
 
## NVIDIA NeMo: Training

To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.

```
pip install nemo_toolkit['all']
```

If the above causes an error, try:

```
pip install nemo_toolkit[all]
```
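
As a quick sanity check that the toolkit is importable (a minimal sketch, not part of the original card):

```python
# Verify that NeMo and its ASR collection import correctly.
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```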

## How to Use this Model

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_be_conformer_transducer_large")
```
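
Once instantiated, the checkpoint can transcribe audio directly; a minimal sketch, assuming a 16 kHz mono WAV at the placeholder path `sample.wav` (the exact return type of `transcribe` varies across NeMo versions):

```python
# Transcribe a single 16 kHz mono WAV file; the path is a placeholder.
# On some NeMo versions, RNNT models return a (best_hypotheses, all_hypotheses)
# tuple instead of a plain list of strings.
transcriptions = asr_model.transcribe(["sample.wav"])
print(transcriptions)
```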

### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="nvidia/stt_be_conformer_transducer_large" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```

### Input

This model accepts 16000 Hz mono-channel audio (WAV files) as input.
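
If your audio is in a different format or sample rate, convert it first; a minimal sketch using librosa and soundfile (neither tool is mentioned in the original card; the file names are placeholders):

```python
# Convert an arbitrary audio file to the 16 kHz mono WAV the model expects.
import librosa
import soundfile as sf

audio, _ = librosa.load("input.mp3", sr=16000, mono=True)  # placeholder input path
sf.write("input_16k.wav", audio, 16000)
```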

### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

Conformer-Transducer is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition which uses Transducer loss/decoding instead of CTC loss. You may find more information about this model here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html).
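
To inspect the architecture and confirm the parameter count of an instantiated checkpoint (a small sketch using the generic NeMo model API, assuming `asr_model` from above; `num_weights` reports trainable parameters):

```python
# Print the module tree and the trainable parameter count (~120M for "large").
print(asr_model)
print(f"{asr_model.num_weights:,} trainable parameters")
```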

## Training

The NeMo toolkit [3] was used for training the models for over several hundred epochs.
The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

### Datasets

All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising several hundred hours of Belarusian speech:

- Mozilla Common Voice (v10.0)

## Performance

Performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.

| Version | Tokenizer            | Vocabulary Size | MCV 10 Test WER (%) | Train Dataset |
|---------|----------------------|-----------------|---------------------|---------------|
| 1.12.0  | Google SentencePiece | 1024            | 3.8                 | MCV 10        |
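
Greedy-decoding WER can be reproduced with NeMo's built-in metric; a minimal sketch assuming hypotheses and reference transcripts as parallel lists of strings (the example strings are placeholders):

```python
# Compute Word Error Rate between model outputs and ground-truth transcripts.
from nemo.collections.asr.metrics.wer import word_error_rate

hyps = ["прывітанне сусвет"]  # placeholder model outputs
refs = ["прывітанне сусвет"]  # placeholder references
print(f"WER: {word_error_rate(hypotheses=hyps, references=refs):.2%}")
```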

## Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech.

## NVIDIA Riva: Deployment

[NVIDIA Riva](https://developer.nvidia.com/riva) is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.