Updated readme with FastConformer Hybrid De relevant information and results

#3
Files changed (1)
  1. README.md +85 -113
README.md CHANGED
@@ -1,150 +1,116 @@
1
  ---
2
  language:
3
- - en
4
  library_name: nemo
5
  datasets:
6
- - librispeech_asr
7
- - fisher_corpus
8
- - Switchboard-1
9
- - WSJ-0
10
- - WSJ-1
11
- - National-Singapore-Corpus-Part-1
12
- - National-Singapore-Corpus-Part-6
13
- - vctk
14
- - VoxPopuli-(EN)
15
- - Europarl-ASR-(EN)
16
- - Multilingual-LibriSpeech-(2000-hours)
17
- - mozilla-foundation/common_voice_8_0
18
- - MLCommons/peoples_speech
19
  thumbnail: null
20
  tags:
21
  - automatic-speech-recognition
22
  - speech
23
  - audio
24
  - Transducer
25
- - Conformer
 
26
  - Transformer
27
  - pytorch
28
  - NeMo
29
  - hf-asr-leaderboard
30
  license: cc-by-4.0
31
- widget:
32
- - example_title: Librispeech sample 1
33
- src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
34
- - example_title: Librispeech sample 2
35
- src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
36
  model-index:
37
- - name: stt_en_conformer_transducer_xlarge
38
  results:
39
  - task:
40
  name: Automatic Speech Recognition
41
  type: automatic-speech-recognition
42
  dataset:
43
- name: LibriSpeech (clean)
44
- type: librispeech_asr
45
- config: clean
46
  split: test
47
  args:
48
- language: en
49
  metrics:
50
  - name: Test WER
51
  type: wer
52
- value: 1.62
53
- - task:
54
- type: Automatic Speech Recognition
55
- name: automatic-speech-recognition
56
- dataset:
57
- name: LibriSpeech (other)
58
- type: librispeech_asr
59
- config: other
60
- split: test
61
- args:
62
- language: en
63
- metrics:
64
- - name: Test WER
65
- type: wer
66
- value: 3.01
67
  - task:
68
  type: automatic-speech-recognition
69
  name: Automatic Speech Recognition
70
  dataset:
71
  name: Multilingual LibriSpeech
72
  type: facebook/multilingual_librispeech
73
- config: english
74
  split: test
75
  args:
76
- language: en
77
  metrics:
78
  - name: Test WER
79
  type: wer
80
- value: 5.32
81
  - task:
82
  type: automatic-speech-recognition
83
  name: Automatic Speech Recognition
84
  dataset:
85
- name: Mozilla Common Voice 7.0
86
- type: mozilla-foundation/common_voice_7_0
87
- config: en
88
  split: test
89
  args:
90
- language: en
91
  metrics:
92
  - name: Test WER
93
  type: wer
94
- value: 5.13
95
  - task:
96
- type: Automatic Speech Recognition
97
- name: automatic-speech-recognition
98
  dataset:
99
- name: Mozilla Common Voice 8.0
100
- type: mozilla-foundation/common_voice_8_0
101
- config: en
102
  split: test
103
  args:
104
- language: en
105
  metrics:
106
- - name: Test WER
107
  type: wer
108
- value: 6.46
109
  - task:
110
  type: automatic-speech-recognition
111
  name: Automatic Speech Recognition
112
  dataset:
113
- name: Wall Street Journal 92
114
- type: wsj_0
115
- args:
116
- language: en
117
- metrics:
118
- - name: Test WER
119
- type: wer
120
- value: 1.17
121
- - task:
122
- type: Automatic Speech Recognition
123
- name: automatic-speech-recognition
124
- dataset:
125
- name: Wall Street Journal 93
126
- type: wsj_1
127
  args:
128
- language: en
129
  metrics:
130
- - name: Test WER
131
  type: wer
132
- value: 2.05
133
  - task:
134
  type: automatic-speech-recognition
135
  name: Automatic Speech Recognition
136
  dataset:
137
- name: National Singapore Corpus
138
- type: nsc_part_1
 
 
139
  args:
140
- language: en
141
  metrics:
142
- - name: Test WER
143
  type: wer
144
- value: 5.7
 
 
145
  ---
146
 
147
- # NVIDIA Conformer-Transducer X-Large (en-US)
148
 
149
  <style>
150
  img {
@@ -152,24 +118,20 @@ img {
152
  }
153
  </style>
154
 
155
- | [![Model architecture](https://img.shields.io/badge/Model_Arch-Conformer--Transducer-lightgrey#model-badge)](#model-architecture)
156
- | [![Model size](https://img.shields.io/badge/Params-600M-lightgrey#model-badge)](#model-architecture)
157
- | [![Language](https://img.shields.io/badge/Language-en--US-lightgrey#model-badge)](#datasets)
158
 
159
 
160
- This model transcribes speech in lower case English alphabet along with spaces and apostrophes.
161
- It is an "extra-large" versions of Conformer-Transducer (around 600M parameters) model.
162
- See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.
163
 
164
  ## NVIDIA NeMo: Training
165
 
166
  To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
167
  ```
168
  pip install nemo_toolkit['all']
169
- '''
170
- '''
171
- (if it causes an error):
172
- pip install nemo_toolkit[all]
173
  ```
174
 
175
  ## How to Use this Model
@@ -180,7 +142,7 @@ The model is available for use in the NeMo toolkit [3], and can be used as a pre
180
 
181
  ```python
182
  import nemo.collections.asr as nemo_asr
183
- asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_en_conformer_transducer_xlarge")
184
  ```
185
 
186
  ### Transcribing using Python
@@ -195,12 +157,21 @@ asr_model.transcribe(['2086-149220-0033.wav'])
195
 
196
  ### Transcribing many audio files
197
 
 
198
  ```shell
199
  python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
200
- pretrained_name="nvidia/stt_en_conformer_transducer_xlarge"
201
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
202
  ```
203
 
 
 
 
 
 
 
 
 
204
  ### Input
205
 
206
  This model accepts 16000 Hz (16 kHz) mono-channel audio (wav files) as input.
@@ -211,42 +182,43 @@ This model provides transcribed speech as a string for a given audio sample.
211
 
212
  ## Model Architecture
213
 
214
- Conformer-Transducer model is an autoregressive variant of Conformer model [1] for Automatic Speech Recognition which uses Transducer loss/decoding instead of CTC Loss. You may find more info on the detail of this model here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html).
215
 
216
  ## Training
217
 
218
- The NeMo toolkit [3] was used for training the models for over several hundred epochs. These model are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_transducer_bpe.yaml).
219
 
220
  The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
221
 
222
  ### Datasets
223
 
224
- All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising of several thousand hours of English speech:
225
 
226
- - Librispeech 960 hours of English speech
227
- - Fisher Corpus
228
- - Switchboard-1 Dataset
229
- - WSJ-0 and WSJ-1
230
- - National Speech Corpus (Part 1, Part 6)
231
- - VCTK
232
- - VoxPopuli (EN)
233
- - Europarl-ASR (EN)
234
- - Multilingual Librispeech (MLS EN) - 2,000 hrs subset
235
- - Mozilla Common Voice (v8.0)
236
- - People's Speech - 12,000 hrs subset
237
-
238
- Note: older versions of the model may have trained on smaller set of datasets.
239
 
240
  ## Performance
241
 
242
- The list of the available models in this collection is shown in the following table. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
243
 
244
- | Version | Tokenizer | Vocabulary Size | LS test-other | LS test-clean | WSJ Eval92 | WSJ Dev93 | NSC Part 1 | MLS Test | MLS Dev | MCV Test 8.0 | Train Dataset |
245
- |---------|-----------------------|-----------------|---------------|---------------|------------|-----------|-----|-------|------|----|------|
246
- | 1.10.0 | SentencePiece Unigram | 1024 | 3.01 | 1.62 | 1.17 | 2.05 | 5.70 | 5.32 | 4.59 | 6.46 | NeMo ASRSET 3.0 |
247
 
248
  ## Limitations
249
- Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
250
 
251
  ## NVIDIA Riva: Deployment
252
 
@@ -267,4 +239,4 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
267
 
268
  ## Licence
269
 
270
- License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
 
1
  ---
2
  language:
3
+ - de
4
  library_name: nemo
5
  datasets:
6
+ - multilingual_librispeech
7
+ - mozilla-foundation/common_voice_12_0
8
+ - VoxPopuli-(DE)
 
 
 
 
 
 
 
 
 
 
9
  thumbnail: null
10
  tags:
11
  - automatic-speech-recognition
12
  - speech
13
  - audio
14
  - Transducer
15
+ - FastConformer
16
+ - CTC
17
  - Transformer
18
  - pytorch
19
  - NeMo
20
  - hf-asr-leaderboard
21
  license: cc-by-4.0
 
 
 
 
 
22
  model-index:
23
+ - name: stt_de_fastconformer_hybrid_large_pc
24
  results:
25
  - task:
26
  name: Automatic Speech Recognition
27
  type: automatic-speech-recognition
28
  dataset:
29
+ name: common-voice-12-0
30
+ type: mozilla-foundation/common_voice_12_0
31
+ config: de
32
  split: test
33
  args:
34
+ language: de
35
  metrics:
36
  - name: Test WER
37
  type: wer
38
+ value: 4.93
 
 
 
 
 
 
 
 
 
 
 
 
 
 
39
  - task:
40
  type: automatic-speech-recognition
41
  name: Automatic Speech Recognition
42
  dataset:
43
  name: Multilingual LibriSpeech
44
  type: facebook/multilingual_librispeech
45
+ config: german
46
  split: test
47
  args:
48
+ language: de
49
  metrics:
50
  - name: Test WER
51
  type: wer
52
+ value: 3.8
53
  - task:
54
  type: automatic-speech-recognition
55
  name: Automatic Speech Recognition
56
  dataset:
57
+ name: Vox Populi
58
+ type: polinaeterna/voxpopuli
59
+ config: german
60
  split: test
61
  args:
62
+ language: de
63
  metrics:
64
  - name: Test WER
65
  type: wer
66
+ value: 8.6
67
  - task:
68
+ name: Automatic Speech Recognition
69
+ type: automatic-speech-recognition
70
  dataset:
71
+ name: common-voice-12-0
72
+ type: mozilla-foundation/common_voice_12_0
73
+ config: German P&C
74
  split: test
75
  args:
76
+ language: de
77
  metrics:
78
+ - name: Test WER P&C
79
  type: wer
80
+ value: 5.39
81
  - task:
82
  type: automatic-speech-recognition
83
  name: Automatic Speech Recognition
84
  dataset:
85
+ name: Multilingual LibriSpeech
86
+ type: facebook/multilingual_librispeech
87
+ config: German P&C
88
+ split: test
 
 
 
 
 
 
 
 
 
 
89
  args:
90
+ language: de
91
  metrics:
92
+ - name: Test WER P&C
93
  type: wer
94
+ value: 11.1
95
  - task:
96
  type: automatic-speech-recognition
97
  name: Automatic Speech Recognition
98
  dataset:
99
+ name: Vox Populi
100
+ type: polinaeterna/voxpopuli
101
+ config: German P&C
102
+ split: test
103
  args:
104
+ language: de
105
  metrics:
106
+ - name: Test WER P&C
107
  type: wer
108
+ value: 10.41
109
+ metrics:
110
+ - wer
111
  ---
112
 
113
+ # NVIDIA FastConformer-Hybrid Large (de)
114
 
115
  <style>
116
  img {
 
118
  }
119
  </style>
120
 
121
+ | [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
122
+ | [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
123
+ | [![Language](https://img.shields.io/badge/Language-de-lightgrey#model-badge)](#datasets)
124
 
125
 
126
+ This model transcribes speech into the upper- and lower-case German alphabet, along with spaces, periods, commas, and question marks.
127
+ It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters).
128
+ See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.
129
 
130
  ## NVIDIA NeMo: Training
131
 
132
  To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
133
  ```
134
  pip install nemo_toolkit['all']
 
 
 
 
135
  ```
136
 
137
  ## How to Use this Model
 
142
 
143
  ```python
144
  import nemo.collections.asr as nemo_asr
145
+ asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_de_fastconformer_hybrid_large_pc")
146
  ```
147
 
148
  ### Transcribing using Python
 
157
 
158
  ### Transcribing many audio files
159
 
160
+ Using Transducer mode inference:
161
  ```shell
162
  python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
163
+ pretrained_name="nvidia/stt_de_fastconformer_hybrid_large_pc" \
164
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
165
  ```
166
 
167
+ Using CTC mode inference:
168
+ ```shell
169
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
170
+ pretrained_name="nvidia/stt_de_fastconformer_hybrid_large_pc" \
171
+ audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
172
+ decoder_type="ctc"
173
+ ```
174
+
175
  ### Input
176
 
177
  This model accepts 16000 Hz (16 kHz) mono-channel audio (wav files) as input.
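
As a quick pre-flight check for the input requirement above, the following minimal sketch inspects a WAV file's header with Python's standard `wave` module. The helper name `check_wav_format` is hypothetical and not part of NeMo, and `wave` only reads uncompressed PCM WAV files, so other encodings would need a different reader.

```python
import wave

EXPECTED_RATE_HZ = 16000  # 16 kHz, per the Input section
EXPECTED_CHANNELS = 1     # mono


def check_wav_format(path: str) -> list[str]:
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper for illustration only, not a NeMo API.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"file has {channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before transcription.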
 
182
 
183
  ## Model Architecture
184
 
185
+ FastConformer is an optimized version of the Conformer model [1] with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with joint Transducer and CTC decoder loss. You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) and about Hybrid Transducer-CTC training here: [Hybrid Transducer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc).
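
To get a feel for what 8x downsampling means for sequence length, here is a rough back-of-the-envelope sketch. It assumes a 10 ms feature hop, which is a common ASR front-end setting but is not stated in this README, and it ignores convolution padding effects, so the real encoder output length may differ by a frame or so. The function name is hypothetical.

```python
def approx_encoder_frames(duration_ms: int, hop_ms: int = 10, downsampling: int = 8) -> int:
    """Rough count of encoder output frames for an utterance.

    hop_ms=10 is an assumed feature hop, not a value taken from this README.
    """
    feature_frames = duration_ms // hop_ms
    # Ceiling division: each encoder frame covers `downsampling` feature frames.
    return -(-feature_frames // downsampling)


for seconds in (1, 10, 30):
    frames = approx_encoder_frames(seconds * 1000)
    print(f"{seconds:>2} s of audio -> about {frames} encoder frames")
```

Under these assumptions each encoder frame spans roughly 80 ms of audio, which is why the downsampling factor matters for attention cost and decoding speed.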
186
 
187
  ## Training
188
 
189
+ The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_transducer_ctc_bpe.yaml).
190
 
191
  The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
192
 
193
  ### Datasets
194
 
195
+ All the models in this collection are trained on a composite dataset (NeMo PnC ASRSET) comprising 2500 hours of German speech:
196
 
197
+ - MCV12 (800 hrs)
198
+ - MLS (1500 hrs)
199
+ - Voxpopuli (200 hrs)
 
 
 
 
 
 
 
 
 
 
200
 
201
  ## Performance
202
 
203
+ The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER). Since this model is trained on multiple domains and a much larger corpus, it will generally perform better at transcribing audio in general.
204
+
205
+ The following tables summarize the performance of the available models in this collection with the Transducer decoder. Performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.
206
+
207
+
208
+ a) On data without Punctuation and Capitalization with Transducer decoder
209
+ | **Version** | **Tokenizer** | **Vocabulary Size** | **MCV12 DEV** | **MCV12 TEST** | **MLS DEV** | **MLS TEST** | **VOXPOPULI DEV** | **VOXPOPULI TEST** |
210
+ |:-----------:|:---------------------:|:-------------------:|:-------------:|:--------------:|:-----------:|:------------:|:-----------------:|:------------------:|
211
+ | 1.18.0 | SentencePiece Unigram | 1024 | 4.18 | 4.93 | 3.3 | 3.8 | 10.8 | 8.6 |
212
+
213
+
214
+ b) On data with Punctuation and Capitalization with Transducer decoder
215
+ | **Version** | **Tokenizer** | **Vocabulary Size** | **MCV12 DEV** | **MCV12 TEST** | **MLS DEV** | **MLS TEST** | **VOXPOPULI DEV** | **VOXPOPULI TEST** |
216
+ |:-----------:|:---------------------:|:-------------------:|:-------------:|:--------------:|:-----------:|:------------:|:-----------------:|:------------------:|
217
+ | 1.18.0 | SentencePiece Unigram | 1024 | 4.66 | 5.39 | 10.12 | 11.1 | 12.96 | 10.41 |
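
For readers unfamiliar with the metric used in the tables above, the sketch below computes WER for a single reference/hypothesis pair as word-level edit distance divided by the number of reference words. It is an illustrative implementation, not the NeMo evaluation code; corpus-level figures like those reported here are normally obtained by summing errors and reference words over the whole test set rather than averaging per-utterance scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("das ist ein kleiner test", "das ist ein test"))
```

With punctuation and capitalization kept in the reference, "Test." and "test" count as different words, which is one reason the P&C numbers in table b) are higher than those in table a).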
218
 
 
 
 
219
 
220
  ## Limitations
221
+ Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech. The model only outputs the punctuation marks ```'.', ',', '?' ``` and hence might not do well in scenarios where other punctuation is also expected.
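
If downstream text is expected without punctuation and capitalization, the model output can be normalized after transcription. The sketch below strips only the three marks the model emits and lowercases the result; the function name is hypothetical, and the exact normalization used to produce the non-P&C results in this card is not documented here, so treat this as one plausible approach.

```python
MODEL_PUNCTUATION = ".,?"  # the only punctuation marks the model outputs


def strip_punctuation_and_case(text: str) -> str:
    """Remove the model's punctuation marks, lowercase, and collapse whitespace.

    Hypothetical helper for illustration only, not a NeMo API.
    """
    table = str.maketrans("", "", MODEL_PUNCTUATION)
    return " ".join(text.translate(table).lower().split())


print(strip_punctuation_and_case("Guten Morgen, wie geht es dir?"))
```

Note that blanket removal of periods also affects abbreviations and numbers such as "3.8", so a production pipeline would likely need more careful rules.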
222
 
223
  ## NVIDIA Riva: Deployment
224
 
 
239
 
240
  ## Licence
241
 
242
+ License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.