Tags: Automatic Speech Recognition · NeMo · PyTorch · English · speech · streaming · audio · Transducer · Conformer · CTC
vnoroozi committed · Commit 54a6e08 · 1 Parent(s): 4fb986e

Update README.md

Files changed (1):
  1. README.md +30 -112
README.md CHANGED
@@ -65,7 +65,7 @@ img {
| [![Language](https://img.shields.io/badge/Language-en--US-lightgrey#model-badge)](#datasets)

This collection contains large-size versions of cache-aware FastConformer-Hybrid (around 114M parameters) with multiple look-ahead support, trained on large-scale English speech.
- These models are trained for streaming ASR which be used for streaming applications with a variety of latencies.
+ These models are trained for streaming ASR and can be used for streaming applications with a variety of latencies (0ms, 80ms, 480ms, 1040ms).
All models are hybrid with both Transducer and CTC decoders.
See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.
@@ -120,8 +120,7 @@ The list of the available models in this collection is shown in the following table.
## How to Use this Model

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for streaming or for fine-tuning on another dataset.
-
- To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest Pytorch version.
+ You will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
```
pip install nemo_toolkit['all']
```
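As a quick post-install check (a minimal sketch, not part of the original card), you can confirm the hybrid model class resolves and see which checkpoints are registered for it:

```python
import nemo.collections.asr as nemo_asr

# Print every pretrained checkpoint registered for the hybrid cache-aware class;
# "stt_en_fastconformer_hybrid_large_streaming_multi" should appear in releases that ship it.
for info in nemo_asr.models.EncDecHybridRNNTCTCBPEModel.list_available_models():
    print(info.pretrained_model_name)
```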
@@ -130,18 +129,36 @@ pip install nemo_toolkit['all']
(if it causes an error):
```
pip install nemo_toolkit[all]
```

+ ### Simulate Streaming ASR

You may use this script to simulate streaming ASR with these models: [cache-aware streaming simulation](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py).
- You may use --att_context_size to set the context size otherwise the default which is the first context size in the list is going to be used.
+ You may use --att_context_size to set the context size; otherwise the default, which is the first context size in the list (1040ms), is used.
+
+ ### Transcribing using Python
+ Cache-aware models are designed so that the model's predictions are the same in both offline and streaming mode.
+
+ So you may use the regular transcribe function to get the transcriptions. First, let's get a sample:
+
+ ```
+ wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
+ ```
+
+ Then simply do:
+
- ### Automatically load the model from NGC
-
- ```python
+ ```
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="stt_en_fastconformer_hybrid_large_streaming_multi")
+
+ # Optional: change the default latency. Default latency is 1040ms.
+ # Supported values: {[70,0]: 0ms, [70,1]: 80ms, [70,16]: 480ms, [70,33]: 1040ms}.
+ asr_model.encoder.set_default_att_context_size([70,33])
+
+ # Optional: change the default decoder. Default decoder is Transducer (RNNT). Supported decoders: {ctc, rnnt}.
+ asr_model.change_decoding_strategy(decoder_type='rnnt')
+
+ asr_model.transcribe(['2086-149220-0033.wav'])
```

- ### Transcribing text with this model
+ ### Transcribing many audio files

Using Transducer mode inference:
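The transducer-mode command itself is elided by this hunk; as a sketch, it plausibly mirrors the command tail shown in the next hunk, combined with flags this card already uses (the pretrained_name value here is an assumption):

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="stt_en_fastconformer_hybrid_large_streaming_multi" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 att_context_size=[70,0]
```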
@@ -168,6 +185,7 @@ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
att_context_size=[70,0]
```
+ Supported values for att_context_size: {[70,0]: 0ms, [70,1]: 80ms, [70,16]: 480ms, [70,33]: 1040ms}.

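To audition these latencies in streaming mode, the simulation script from the Simulate Streaming ASR section above can be invoked along these lines; this is a sketch only, and every flag except --att_context_size is an assumption to verify against the script's --help:

```shell
python [NEMO_GIT_FOLDER]/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
 --asr_model="stt_en_fastconformer_hybrid_large_streaming_multi" \
 --audio_file="2086-149220-0033.wav" \
 --att_context_size="[70,16]"
```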
### Input

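The spec in the next hunk calls for 16 kHz mono WAV input; here is a minimal conversion sketch (librosa and soundfile are assumptions, not dependencies named by this card):

```python
import librosa
import soundfile as sf

# Downmix to mono and resample to the 16 kHz rate the model expects.
audio, sr = librosa.load("any_input_audio.mp3", sr=16000, mono=True)

# Write a 16-bit PCM WAV that can be passed to asr_model.transcribe(...).
sf.write("input_16k_mono.wav", audio, sr, subtype="PCM_16")
```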
@@ -177,108 +195,6 @@ This model accepts 16000 Hz Mono-channel Audio (wav files) as input.

This model provides transcribed speech as a string for a given audio sample.

- ## Limitations
-
- Since this model was trained on publically available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
-
- ## References
-
- [1] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
-
- [2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
-
- [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
-
- [4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
-
- ## NVIDIA NeMo: Training
-
- To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest Pytorch version.
- ```
- pip install nemo_toolkit['all']
- '''
- '''
- (if it causes an error):
- pip install nemo_toolkit[all]
- ```
-
- ## How to Use this Model
-
- The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
-
- ### Automatically instantiate the model
-
- ```python
- import nemo.collections.asr as nemo_asr
- asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_en_conformer_transducer_xlarge")
- ```
-
- ### Transcribing using Python
- First, let's get a sample
- ```
- wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
- ```
- Then simply do:
- ```
- asr_model.transcribe(['2086-149220-0033.wav'])
- ```
-
- ### Transcribing many audio files
-
- ```shell
- python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
- pretrained_name="nvidia/stt_en_conformer_transducer_xlarge"
- audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
- ```
-
- ### Input
-
- This model accepts 16000 KHz Mono-channel Audio (wav files) as input.
-
- ### Output
-
- This model provides transcribed speech as a string for a given audio sample.
-
- ## Model Architecture
-
- Conformer-Transducer model is an autoregressive variant of Conformer model [1] for Automatic Speech Recognition which uses Transducer loss/decoding instead of CTC Loss. You may find more info on the detail of this model here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html).
-
- ## Training
-
- The NeMo toolkit [3] was used for training the models for over several hundred epochs. These model are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_transducer_bpe.yaml).
-
- The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
-
- ### Datasets
-
- All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising of several thousand hours of English speech:
-
- - Librispeech 960 hours of English speech
- - Fisher Corpus
- - Switchboard-1 Dataset
- - WSJ-0 and WSJ-1
- - National Speech Corpus (Part 1, Part 6)
- - VCTK
- - VoxPopuli (EN)
- - Europarl-ASR (EN)
- - Multilingual Librispeech (MLS EN) - 2,000 hrs subset
- - Mozilla Common Voice (v8.0)
- - People's Speech - 12,000 hrs subset
-
- Note: older versions of the model may have trained on smaller set of datasets.
-
- ## Performance
-
- The list of the available models in this collection is shown in the following table. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.
-
- | Version | Tokenizer | Vocabulary Size | LS test-other | LS test-clean | WSJ Eval92 | WSJ Dev93 | NSC Part 1 | MLS Test | MLS Dev | MCV Test 8.0 | Train Dataset |
- |---------|-----------------------|-----------------|---------------|---------------|------------|-----------|-----|-------|------|----|------|
- | 1.10.0 | SentencePiece Unigram | 1024 | 3.01 | 1.62 | 1.17 | 2.05 | 5.70 | 5.32 | 4.59 | 6.46 | NeMo ASRSET 3.0 |
-
## Limitations
Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

@@ -295,10 +211,12 @@ Although this model isn’t supported yet by Riva, the [list of supported models
Check out [Riva live demo](https://developer.nvidia.com/riva#demos).

## References

[1] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)

- [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
-
- ## Licence
-
- License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
+ [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+
+ [4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)