vnoroozi committed
Commit b7f7fce
1 Parent(s): ba6fc81

Update README.md

Files changed (1): README.md (+125 −29)
README.md CHANGED
@@ -3,7 +3,19 @@ language:
 - en
 library_name: nemo
 datasets:
-- AISHELL-2
+- librispeech_asr
+- fisher_corpus
+- Switchboard-1
+- WSJ-0
+- WSJ-1
+- National Singapore Corpus Part 1
+- National Singapore Corpus Part 6
+- vctk
+- VoxPopuli (EN)
+- Europarl-ASR (EN)
+- Multilingual LibriSpeech (2000 hours)
+- mozilla-foundation/common_voice_8_0
+- MLCommons/peoples_speech
 thumbnail: null
 tags:
 - automatic-speech-recognition
@@ -16,54 +28,123 @@ tags:
 - NeMo
 - hf-asr-leaderboard
 license: cc-by-4.0
+widget:
+- example_title: Librispeech sample 1
+  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
+- example_title: Librispeech sample 2
+  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
 model-index:
-- name: stt_zh_conformer_transducer_large
+- name: stt_en_conformer_transducer_xlarge
   results:
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
-      name: AISHELL-2 Test IOS
-      type: aishell2_test_ios
-      config: Mandarin
+      name: LibriSpeech (clean)
+      type: librispeech_asr
+      config: clean
       split: test
       args:
-        language: zh
+        language: en
     metrics:
     - name: Test WER
       type: wer
-      value: 5.3
+      value: 1.62
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
     dataset:
-      name: AISHELL-2 Test Android
-      type: aishell2_test_android
-      config: Mandarin
+      name: LibriSpeech (other)
+      type: librispeech_asr
+      config: other
       split: test
       args:
-        language: zh
+        language: en
     metrics:
     - name: Test WER
       type: wer
-      value: 5.7
+      value: 3.01
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
     dataset:
-      name: AISHELL-2 Test Mic
-      type: aishell2_test_mic
-      config: Mandarin
+      name: Multilingual LibriSpeech
+      type: facebook/multilingual_librispeech
+      config: english
       split: test
       args:
-        language: zh
+        language: en
     metrics:
     - name: Test WER
       type: wer
-      value: 5.6
+      value: 5.32
+  - task:
+      type: Automatic Speech Recognition
+      name: automatic-speech-recognition
+    dataset:
+      name: Mozilla Common Voice 7.0
+      type: mozilla-foundation/common_voice_7_0
+      config: en
+      split: test
+      args:
+        language: en
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 5.13
+  - task:
+      type: Automatic Speech Recognition
+      name: automatic-speech-recognition
+    dataset:
+      name: Mozilla Common Voice 8.0
+      type: mozilla-foundation/common_voice_8_0
+      config: en
+      split: test
+      args:
+        language: en
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 6.46
+  - task:
+      type: Automatic Speech Recognition
+      name: automatic-speech-recognition
+    dataset:
+      name: Wall Street Journal 92
+      type: wsj_0
+      args:
+        language: en
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 1.17
+  - task:
+      type: Automatic Speech Recognition
+      name: automatic-speech-recognition
+    dataset:
+      name: Wall Street Journal 93
+      type: wsj_1
+      args:
+        language: en
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 2.05
+  - task:
+      type: Automatic Speech Recognition
+      name: automatic-speech-recognition
+    dataset:
+      name: National Singapore Corpus
+      type: nsc_part_1
+      args:
+        language: en
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 5.70
 ---
 
-# NVIDIA Conformer-Transducer Large (zh-ZH)
+# NVIDIA Conformer-Transducer X-Large (en-US)
 
 <style>
 img {
@@ -72,12 +153,12 @@ img {
 </style>
 
 | [![Model architecture](https://img.shields.io/badge/Model_Arch-Conformer--Transducer-lightgrey#model-badge)](#model-architecture)
-| [![Model size](https://img.shields.io/badge/Params-120M-lightgrey#model-badge)](#model-architecture)
-| [![Language](https://img.shields.io/badge/Language-zh--ZH-lightgrey#model-badge)](#datasets)
+| [![Model size](https://img.shields.io/badge/Params-600M-lightgrey#model-badge)](#model-architecture)
+| [![Language](https://img.shields.io/badge/Language-en--US-lightgrey#model-badge)](#datasets)
 
 
-This model transcribes speech in Mandarin alphabet.
-It is a large version of Conformer-Transducer (around 120M parameters) model.
+This model transcribes speech into the lowercase English alphabet, along with spaces and apostrophes.
+It is the "extra-large" version of the Conformer-Transducer model (around 600M parameters).
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.
 
 ## NVIDIA NeMo: Training
@@ -95,7 +176,7 @@ The model is available for use in the NeMo toolkit [3], and can be used as a pre
 
 ```python
 import nemo.collections.asr as nemo_asr
-asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")
+asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_en_conformer_transducer_xlarge")
 ```
 
 ### Transcribing using Python
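
The diff elides the body of the "Transcribing using Python" section, so for context here is a minimal end-to-end sketch under stated assumptions: `audio.wav` is a hypothetical placeholder for a 16 kHz mono WAV file, and the exact return type of `transcribe()` varies across NeMo versions.

```python
import nemo.collections.asr as nemo_asr

# Download the checkpoint named in this commit from the Hugging Face Hub
# (cached after the first call).
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_en_conformer_transducer_xlarge")

# "audio.wav" is a hypothetical placeholder for a 16 kHz mono WAV file.
# Depending on the NeMo version, RNNT models may return a (best, n-best)
# tuple or a list of hypotheses.
transcriptions = asr_model.transcribe(["audio.wav"])
print(transcriptions)
```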
@@ -112,7 +193,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
-pretrained_name="nvidia/stt_zh_conformer_transducer_large" \
+pretrained_name="nvidia/stt_en_conformer_transducer_xlarge" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
 
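The CLI above transcribes a directory of files. Batch jobs are often driven from a NeMo-style manifest instead (JSON Lines with `audio_filepath`, `duration`, and `text` fields). A sketch of building one; the file names and durations are hypothetical:

```python
import json

# Hypothetical 16 kHz mono WAV files and their durations in seconds.
files = [("sample1.wav", 7.4), ("sample2.wav", 12.1)]

# NeMo manifests are JSON Lines; "text" may be left empty for inference-only use.
with open("manifest.json", "w") as fout:
    for path, duration in files:
        entry = {"audio_filepath": path, "duration": duration, "text": ""}
        fout.write(json.dumps(entry) + "\n")
```

Recent versions of `transcribe_speech.py` also accept a `dataset_manifest=...` override in place of `audio_dir=...`; verify against the script in your NeMo checkout.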
@@ -132,17 +213,33 @@ Conformer-Transducer model is an autoregressive variant of Conformer model [1] f
 
 The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_transducer_bpe.yaml).
 
+The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
+
 ### Datasets
 
-All the models in this collection are trained on AISHELL2 [4] comprising of Mandarin speech:
+All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising several thousand hours of English speech:
+
+- LibriSpeech (960 hours of English speech)
+- Fisher Corpus
+- Switchboard-1 Dataset
+- WSJ-0 and WSJ-1
+- National Speech Corpus (Part 1, Part 6)
+- VCTK
+- VoxPopuli (EN)
+- Europarl-ASR (EN)
+- Multilingual LibriSpeech (MLS EN) - 2,000 hour subset
+- Mozilla Common Voice (v8.0)
+- People's Speech - 12,000 hour subset
+
+Note: older versions of the model may have been trained on a smaller set of datasets.
 
 ## Performance
 
 The list of the available models in this collection is shown in the following table. Performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.
 
-| Version | Tokenizer | Vocabulary Size | AISHELL2 Test IOS | AISHELL2 Test Android | AISHELL2 Test Mic | Train Dataset |
-|---------|-----------|-----------------|-------------------|-----------------------|-------------------|---------------|
-| 1.10.0 | Characters | 1024 | 5.3 | 5.7 | 5.6 | AISHELL-2 |
+| Version | Tokenizer | Vocabulary Size | LS test-other | LS test-clean | WSJ Eval92 | WSJ Dev93 | NSC Part 1 | MLS Test | MLS Dev | MCV Test 8.0 | Train Dataset |
+|---------|-----------------------|-----------------|---------------|---------------|------------|-----------|------------|----------|---------|--------------|-----------------|
+| 1.10.0 | SentencePiece Unigram | 1024 | 3.01 | 1.62 | 1.17 | 2.05 | 5.70 | 5.32 | 4.59 | 6.46 | NeMo ASRSET 3.0 |
 
 ## Limitations
 Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
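
The commit adds a pointer to NeMo's `process_asr_text_tokenizer.py` for building the tokenizer. As a rough illustration of what that step produces, here is a hedged sketch using the `sentencepiece` library directly; `transcripts.txt` is a placeholder for a file with one training transcript per line, and the real NeMo script adds its own bookkeeping on top:

```python
import sentencepiece as spm

# Train a unigram SentencePiece tokenizer on plain-text transcripts,
# approximating what NeMo's process_asr_text_tokenizer.py drives under the hood.
# "transcripts.txt" is a placeholder: one training transcript per line.
spm.SentencePieceTrainer.train(
    input="transcripts.txt",
    model_prefix="tokenizer",
    vocab_size=1024,       # matches the vocabulary size in the performance table
    model_type="unigram",  # the card reports a SentencePiece Unigram tokenizer
)
```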
@@ -163,7 +260,6 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
 [1] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
 [2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
 [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
-[4] [AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale](https://arxiv.org/abs/1808.10583)
 
 ## Licence
 
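The updated table reports WER% with greedy decoding. For reference, a small sketch of scoring hypotheses against references with NeMo's `word_error_rate` helper; the toy strings below are illustrative only, and the import path should be verified against your installed NeMo version:

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Toy hypothesis/reference pair, purely for illustration.
hypotheses = ["the cat sat on the mat"]
references = ["the cat sat on a mat"]

# word_error_rate returns a fraction; multiply by 100 for WER%.
print(f"WER: {word_error_rate(hypotheses=hypotheses, references=references) * 100:.2f}%")
```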