smajumdar94 igitman committed on
Commit
ccbc249
1 Parent(s): ad08092

Update the model (#6)

- Update the model after dropping synthetic P&C in MLS (5229ab171d354c867f03fb73079225d56ddb5299)


Co-authored-by: Igor Gitman <igitman@users.noreply.huggingface.co>

README.md CHANGED
@@ -35,7 +35,7 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 5.67
+ value: 5.64
  - task:
  type: Automatic Speech Recognition
  name: automatic-speech-recognition
@@ -49,7 +49,7 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 11.11
+ value: 11.39
  - task:
  type: Automatic Speech Recognition
  name: speech-recognition
@@ -63,7 +63,7 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 16.16
+ value: 16.22
  - task:
  type: Automatic Speech Recognition
  name: speech-recognition
@@ -77,7 +77,7 @@ model-index:
  metrics:
  - name: Test WER P&C
  type: wer
- value: 8.14
+ value: 8.11
  - task:
  type: Automatic Speech Recognition
  name: automatic-speech-recognition
@@ -91,7 +91,7 @@ model-index:
  metrics:
  - name: Test WER P&C
  type: wer
- value: 22.06
+ value: 18.27
  - task:
  type: Automatic Speech Recognition
  name: speech-recognition
@@ -105,7 +105,7 @@ model-index:
  metrics:
  - name: Test WER P&C
  type: wer
- value: 19.96
+ value: 19.97
  ---
  # NVIDIA FastConformer-Hybrid Large (it)
 
@@ -191,9 +191,9 @@ The tokenizers for these models were built using the text transcripts of the tra
 
  The model in this collection are trained on a composite dataset (NeMo PnC IT ASRSET) comprising of 487 hours of Italian speech:
 
- - Mozilla Common Voice 12.0 (Italian) - 220 hours after data cleaning
- - Multilingual LibriSpeech (Italian) - 214 hours after data cleaning
- - VoxPopuli transcribed subset (Italian) - 53 hours after data cleaning
+ - Mozilla Common Voice 12.0 (Italian) - 220 hours after data cleaning. [Speech Data Processor](https://github.com/NVIDIA/NeMo-speech-data-processor) config used to prepare this data is [here](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/italian/mcv/config.yaml).
+ - Multilingual LibriSpeech (Italian) - 214 hours after data cleaning. [Speech Data Processor](https://github.com/NVIDIA/NeMo-speech-data-processor) config used to prepare this data is [here](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/italian/mls/config.yaml).
+ - VoxPopuli transcribed subset (Italian) - 53 hours after data cleaning. [Speech Data Processor](https://github.com/NVIDIA/NeMo-speech-data-processor) config used to prepare this data is [here](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/italian/voxpopuli/config.yaml).
 
  ## Performance
 
@@ -206,15 +206,16 @@ a) On data without Punctuation and Capitalization
 
  | Version | Tokenizer | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev | MLS Test | VoxPopuli Dev | VoxPopuli Test |
  |---------|-----------------------|-----------------|--------------|---------------|---------|----------|---------------|----------------|
- | 1.20.0 | SentencePiece BPE | 512 | 5.13% | 5.67% | 13.16% | 11.11% | 12.92% | 16.16% |
+ | 1.20.0 | SentencePiece BPE | 512 | 5.19% | 5.64% | 13.01% | 11.39% | 13.02% | 16.22% |
 
 
  b) On data with Punctuation and Capitalization
 
- | Version | Tokenizer | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev | MLS Test | VoxPopuli Dev | VoxPopuli Test |
- |---------|-----------------------|-----------------|--------------|---------------|---------|----------|---------------|----------------|
- | 1.20.0 | SentencePiece BPE | 512 | 7.66% | 8.14% | 26.48% | 22.06% | 16.91% | 19.96% |
+ | Version | Tokenizer | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev\* | MLS Test\* | VoxPopuli Dev | VoxPopuli Test |
+ |---------|-----------------------|-----------------|--------------|---------------|-----------|------------|---------------|----------------|
+ | 1.20.0 | SentencePiece BPE | 512 | 7.70% | 8.11% | 21.69% | 18.27% | 16.96% | 19.97% |
 
+ \* We use only a subset of dev/test sets with P&C restored from the original books
 
  ## Limitations
  Since this model was trained on publically available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech. The model only outputs the punctuations: ```'.', ',', '?' ``` and hence might not do well in scenarios where other punctuations are also expected.
stt_it_fastconformer_hybrid_large_pc.nemo CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1bf97c6b148d20c10dea8f950da3359244c7fb9681994153e41cda8c276e77ea
+ oid sha256:6db62aeda2dd05fe99e827f734e3b94f73b59f69f2a012e46668451f292baecb
  size 455505920