nvidia/stt_it_fastconformer_hybrid_large_pc

@@ -35,7 +35,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 5.67
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -49,7 +49,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 11.11
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
@@ -63,7 +63,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 16.16
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
@@ -77,7 +77,7 @@ model-index:
     metrics:
     - name: Test WER P&C
       type: wer
-      value: 8.14
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -91,7 +91,7 @@ model-index:
     metrics:
     - name: Test WER P&C
       type: wer
-      value: 22.06
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
@@ -105,7 +105,7 @@ model-index:
     metrics:
     - name: Test WER P&C
       type: wer
-      value: 19.96
 ---
 # NVIDIA FastConformer-Hybrid Large (it)
@@ -191,9 +191,9 @@ The tokenizers for these models were built using the text transcripts of the tra
 The model in this collection is trained on a composite dataset (NeMo PnC IT ASRSET) comprising 487 hours of Italian speech:
-- Mozilla Common Voice 12.0 (Italian) - 220 hours after data cleaning
-- Multilingual LibriSpeech (Italian) - 214 hours after data cleaning
-- VoxPopuli transcribed subset (Italian) - 53 hours after data cleaning
 ## Performance
@@ -206,15 +206,16 @@ a) On data without Punctuation and Capitalization
 | Version | Tokenizer             | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev | MLS Test | VoxPopuli Dev | VoxPopuli Test |
 |---------|-----------------------|-----------------|--------------|---------------|---------|----------|---------------|----------------|
-| 1.20.0  | SentencePiece BPE     | 512             | 5.13%        | 5.67%         | 13.16%  | 11.11%   | 12.92%        | 16.16%         |
 b) On data with Punctuation and Capitalization
-| Version | Tokenizer             | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev | MLS Test | VoxPopuli Dev | VoxPopuli Test |
-|---------|-----------------------|-----------------|--------------|---------------|---------|----------|---------------|----------------|
-| 1.20.0  | SentencePiece BPE     | 512             | 7.66%        | 8.14%         | 26.48%  | 22.06%   | 16.91%        | 19.96%         |
 ## Limitations
 Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech. The model only outputs the punctuation marks ```'.', ',', '?' ``` and hence might not do well in scenarios where other punctuation marks are also expected.

     metrics:
     - name: Test WER
       type: wer
+      value: 5.64
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
     metrics:
     - name: Test WER
       type: wer
+      value: 11.39
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
     metrics:
     - name: Test WER
       type: wer
+      value: 16.22
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
     metrics:
     - name: Test WER P&C
       type: wer
+      value: 8.11
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
     metrics:
     - name: Test WER P&C
       type: wer
+      value: 18.27
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
     metrics:
     - name: Test WER P&C
       type: wer
+      value: 19.97
 ---
 # NVIDIA FastConformer-Hybrid Large (it)
 The model in this collection is trained on a composite dataset (NeMo PnC IT ASRSET) comprising 487 hours of Italian speech:
+- Mozilla Common Voice 12.0 (Italian) - 220 hours after data cleaning. [Speech Data Processor](https://github.com/NVIDIA/NeMo-speech-data-processor) config used to prepare this data is [here](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/italian/mcv/config.yaml).
+- Multilingual LibriSpeech (Italian) - 214 hours after data cleaning. [Speech Data Processor](https://github.com/NVIDIA/NeMo-speech-data-processor) config used to prepare this data is [here](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/italian/mls/config.yaml).
+- VoxPopuli transcribed subset (Italian) - 53 hours after data cleaning. [Speech Data Processor](https://github.com/NVIDIA/NeMo-speech-data-processor) config used to prepare this data is [here](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/italian/voxpopuli/config.yaml).
 ## Performance
 | Version | Tokenizer             | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev | MLS Test | VoxPopuli Dev | VoxPopuli Test |
 |---------|-----------------------|-----------------|--------------|---------------|---------|----------|---------------|----------------|
+| 1.20.0  | SentencePiece BPE     | 512             | 5.19%        | 5.64%         | 13.01%  | 11.39%   | 13.02%        | 16.22%         |
 b) On data with Punctuation and Capitalization
+| Version | Tokenizer             | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev\* | MLS Test\* | VoxPopuli Dev | VoxPopuli Test |
+|---------|-----------------------|-----------------|--------------|---------------|-----------|------------|---------------|----------------|
+| 1.20.0  | SentencePiece BPE     | 512             | 7.70%        | 8.11%         | 21.69%    | 18.27%     | 16.96%        | 19.97%         |
+\* We use only a subset of dev/test sets with P&C restored from the original books
 ## Limitations
 Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech. The model only outputs the punctuation marks ```'.', ',', '?' ``` and hence might not do well in scenarios where other punctuation marks are also expected.
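
The card reports WER both on plain text and on text with punctuation and capitalization (P&C). As a rough illustration of why the P&C numbers are higher, here is a minimal sketch of word-level WER scored both ways. The `strip_pnc` normalization is an assumption made for this example (lowercasing and dropping only the three marks the model emits); the card does not state the exact normalization used to produce the reported figures, and the sample sentences are made up.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def strip_pnc(text: str) -> str:
    """Hypothetical normalization: lowercase and remove '.', ',' and '?'."""
    return re.sub(r"[.,?]", "", text).lower()


# Made-up example pair: the hypothesis misses the capital letter and the comma.
reference = "Ciao, come stai?"
hypothesis = "ciao come stai?"

wer_pnc = word_error_rate(reference, hypothesis)
wer_plain = word_error_rate(strip_pnc(reference), strip_pnc(hypothesis))
print(f"WER with P&C: {wer_pnc:.2%}, WER without P&C: {wer_plain:.2%}")
```

With P&C scoring, a missing comma or capital letter counts as a full word error, which is consistent with the P&C table showing higher WER than the plain-text table on every set.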

stt_it_fastconformer_hybrid_large_pc.nemo

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1bf97c6b148d20c10dea8f950da3359244c7fb9681994153e41cda8c276e77ea
 size 455505920

 version https://git-lfs.github.com/spec/v1
+oid sha256:6db62aeda2dd05fe99e827f734e3b94f73b59f69f2a012e46668451f292baecb
 size 455505920
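
The checkpoint itself is stored through Git LFS, so the diff above only shows the pointer file: the `oid` changed while the `size` stayed at 455505920 bytes. A minimal sketch of checking a downloaded `.nemo` file against such a pointer follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and written for this example; they are not part of Git LFS or NeMo.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict) -> bool:
    """Return True if the file's size and hash agree with the pointer."""
    file_path = Path(path)
    if file_path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with file_path.open("rb") as handle:
        # Read in 1 MiB chunks so a large checkpoint is never fully in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents taken from the new side of the diff above.
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:6db62aeda2dd05fe99e827f734e3b94f73b59f69f2a012e46668451f292baecb
size 455505920
"""

pointer = parse_lfs_pointer(NEW_POINTER)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("stt_it_fastconformer_hybrid_large_pc.nemo", pointer)` on a local copy would then indicate whether it is the updated checkpoint rather than the earlier one, since both have the same size and differ only in hash.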