Automatic Speech Recognition · NeMo · PyTorch · 4 languages · automatic-speech-translation · speech · audio · Transformer · FastConformer · Conformer · hf-asr-leaderboard · Eval Results

Files changed (1): README.md (+69 -12)

README.md CHANGED
@@ -304,7 +304,7 @@ canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')
 
 # update decode params
 decode_cfg = canary_model.cfg.decoding
- decode_cfg.beam.beam_size = 5 # default is greedy with beam_size=1
+ decode_cfg.beam.beam_size = 1
 canary_model.change_decoding_strategy(decode_cfg)
 ```
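For reviewers who want to try the change above: a minimal end-to-end sketch. The `transcribe` call and its `paths2audio_files`/`batch_size` arguments follow NeMo's usual ASR model API as used elsewhere in this model card; treat them as assumptions rather than part of this diff.

```python
# Minimal sketch: load canary-1b and decode with the greedy setting from this PR.
from nemo.collections.asr.models import EncDecMultiTaskModel

canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# Update decode params: beam_size=1 selects greedy decoding.
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

# Transcribe a list of 16 kHz mono audio files (the path is a placeholder).
predicted_text = canary_model.transcribe(
    paths2audio_files=['/path/to/audio.wav'],
    batch_size=16,
)
print(predicted_text)
```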
 
@@ -332,10 +332,10 @@ Another recommended option is to use a json manifest as input, where each line i
 {
 "audio_filepath": "/path/to/audio.wav", # path to the audio file
 "duration": 10000.0, # duration of the audio
- "taskname": "asr", # use "s2t_translation" for AST
- "source_lang": "en", # Set `source_lang`=`target_lang` for ASR, choices=['en','de','es','fr']
- "target_lang": "de", # choices=['en','de','es','fr']
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "taskname": "asr", # use "ast" for speech-to-text translation
+ "source_lang": "en", # Set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
+ "target_lang": "en", # Language of the text output, choices=['en','de','es','fr']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
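The manifest format above is plain JSON lines, so it can be generated with the standard library alone. A small sketch; the output file name and audio path are placeholders:

```python
# Sketch: write a JSON-lines manifest with the fields described above.
import json

entries = [
    {
        "audio_filepath": "/path/to/audio.wav",  # placeholder path
        "duration": 10000.0,                     # duration of the audio
        "taskname": "asr",                       # "ast" for speech-to-text translation
        "source_lang": "en",                     # source_lang == target_lang for ASR
        "target_lang": "en",
        "pnc": "yes",                            # "yes" or "no"
    },
]

with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line
```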
 
@@ -367,7 +367,7 @@ An example manifest for transcribing English audios can be:
 "taskname": "asr",
 "source_lang": "en",
 "target_lang": "en",
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
 
@@ -381,10 +381,10 @@ An example manifest for transcribing English audios into German text can be:
 {
 "audio_filepath": "/path/to/audio.wav", # path to the audio file
 "duration": 10000.0, # duration of the audio
- "taskname": "s2t_translation",
+ "taskname": "ast",
 "source_lang": "en",
 "target_lang": "de",
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
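Assuming the En->De manifest above is saved as `input_manifest.json` (a hypothetical name), a translation run would look roughly like this; passing a manifest path straight to `transcribe` matches the usage shown elsewhere in this card, but treat the exact signature as an assumption:

```python
# Sketch: En->De speech translation driven by a manifest (file name is hypothetical).
from nemo.collections.asr.models import EncDecMultiTaskModel

canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# Each manifest row carries taskname="ast", source_lang="en", target_lang="de".
translations = canary_model.transcribe("input_manifest.json", batch_size=16)
print(translations)
```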
 
@@ -401,7 +401,8 @@ The model outputs the transcribed/translated text corresponding to the input aud
 
 ## Training
 
- Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs in 24 hrs. The model can be trained using this example script and base config.
+ Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs.
+ The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/canary-2/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/canary-2/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
@@ -410,6 +411,38 @@ The tokenizers for these models were built using the text transcripts of the tra
 
 The Canary-1B model is trained on a total of 85k hrs of speech data. It consists of 31k hrs of public data, 20k hrs collected by [Suno](https://suno.ai/), and 34k hrs of in-house data.
 
+ The constituents of public data are as follows.
+ 
+ #### English (25.5k hours)
+ - Librispeech 960 hours
+ - Fisher Corpus
+ - Switchboard-1 Dataset
+ - WSJ-0 and WSJ-1
+ - National Speech Corpus (Part 1, Part 6)
+ - VCTK
+ - VoxPopuli (EN)
+ - Europarl-ASR (EN)
+ - Multilingual Librispeech (MLS EN) - 2,000 hour subset
+ - Mozilla Common Voice (v7.0)
+ - People's Speech - 12,000 hour subset
+ - Mozilla Common Voice (v11.0) - 1,474 hour subset
+ 
+ #### German (2.5k hours)
+ - Mozilla Common Voice (v12.0) - 800 hour subset
+ - Multilingual Librispeech (MLS DE) - 1,500 hour subset
+ - VoxPopuli (DE) - 200 hour subset
+ 
+ #### Spanish (1.4k hours)
+ - Mozilla Common Voice (v12.0) - 395 hour subset
+ - Multilingual Librispeech (MLS ES) - 780 hour subset
+ - VoxPopuli (ES) - 108 hour subset
+ - Fisher - 141 hour subset
+ 
+ #### French (1.8k hours)
+ - Mozilla Common Voice (v12.0) - 708 hour subset
+ - Multilingual Librispeech (MLS FR) - 926 hour subset
+ - VoxPopuli (FR) - 165 hour subset
+ 
 
 ## Performance
 
@@ -417,23 +450,47 @@ In both ASR and AST experiments, predictions were generated using beam search wi
 
 ### ASR Performance (w/o PnC)
 
- The ASR performance is measured with word error rate (WER) on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test sets on four languages, and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
+ The ASR performance is measured with word error rate (WER), and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
+ 
+ WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
 
 | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|
 | 1.23.0 | canary-1b | 7.97 | 4.61 | 3.99 | 6.53 |
 
+ WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
+ 
+ | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
+ |:---------:|:-----------:|:------:|:------:|:------:|:------:|
+ | 1.23.0 | canary-1b | 3.06 | 4.19 | 3.15 | 4.12 |
+ 
 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
 
 ### AST Performance
 
- We evaluate AST performance with BLEU score on the [FLEURS](https://huggingface.co/datasets/google/fleurs) test sets on four languages and use their native annotations with punctuation and capitalization.
+ We evaluate AST performance with BLEU score, using the test sets' native annotations with punctuation and capitalization.
+ 
+ BLEU score on [FLEURS](https://huggingface.co/datasets/google/fleurs) test set:
 
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
- | 1.23.0 | canary-1b | 22.66 | 41.11 | 40.76 | 32.64 | 32.15 | 23.57 |
+ | 1.23.0 | canary-1b | 22.66 | 41.11 | 40.76 | 32.64 | 32.15 | 23.57 |
+ 
+ BLEU score on [COVOST-v2](https://github.com/facebookresearch/covost) test set:
+ 
+ | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|
+ | 1.23.0 | canary-1b | 37.67 | 40.7 | 40.42 |
+ 
+ BLEU score on [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
+ 
+ | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|
+ | 1.23.0 | canary-1b | 23.84 | 35.74 | 28.29 |
+ 
 
 ## NVIDIA Riva: Deployment
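For context on how numbers like the ones above are produced: a rough sketch of the scoring recipe this section describes. Only whisper-normalizer is named in the diff; `jiwer` for WER and `sacrebleu` for BLEU are assumed stand-ins, not tools the card specifies.

```python
# Sketch of the evaluation recipe: normalize, then score WER (ASR) and BLEU (AST).
# jiwer and sacrebleu are assumed choices; the card only names whisper-normalizer.
import jiwer
import sacrebleu
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

refs = ["The quick brown fox."]  # groundtruth transcripts / translations
hyps = ["the quick brown fox"]   # model predictions

# ASR: WER on normalized text, matching the "w/o PnC" setup above.
wer = jiwer.wer([normalizer(r) for r in refs], [normalizer(h) for h in hyps])
print(f"WER: {wer:.2%}")

# AST: corpus BLEU on unnormalized text, keeping punctuation and capitalization.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU: {bleu.score:.2f}")
```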
 