Automatic Speech Recognition · NeMo · PyTorch · 4 languages · automatic-speech-translation · speech · audio · Transformer · FastConformer · Conformer · hf-asr-leaderboard · Eval Results

Files changed (1): README.md (+69 -12)

README.md CHANGED
@@ -304,7 +304,7 @@ canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')
 
 # update decode params
 decode_cfg = canary_model.cfg.decoding
- decode_cfg.beam.beam_size = 5 # default is greedy with beam_size=1
+ decode_cfg.beam.beam_size = 1
 canary_model.change_decoding_strategy(decode_cfg)
 ```
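For reviewers who want to try the change above: a minimal end-to-end sketch. The `transcribe` call and its `paths2audio_files`/`batch_size` arguments follow NeMo's usual ASR model API as used elsewhere in this model card; treat them as assumptions rather than part of this diff.

```python
# Minimal sketch: load canary-1b and decode with the greedy setting from this PR.
from nemo.collections.asr.models import EncDecMultiTaskModel

canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# Update decode params: beam_size=1 selects greedy decoding.
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

# Transcribe a list of 16 kHz mono audio files (the path is a placeholder).
predicted_text = canary_model.transcribe(
    paths2audio_files=['/path/to/audio.wav'],
    batch_size=16,
)
print(predicted_text)
```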
 
@@ -332,10 +332,10 @@ Another recommended option is to use a json manifest as input, where each line i
 {
 "audio_filepath": "/path/to/audio.wav", # path to the audio file
 "duration": 10000.0, # duration of the audio
- "taskname": "asr", # use "s2t_translation" for AST
- "source_lang": "en", # Set `source_lang`=`target_lang` for ASR, choices=['en','de','es','fr']
- "target_lang": "de", # choices=['en','de','es','fr']
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "taskname": "asr", # use "ast" for speech-to-text translation
+ "source_lang": "en", # Set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
+ "target_lang": "en", # Language of the text output, choices=['en','de','es','fr']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
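The manifest format above is plain JSON lines, so it can be generated with the standard library alone. A small sketch; the output file name and audio path are placeholders:

```python
# Sketch: write a JSON-lines manifest with the fields described above.
import json

entries = [
    {
        "audio_filepath": "/path/to/audio.wav",  # placeholder path
        "duration": 10000.0,                     # duration of the audio
        "taskname": "asr",                       # "ast" for speech-to-text translation
        "source_lang": "en",                     # source_lang == target_lang for ASR
        "target_lang": "en",
        "pnc": "yes",                            # "yes" or "no"
    },
]

with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line
```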
 
@@ -367,7 +367,7 @@ An example manifest for transcribing English audios can be:
 "taskname": "asr",
 "source_lang": "en",
 "target_lang": "en",
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
 
@@ -381,10 +381,10 @@ An example manifest for transcribing English audios into German text can be:
 {
 "audio_filepath": "/path/to/audio.wav", # path to the audio file
 "duration": 10000.0, # duration of the audio
- "taskname": "s2t_translation",
+ "taskname": "ast",
 "source_lang": "en",
 "target_lang": "de",
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
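Assuming the En->De manifest above is saved as `input_manifest.json` (a hypothetical name), a translation run would look roughly like this; passing a manifest path straight to `transcribe` matches the usage shown elsewhere in this card, but treat the exact signature as an assumption:

```python
# Sketch: En->De speech translation driven by a manifest (file name is hypothetical).
from nemo.collections.asr.models import EncDecMultiTaskModel

canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# Each manifest row carries taskname="ast", source_lang="en", target_lang="de".
translations = canary_model.transcribe("input_manifest.json", batch_size=16)
print(translations)
```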
 
@@ -401,7 +401,8 @@ The model outputs the transcribed/translated text corresponding to the input aud
 
 ## Training
 
- Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs in 24 hrs. The model can be trained using this example script and base config.
+ Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs.
+ The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/canary-2/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/canary-2/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
@@ -410,6 +411,38 @@ The tokenizers for these models were built using the text transcripts of the tra
 
 The Canary-1B model is trained on a total of 85k hrs of speech data. It consists of 31k hrs of public data, 20k hrs collected by [Suno](https://suno.ai/), and 34k hrs of in-house data.
 
+ The constituents of public data are as follows.
+ 
+ #### English (25.5k hours)
+ - Librispeech 960 hours
+ - Fisher Corpus
+ - Switchboard-1 Dataset
+ - WSJ-0 and WSJ-1
+ - National Speech Corpus (Part 1, Part 6)
+ - VCTK
+ - VoxPopuli (EN)
+ - Europarl-ASR (EN)
+ - Multilingual Librispeech (MLS EN) - 2,000 hour subset
+ - Mozilla Common Voice (v7.0)
+ - People's Speech - 12,000 hour subset
+ - Mozilla Common Voice (v11.0) - 1,474 hour subset
+ 
+ #### German (2.5k hours)
+ - Mozilla Common Voice (v12.0) - 800 hour subset
+ - Multilingual Librispeech (MLS DE) - 1,500 hour subset
+ - VoxPopuli (DE) - 200 hour subset
+ 
+ #### Spanish (1.4k hours)
+ - Mozilla Common Voice (v12.0) - 395 hour subset
+ - Multilingual Librispeech (MLS ES) - 780 hour subset
+ - VoxPopuli (ES) - 108 hour subset
+ - Fisher - 141 hour subset
+ 
+ #### French (1.8k hours)
+ - Mozilla Common Voice (v12.0) - 708 hour subset
+ - Multilingual Librispeech (MLS FR) - 926 hour subset
+ - VoxPopuli (FR) - 165 hour subset
+ 
 
 ## Performance
 
@@ -417,23 +450,47 @@ In both ASR and AST experiments, predictions were generated using beam search wi
 
 ### ASR Performance (w/o PnC)
 
- The ASR performance is measured with word error rate (WER) on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test sets on four languages, and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
+ The ASR performance is measured with word error rate (WER), and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
+ 
+ WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
 
 | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|
 | 1.23.0 | canary-1b | 7.97 | 4.61 | 3.99 | 6.53 |
 
+ WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
+ 
+ | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
+ |:---------:|:-----------:|:------:|:------:|:------:|:------:|
+ | 1.23.0 | canary-1b | 3.06 | 4.19 | 3.15 | 4.12 |
+ 
 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
 
 ### AST Performance
 
- We evaluate AST performance with BLEU score on the [FLEURS](https://huggingface.co/datasets/google/fleurs) test sets on four languages and use their native annotations with punctuation and capitalization.
+ We evaluate AST performance with BLEU score, using the test sets' native annotations with punctuation and capitalization.
+ 
+ BLEU score on [FLEURS](https://huggingface.co/datasets/google/fleurs) test set:
 
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
- | 1.23.0 | canary-1b | 22.66 | 41.11 | 40.76 | 32.64 | 32.15 | 23.57 |
+ | 1.23.0 | canary-1b | 22.66 | 41.11 | 40.76 | 32.64 | 32.15 | 23.57 |
+ 
+ BLEU score on [COVOST-v2](https://github.com/facebookresearch/covost) test set:
+ 
+ | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|
+ | 1.23.0 | canary-1b | 37.67 | 40.7 | 40.42 |
+ 
+ BLEU score on [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
+ 
+ | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|
+ | 1.23.0 | canary-1b | 23.84 | 35.74 | 28.29 |
+ 
 
 ## NVIDIA Riva: Deployment
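For context on how numbers like the ones above are produced: a rough sketch of the scoring recipe this section describes. Only whisper-normalizer is named in the diff; `jiwer` for WER and `sacrebleu` for BLEU are assumed stand-ins, not tools the card specifies.

```python
# Sketch of the evaluation recipe: normalize, then score WER (ASR) and BLEU (AST).
# jiwer and sacrebleu are assumed choices; the card only names whisper-normalizer.
import jiwer
import sacrebleu
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

refs = ["The quick brown fox."]  # groundtruth transcripts / translations
hyps = ["the quick brown fox"]   # model predictions

# ASR: WER on normalized text, matching the "w/o PnC" setup above.
wer = jiwer.wer([normalizer(r) for r in refs], [normalizer(h) for h in hyps])
print(f"WER: {wer:.2%}")

# AST: corpus BLEU on unnormalized text, keeping punctuation and capitalization.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU: {bleu.score:.2f}")
```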
 