Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -107,6 +107,7 @@ img {
107
  NVIDIA NeMo Canary is a family of multi-lingual multi-tasking models that achieves state-of-the art performance on multiple benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC).
108
 
109
  ## Model Architecture
 
110
  Canary is an encoder-decoder model with FastConformer [1] encoder and Transformer Decoder [2].
111
  With audio features extracted from the encoder, task tokens such as `<source language>`, `<target language>`, `<task>` and `<toggle PnC>`
112
  are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer from individual
@@ -248,11 +249,11 @@ The training data contains 43K hours of English speech collected and prepared by
248
 
249
  ## Performance
250
 
251
- The ASR performance is measured with word error rate (WER) on different datasets, whereas the AST performance is measured with BLEU score. Predictions were generated using beam search with width 5 and length penalty 1.0.
252
 
253
  ### ASR Performance (w/o PnC)
254
 
255
- We use [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test sets on four languages, and process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
256
 
257
 
258
  | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
@@ -264,7 +265,7 @@ More details on evaluation can be found at [HuggingFace ASR Leaderboard](https:/
264
 
265
  ### AST Performance
266
 
267
- We evaluate on the FLEURS test sets and use the native annotations with punctuation and capitalization.
268
 
269
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
270
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
 
107
  NVIDIA NeMo Canary is a family of multi-lingual multi-tasking models that achieves state-of-the art performance on multiple benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC).
108
 
109
  ## Model Architecture
110
+
111
  Canary is an encoder-decoder model with FastConformer [1] encoder and Transformer Decoder [2].
112
  With audio features extracted from the encoder, task tokens such as `<source language>`, `<target language>`, `<task>` and `<toggle PnC>`
113
  are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer from individual
 
249
 
250
  ## Performance
251
 
252
+ In both ASR and AST experiments, predictions were generated using beam search with width 5 and length penalty 1.0.
253
 
254
  ### ASR Performance (w/o PnC)
255
 
256
+ The ASR performance is measured with word error rate (WER) on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test sets on four languages, and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
257
 
258
 
259
  | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
 
265
 
266
  ### AST Performance
267
 
268
+ We evaluate AST performance with BLEU score on the [FLEURS](https://huggingface.co/datasets/google/fleurs) test sets on four languages and use their native annotations with punctuation and capitalization.
269
 
270
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
271
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|