nvidia/canary-1b

@@ -284,7 +284,7 @@ The Canay-1B model has 24 encoder layers and 24 layers of decoder layers in tota
 ## NVIDIA NeMo
-To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed Cython and latest PyTorch version.
 ```
 pip install git+https://github.com/NVIDIA/NeMo.git@r1.23.0#egg=nemo_toolkit[asr]
 ```
@@ -309,24 +309,18 @@ canary_model.change_decoding_strategy(decode_cfg)
 ```
 ### Input Format
-The input to the model can be a directory containing audio files, in which case the model will perform ASR on English and produces text with punctuation and capitalization:
 ```python
 predicted_text = canary_model.transcribe(
-    audio_dir="<path to directory containing audios>",
     batch_size=16,  # batch size to run the inference with
 )
 ```
-or use:
-```bash
-python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
- pretrained_name="nvidia/canary-1b"
- audio_dir="<path to audio directory>"
-```
-Another recommended option is to use a json manifest as input, where each line in the file is a dictionary containing the following fields:
 ```yaml
 # Example of a line in input_manifest.json
 {
@@ -348,13 +342,6 @@ predicted_text = canary_model.transcribe(
 )
 ```
-or use:
-```bash
-python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
- pretrained_name="nvidia/canary-1b"
- dataset_manifest="<path to manifest file>"
-```
 ### Automatic Speech-to-text Recognition (ASR)
@@ -391,6 +378,21 @@ An example manifest for transcribing English audios into German text can be:
 }
 ```
 ### Input

 ## NVIDIA NeMo
+To train, fine-tune, or transcribe with Canary, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed Cython and the latest PyTorch version.
 ```
 pip install git+https://github.com/NVIDIA/NeMo.git@r1.23.0#egg=nemo_toolkit[asr]
 ```
 ```
 ### Input Format
+Input to Canary can be either a list of paths to audio files or a jsonl manifest file.
+If the input is a list of paths, Canary assumes the audio is English and transcribes it; that is, Canary's default behaviour is English ASR.
 ```python
 predicted_text = canary_model.transcribe(
+    paths2audio_files=['path1.wav', 'path2.wav'],
     batch_size=16,  # batch size to run the inference with
 )
 ```
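Because `transcribe()` is called here with a list of paths rather than a directory, a folder of recordings has to be expanded into a list first. A minimal sketch using only the standard library; the helper name `list_wav_files` and the `.wav`-only filter are assumptions for illustration, not part of NeMo:

```python
from pathlib import Path


def list_wav_files(audio_dir):
    """Return the sorted paths (as strings) of the .wav files directly inside audio_dir."""
    # glob("*.wav") does not recurse into subdirectories.
    return sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
```

The result could then be passed as `paths2audio_files=list_wav_files("<path to directory containing audios>")` in the call above.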
+To use Canary to transcribe other supported languages or to perform speech-to-text translation, specify the input as a jsonl manifest file, where each line in the file is a dictionary containing the following fields:
 ```yaml
 # Example of a line in input_manifest.json
 {
 )
 ```
 ### Automatic Speech-to-text Recognition (ASR)
 }
 ```
+Alternatively, you can use the `transcribe_speech.py` script to do the same.
+```bash
+python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
+ pretrained_name="nvidia/canary-1b" \
+ audio_dir="<path to audio_directory>" # transcribes all the wav files in audio_directory
+```
+```bash
+python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
+ pretrained_name="nvidia/canary-1b" \
+ dataset_manifest="<path to manifest file>"
+```
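The file passed as `dataset_manifest` is a jsonl file with one JSON dictionary per line. A rough sketch of writing such a file with the standard library follows. `write_manifest` is a hypothetical helper, and the keys in the usage example (`audio_filepath`, `source_lang`, `target_lang`) are assumed for illustration only; the field list in the model card's manifest example takes precedence.

```python
import json


def write_manifest(entries, manifest_path):
    """Write one JSON dictionary per line (jsonl) to manifest_path."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


# The keys below are assumptions for illustration; use the fields listed in the model card.
entries = [
    {"audio_filepath": "path1.wav", "source_lang": "en", "target_lang": "de"},
    {"audio_filepath": "path2.wav", "source_lang": "en", "target_lang": "de"},
]
write_manifest(entries, "input_manifest.json")
```

Writing each entry with `json.dumps` keeps every dictionary on a single line, which is what a line-oriented manifest reader expects.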
 ### Input