Evaluating SeamlessStreaming and Seamless models
SeamlessStreaming is the streaming-only model, and Seamless is the expressive streaming model.
Quick start:
Evaluation can be run with the streaming_evaluate CLI.
We use seamless_streaming_unity for loading the speech encoder and T2U models, and seamless_streaming_monotonic_decoder for loading the text decoder for streaming evaluation. These are already set as the defaults for the streaming_evaluate CLI, but can be overridden using the --unity-model-name and --monotonic-decoder-model-name args if required.
Note that the numbers in our paper were obtained with single-precision floating point (fp32) by setting --dtype fp32. Also note that the results from running these evaluations might differ slightly from the results reported in our paper (which will be updated soon with the new results).
S2TT:
Set the task to s2tt for evaluating the speech-to-text translation part of the SeamlessStreaming model.
streaming_evaluate --task s2tt --data-file <path_to_data_tsv_file> --audio-root-dir <path_to_audio_root_directory> --output <path_to_evaluation_output_directory> --tgt-lang <3_letter_lang_code>
Note: The --ref-field arg can be used to specify the name of the reference column in the dataset.
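As an illustration, the options mentioned so far can be combined into a single S2TT invocation. The sketch below passes the default model names explicitly, sets --dtype fp32, and adds --ref-field only when a REF_FIELD environment variable is provided. The paths, the spa language code, the environment variable names, and the DRY_RUN switch are placeholders and conventions assumed for this example; they are not part of the CLI. By default the script only prints the command so it can be inspected before a long evaluation is launched.

```shell
#!/bin/sh
# Placeholder inputs (assumptions): replace with your own dataset paths.
DATA_FILE="${DATA_FILE:-/path/to/data.tsv}"
AUDIO_ROOT="${AUDIO_ROOT:-/path/to/audio}"
OUTPUT_DIR="${OUTPUT_DIR:-/path/to/output}"
TGT_LANG="${TGT_LANG:-spa}"

# Collect the arguments as positional parameters so that paths
# containing spaces are passed through intact on a real run.
set -- --task s2tt \
    --data-file "$DATA_FILE" \
    --audio-root-dir "$AUDIO_ROOT" \
    --output "$OUTPUT_DIR" \
    --tgt-lang "$TGT_LANG" \
    --dtype fp32 \
    --unity-model-name seamless_streaming_unity \
    --monotonic-decoder-model-name seamless_streaming_monotonic_decoder

# Optional: name of the reference column, only added when REF_FIELD is set.
if [ -n "${REF_FIELD:-}" ]; then
    set -- "$@" --ref-field "$REF_FIELD"
fi

CMD="streaming_evaluate $*"
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$CMD"                 # dry run: only show the command
else
    streaming_evaluate "$@"     # real run: requires the CLI to be installed
fi
```

Run it with DRY_RUN=0 to execute the evaluation instead of printing the command.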
ASR:
Set the task to asr for evaluating the automatic speech recognition part of the SeamlessStreaming model. Make sure to pass the source language as the --tgt-lang arg.
streaming_evaluate --task asr --data-file <path_to_data_tsv_file> --audio-root-dir <path_to_audio_root_directory> --output <path_to_evaluation_output_directory> --tgt-lang <3_letter_source_lang_code>
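Because the source language goes into --tgt-lang for ASR, a small wrapper can make that convention harder to get wrong when evaluating several languages. The following is a minimal sketch under stated assumptions: the asr_eval function, the per-language output subdirectory, the DRY_RUN switch, the placeholder paths, and the example language codes (eng, spa, fra) are all choices made for this illustration and are not part of the CLI.

```shell
#!/bin/sh
# Placeholder inputs (assumptions): replace with your own dataset paths.
DATA_FILE="${DATA_FILE:-/path/to/data.tsv}"
AUDIO_ROOT="${AUDIO_ROOT:-/path/to/audio}"
OUTPUT_DIR="${OUTPUT_DIR:-/path/to/output}"

# Hypothetical helper: evaluate ASR for one source language.
# The source language code is passed through --tgt-lang.
asr_eval() {
    src_lang="$1"
    set -- --task asr \
        --data-file "$DATA_FILE" \
        --audio-root-dir "$AUDIO_ROOT" \
        --output "$OUTPUT_DIR/$src_lang" \
        --tgt-lang "$src_lang"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "streaming_evaluate $*"    # dry run: only show the command
    else
        streaming_evaluate "$@"         # real run: requires the CLI
    fi
}

# Example language codes; in practice each language would likely
# need its own data file, which this sketch does not handle.
for lang in eng spa fra; do
    asr_eval "$lang"
done
```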
S2ST:
SeamlessStreaming:
Set the task to s2st for evaluating the speech-to-speech translation part of the SeamlessStreaming model.
streaming_evaluate --task s2st --data-file <path_to_data_tsv_file> --audio-root-dir <path_to_audio_root_directory> --output <path_to_evaluation_output_directory> --tgt-lang <3_letter_lang_code>
Seamless:
The Seamless model is a unified model for streaming expressive speech-to-speech translation. Use the --expressive arg to run evaluation of this unified model.
streaming_evaluate --task s2st --data-file <path_to_data_tsv_file> --audio-root-dir <path_to_audio_root_directory> --output <path_to_evaluation_output_directory> --tgt-lang <3_letter_lang_code> --expressive
By default, the Seamless model uses the 24 kHz vocoder (vocoder_pretssel). The current version of our paper uses the 16 kHz version (vocoder_pretssel_16khz) for evaluation, so to reproduce those results, add this arg to the above command: --vocoder-name vocoder_pretssel_16khz.
The vocoder_pretssel and vocoder_pretssel_16khz checkpoints are gated; please check out this section to acquire these checkpoints. Also, make sure to add --gated-model-dir <path_to_vocoder_checkpoints_dir>
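Putting the expressive flags described above together, a command that aims to reproduce the 16 kHz setup might look like the sketch below. All flags come from the text above; the paths, the spa language code, the environment variable names, the DRY_RUN switch, and the directory check are assumptions added for this example.

```shell
#!/bin/sh
# Placeholder inputs (assumptions): replace with your own paths.
DATA_FILE="${DATA_FILE:-/path/to/data.tsv}"
AUDIO_ROOT="${AUDIO_ROOT:-/path/to/audio}"
OUTPUT_DIR="${OUTPUT_DIR:-/path/to/output}"
TGT_LANG="${TGT_LANG:-spa}"
GATED_MODEL_DIR="${GATED_MODEL_DIR:-/path/to/vocoder_checkpoints}"

# Expressive S2ST evaluation with the 16 kHz vocoder and gated checkpoints.
set -- --task s2st \
    --data-file "$DATA_FILE" \
    --audio-root-dir "$AUDIO_ROOT" \
    --output "$OUTPUT_DIR" \
    --tgt-lang "$TGT_LANG" \
    --expressive \
    --vocoder-name vocoder_pretssel_16khz \
    --gated-model-dir "$GATED_MODEL_DIR"

CMD="streaming_evaluate $*"
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$CMD"                 # dry run: only show the command
elif [ ! -d "$GATED_MODEL_DIR" ]; then
    # Fail early rather than partway through model loading.
    echo "Gated checkpoint directory not found: $GATED_MODEL_DIR" >&2
    exit 1
else
    streaming_evaluate "$@"     # real run: requires the CLI to be installed
fi
```

To evaluate with the default 24 kHz vocoder instead, drop the --vocoder-name line and keep --gated-model-dir.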