# Inference with SeamlessM4T models

Refer to the [SeamlessM4T README](../../../../../docs/m4t) for an overview of the M4T models.

Inference is run with the CLI, from the root directory of the repository.

The model can be specified with `--model_name`: `seamlessM4T_v2_large`, `seamlessM4T_large`, or `seamlessM4T_medium`:

**S2ST**:
```bash
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_large
```

**S2TT**:
```bash
m4t_predict <path_to_input_audio> --task s2tt --tgt_lang <tgt_lang>
```

**T2TT**:
```bash
m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>
```

**T2ST**:
```bash
m4t_predict <input_text> --task t2st --tgt_lang <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio>
```

**ASR**:
```bash
m4t_predict <path_to_input_audio> --task asr --tgt_lang <tgt_lang>
```

Please set `--ngram-filtering` to `True` to get the same translation performance as the [demo](https://seamless.metademolab.com/).

Currently, the input audio must be sampled at 16 kHz. Here's how you could resample your audio:
```python
import torchaudio

resample_rate = 16000
waveform, sample_rate = torchaudio.load(<path_to_input_audio>)
resampler = torchaudio.transforms.Resample(sample_rate, resample_rate, dtype=waveform.dtype)
resampled_waveform = resampler(waveform)
torchaudio.save(<path_to_resampled_audio>, resampled_waveform, resample_rate)
```

## Inference breakdown

Inference calls for a `Translator` object instantiated with a multitask UnitY or UnitY2 model, one of:
- [`seamlessM4T_v2_large`](https://huggingface.co/facebook/seamless-m4t-v2-large)
- [`seamlessM4T_large`](https://huggingface.co/facebook/seamless-m4t-large)
- [`seamlessM4T_medium`](https://huggingface.co/facebook/seamless-m4t-medium)

and a vocoder:
- `vocoder_v2` for `seamlessM4T_v2_large`.
- `vocoder_36langs` for `seamlessM4T_large` or `seamlessM4T_medium`.

```python
import torch
import torchaudio
from seamless_communication.inference import Translator

# Initialize a Translator object with a multitask model and a vocoder on the GPU.
translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
```

Now `predict()` can be used to run inference as many times as needed on any of the supported tasks.

Given an input audio `<path_to_input_audio>` or an input text `<input_text>` in `<src_lang>`, we first set `text_generation_opts` and `unit_generation_opts` (one way to construct them is sketched in "Constructing the generation options" below) and then translate into `<tgt_lang>` as follows:

## S2ST and T2ST:

```python
# S2ST
text_output, speech_output = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2ST",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)

# T2ST
text_output, speech_output = translator.predict(
    input=<input_text>,
    task_str="T2ST",
    tgt_lang=<tgt_lang>,
    src_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)
```

Note that `<src_lang>` must be specified for T2ST.

The generated units are synthesized and the output audio file is saved with:

```python
# Save the translated audio generation.
torchaudio.save(
    <path_to_save_audio>,
    speech_output.audio_wavs[0][0].cpu(),
    sample_rate=speech_output.sample_rate,
)
```

## S2TT, T2TT and ASR:

```python
# S2TT
text_output, _ = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2TT",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)

# ASR
# This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
text_output, _ = translator.predict(
    input=<path_to_input_audio>,
    task_str="ASR",
    tgt_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)

# T2TT
text_output, _ = translator.predict(
    input=<input_text>,
    task_str="T2TT",
    tgt_lang=<tgt_lang>,
    src_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)
```

Note that `<src_lang>` must be specified for T2TT.
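
## Constructing the generation options

The `predict()` snippets above use `text_generation_opts` and `unit_generation_opts` without showing how to build them. Below is a minimal sketch of one way to construct them, assuming `SequenceGeneratorOptions` is exported from `seamless_communication.inference` (as in recent versions of this repository); the beam sizes and soft length limits are illustrative values, not requirements.

```python
from seamless_communication.inference import SequenceGeneratorOptions

# Beam-search options for decoding the target text.
text_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200))

# Beam-search options for decoding the discrete speech units
# (S2ST/T2ST only; pass unit_generation_opts=None for text-only tasks).
unit_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50))
```

Here `soft_max_seq_len=(a, b)` softly caps generation at `a * x + b` tokens for a source sequence of length `x`.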
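
## End-to-end example

Putting the pieces together, here is a self-contained sketch of a complete S2ST run under the same assumptions as above. The file names `input.wav` and `translated.wav` and the target language code `"fra"` are illustrative placeholders; the waveform is cast to `float32` before saving because the model runs in `float16`.

```python
import torch
import torchaudio
from seamless_communication.inference import SequenceGeneratorOptions, Translator

# Load the multitask model and its matching vocoder on the GPU.
translator = Translator(
    "seamlessM4T_large",
    "vocoder_36langs",
    torch.device("cuda:0"),
    dtype=torch.float16,
)

text_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200))
unit_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50))

# Translate 16 kHz input speech into French speech and text.
text_output, speech_output = translator.predict(
    input="input.wav",  # illustrative path to a 16 kHz audio file
    task_str="S2ST",
    tgt_lang="fra",
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)

# The first entry holds the translation for the first (here, only) input.
print(f"Translated text: {text_output[0]}")

# Save the synthesized speech; cast to float32 since the model ran in float16.
torchaudio.save(
    "translated.wav",
    speech_output.audio_wavs[0][0].to(torch.float32).cpu(),
    sample_rate=speech_output.sample_rate,
)
```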