# Inference with SeamlessM4T models
Refer to the [SeamlessM4T README](../../../../../docs/m4t) for an overview of the M4T models.

Inference is run with the CLI, from the root directory of the repository.

The model can be selected with `--model_name`: `seamlessM4T_v2_large`, `seamlessM4T_large`, or `seamlessM4T_medium`:

**S2ST**:
```bash
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_large
```

**S2TT**:
```bash
m4t_predict <path_to_input_audio> --task s2tt --tgt_lang <tgt_lang>
```

**T2TT**:
```bash
m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>
```

**T2ST**:
```bash
m4t_predict <input_text> --task t2st --tgt_lang <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio>
```

**ASR**:
```bash
m4t_predict <path_to_input_audio> --task asr --tgt_lang <tgt_lang>
```
Set `--ngram-filtering` to `True` to get the same translation performance as the [demo](https://seamless.metademolab.com/).
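
For example, a hypothetical end-to-end invocation (the file paths are illustrative; `fra` is the three-letter code for French) could look like:
```bash
# Translate the speech in input.wav into French speech, saved to out.wav.
m4t_predict ./input.wav --task s2st --tgt_lang fra --output_path ./out.wav --model_name seamlessM4T_v2_large
```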

The input audio must currently be sampled at 16 kHz. Here's how you could resample your audio:
```python
import torchaudio

# Load the source audio and resample it to the 16 kHz the models expect.
resample_rate = 16000
waveform, sample_rate = torchaudio.load(<path_to_input_audio>)
resampler = torchaudio.transforms.Resample(sample_rate, resample_rate, dtype=waveform.dtype)
resampled_waveform = resampler(waveform)
torchaudio.save(<path_to_resampled_audio>, resampled_waveform, resample_rate)
```
## Inference breakdown

Inference uses a `Translator` object instantiated with a multitask UnitY or UnitY2 model, one of:
- [`seamlessM4T_v2_large`](https://huggingface.co/facebook/seamless-m4t-v2-large)
- [`seamlessM4T_large`](https://huggingface.co/facebook/seamless-m4t-large)
- [`seamlessM4T_medium`](https://huggingface.co/facebook/seamless-m4t-medium)

and a matching vocoder:
- `vocoder_v2` for `seamlessM4T_v2_large`.
- `vocoder_36langs` for `seamlessM4T_large` or `seamlessM4T_medium`.

```python
import torch
import torchaudio
from seamless_communication.inference import Translator

# Initialize a Translator object with a multitask model and vocoder on the GPU.
translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
```
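
If no GPU is available, the same constructor can target the CPU; a minimal sketch, assuming `float32` (half precision is generally not supported for CPU inference):
```python
# Hypothetical CPU fallback: the smaller model in float32, since float16 is GPU-oriented.
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cpu"), torch.float32)
```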

Now `predict()` can be used to run inference as many times as needed on any of the supported tasks.

Given input audio at `<path_to_input_audio>` or input text `<input_text>` in `<src_lang>`,
we first set `text_generation_opts` and `unit_generation_opts`, and then translate into `<tgt_lang>` as follows:
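
A minimal sketch of how these options might be constructed, assuming `SequenceGeneratorOptions` is exported from `seamless_communication.inference` (the beam sizes and length bounds below are illustrative, not tuned values):
```python
from seamless_communication.inference import SequenceGeneratorOptions

# Beam-search options for the text decoder and for the text-to-unit decoder.
text_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200))
unit_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50))
```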

## S2ST and T2ST

```python
# S2ST
text_output, speech_output = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2ST",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)

# T2ST
text_output, speech_output = translator.predict(
    input=<input_text>,
    task_str="T2ST",
    tgt_lang=<tgt_lang>,
    src_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)
```
Note that `<src_lang>` must be specified for T2ST.

The generated units are synthesized and the output audio file is saved with:

```python
# Save the translated audio generation.
torchaudio.save(
    <path_to_save_audio>,
    speech_output.audio_wavs[0][0].cpu(),
    sample_rate=speech_output.sample_rate,
)
```

## S2TT, T2TT and ASR

```python
# S2TT
text_output, _ = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2TT",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)

# ASR
# This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
text_output, _ = translator.predict(
    input=<path_to_input_audio>,
    task_str="ASR",
    tgt_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)

# T2TT
text_output, _ = translator.predict(
    input=<input_text>,
    task_str="T2TT",
    tgt_lang=<tgt_lang>,
    src_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)
```
Note that `<src_lang>` must be specified for T2TT.
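
For all of these tasks, the first element of the returned tuple holds the translated (or transcribed) text; a minimal way to inspect it, assuming at least one hypothesis is returned:
```python
# Print the best text hypothesis.
print(str(text_output[0]))
```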