# Inference with SeamlessM4T models

Refer to the [SeamlessM4T README](../../../../../docs/m4t) for an overview of the M4T models.

Inference is run with the CLI from the root directory of the repository.

The model is selected with `--model_name`: `seamlessM4T_v2_large`, `seamlessM4T_large`, or `seamlessM4T_medium`.

**S2ST**:
```bash
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_large
```

**S2TT**:
```bash
m4t_predict <path_to_input_audio> --task s2tt --tgt_lang <tgt_lang>
```

**T2TT**:
```bash
m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>
```

**T2ST**:
```bash
m4t_predict <input_text> --task t2st --tgt_lang <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio>
```

**ASR**:
```bash
m4t_predict <path_to_input_audio> --task asr --tgt_lang <tgt_lang>
```
Set `--ngram-filtering` to `True` to get the same translation quality as the [demo](https://seamless.metademolab.com/).
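For intuition: n-gram filtering drops hypotheses that repeat an n-gram, a common symptom of degenerate looping output. Here is a toy sketch of the idea in plain Python (an illustration only, not the library's implementation):

```python
def has_repeated_ngram(tokens, n=4):
    """Return True if any n-gram occurs more than once in `tokens`.

    A toy stand-in for the idea behind n-gram filtering: hypotheses
    containing repeated n-grams are treated as degenerate.
    """
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

# A looping hypothesis trips the filter; a normal one does not.
print(has_repeated_ngram("the cat sat on the cat sat on".split()))  # True
print(has_repeated_ngram("the cat sat on the mat".split()))         # False
```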

Currently, the input audio must be sampled at 16 kHz. Here is how to resample your audio:
```python
import torchaudio

resample_rate = 16000
waveform, sample_rate = torchaudio.load(<path_to_input_audio>)
resampler = torchaudio.transforms.Resample(sample_rate, resample_rate, dtype=waveform.dtype)
resampled_waveform = resampler(waveform)
torchaudio.save(<path_to_resampled_audio>, resampled_waveform, resample_rate)
```

## Inference breakdown

Inference uses a `Translator` object instantiated with a multitask UnitY or UnitY2 model, one of:
- [`seamlessM4T_v2_large`](https://huggingface.co/facebook/seamless-m4t-v2-large)
- [`seamlessM4T_large`](https://huggingface.co/facebook/seamless-m4t-large)
- [`seamlessM4T_medium`](https://huggingface.co/facebook/seamless-m4t-medium)

and a vocoder:
- `vocoder_v2` for `seamlessM4T_v2_large`.
- `vocoder_36langs` for `seamlessM4T_large` or `seamlessM4T_medium`.

```python
import torch
import torchaudio
from seamless_communication.inference import Translator


# Initialize a Translator object with a multitask model and vocoder on the GPU.
translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
```

Now `predict()` can be used to run inference as many times as needed on any of the supported tasks.

Given input audio at `<path_to_input_audio>` or input text `<input_text>` in `<src_lang>`,
we first set `text_generation_opts` and `unit_generation_opts`, then translate into `<tgt_lang>` as follows:
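To illustrate the kind of parameters these option objects carry, here is a hypothetical stand-in written as plain dataclasses. The names `TextGenerationOpts`, `UnitGenerationOpts`, and their fields are illustrative only, not the library's API; the `m4t_predict` CLI constructs the real option objects for you:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the library's generation-option objects.
# The field names below only illustrate the kind of decoding knobs
# such objects typically bundle.
@dataclass
class TextGenerationOpts:
    beam_size: int = 5                # beam-search width for text decoding
    max_seq_len: Optional[int] = None # optional cap on generated text length

@dataclass
class UnitGenerationOpts:
    beam_size: int = 5                # beam width for unit decoding
    ngram_filtering: bool = False     # suppress hypotheses with repeated n-grams

text_generation_opts = TextGenerationOpts(beam_size=5)
unit_generation_opts = UnitGenerationOpts(beam_size=5, ngram_filtering=True)
```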

## S2ST and T2ST

```python
# S2ST
text_output, speech_output = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2ST",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)

# T2ST
text_output, speech_output = translator.predict(
    input=<input_text>,
    task_str="T2ST",
    tgt_lang=<tgt_lang>,
    src_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)
```
Note that `<src_lang>` must be specified for T2ST.

The generated units are synthesized and the output audio file is saved with:

```python
# Save the translated audio.
torchaudio.save(
    <path_to_save_audio>,
    speech_output.audio_wavs[0][0].cpu(),
    sample_rate=speech_output.sample_rate,
)
```
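If you prefer not to go through `torchaudio.save` for output, a mono float waveform in `[-1.0, 1.0]` can also be written as 16-bit PCM with the standard-library `wave` module. A minimal sketch, assuming `samples` is a 1-D Python sequence of floats (e.g. the output tensor moved to CPU and flattened to a list):

```python
import struct
import wave

def save_wav_16bit(path, samples, sample_rate=16000):
    """Write a 1-D sequence of floats in [-1.0, 1.0] as mono 16-bit PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(frames)

# Example: three samples at 16 kHz.
save_wav_16bit("out.wav", [0.0, 0.5, -0.5])
```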

## S2TT, T2TT and ASR

```python
# S2TT
text_output, _ = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2TT",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)

# ASR
# This is equivalent to S2TT with `<tgt_lang>` set to `<src_lang>`.
text_output, _ = translator.predict(
    input=<path_to_input_audio>,
    task_str="ASR",
    tgt_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)

# T2TT
text_output, _ = translator.predict(
    input=<input_text>,
    task_str="T2TT",
    tgt_lang=<tgt_lang>,
    src_lang=<src_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=None,
)
```
Note that `<src_lang>` must be specified for T2TT.
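The argument rules spelled out above can be condensed into a small validation helper. `check_args` and the `TASKS` table below are illustrative, not part of the library:

```python
# Per-task requirements, as described in this README. Purely illustrative.
TASKS = {
    # task: (input kind, needs src_lang, produces speech)
    "s2st": ("audio", False, True),
    "s2tt": ("audio", False, False),
    "asr":  ("audio", False, False),
    "t2st": ("text",  True,  True),
    "t2tt": ("text",  True,  False),
}

def check_args(task, tgt_lang, src_lang=None):
    """Raise ValueError if the argument combination is invalid for `task`."""
    kind, needs_src, makes_speech = TASKS[task.lower()]
    if not tgt_lang:
        raise ValueError("tgt_lang is required for every task")
    if needs_src and not src_lang:
        raise ValueError(f"src_lang is required for {task}")
    return kind, makes_speech

print(check_args("t2tt", "fra", src_lang="eng"))  # ('text', False)
```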