Update README.md

#2
by sanchit-gandhi HF staff - opened
Files changed (1)
  1. README.md +29 -52
README.md CHANGED
@@ -5,15 +5,17 @@ tags:
5
  license: cc-by-nc-4.0
6
  ---
7
 
8
- # SeamlessM4T
9
- SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
 
 
10
 
11
  SeamlessM4T covers:
12
  - 📥 101 languages for speech input
13
  - ⌨️ 96 Languages for text input/output
14
  - 🗣️ 35 languages for speech output.
15
 
16
- This unified model enables multiple tasks without relying on multiple separate models:
17
  - Speech-to-speech translation (S2ST)
18
  - Speech-to-text translation (S2TT)
19
  - Text-to-speech translation (T2ST)
@@ -21,37 +23,40 @@ This unified model enables multiple tasks without relying on multiple separate m
21
  - Automatic speech recognition (ASR)
22
 
23
  ## SeamlessM4T models
 
 
 
24
  | Model Name | #params | checkpoint | metrics |
25
  | - | - | - | - |
26
- | SeamlessM4T-Large | 2.3B |[🤗 Model card](https://huggingface.co/facebook/seamless-m4t-large) - [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/resolve/main/multitask_unity_large.pt) | [metrics]() |
27
- | SeamlessM4T-Medium | 1.2B |[🤗 Model card](https://huggingface.co/facebook/seamless-m4t-medium) - [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/resolve/main/multitask_unity_medium.pt) | [metrics]() |
28
 
29
- We provide the extensive evaluation results of seamlessM4T-Large and SeamlessM4T-Medium reported in the paper (as averages) in the `metrics` files above.
30
 
31
  ## Instructions to run inference with SeamlessM4T models
32
 
33
- Install `seamless_communication` by following the instructions mentioned here: [Installation](https://github.com/fairinternal/seamless_communication/tree/main#installation)
34
-
35
- Inference calls for the `Translator` object instantiated with a Multitasking UnitY model with the options:
36
- - `multitask_unity_large`
37
- - `multitask_unity_medium`
38
 
39
- and a vocoder `vocoder_36langs`
 
 
 
 
40
 
41
  ```python
42
  import torch
43
- import torchaudio
44
  from seamless_communication.models.inference import Translator
45
 
46
 
47
  # Initialize a Translator object with a multitask model, vocoder on the GPU.
48
- translator = Translator("multitask_unity_large", "vocoder_36langs", torch.device("cuda:0"))
49
  ```
50
 
51
- Now `predict()` can be used to run inference as many times on any of the supported tasks.
52
 
53
- Given an input audio with `<path_to_input_audio>` or an input text `<input_text>` in `<src_lang>`,
54
- we can translate into `<tgt_lang>` as follows:
55
 
56
  ### S2ST and T2ST:
57
 
@@ -61,8 +66,8 @@ translated_text, wav, sr = translator.predict(<path_to_input_audio>, "s2st", <tg
61
 
62
  # T2ST
63
  translated_text, wav, sr = translator.predict(<input_text>, "t2st", <tgt_lang>, src_lang=<src_lang>)
64
-
65
  ```
 
66
  Note that `<src_lang>` must be specified for T2ST.
67
 
68
  The generated units are synthesized and the output audio file is saved with:
@@ -92,49 +97,21 @@ transcribed_text, _, _ = translator.predict(<path_to_input_audio>, "asr", <src_l
92
  translated_text, _, _ = translator.predict(<input_text>, "t2tt", <tgt_lang>, src_lang=<src_lang>)
93
 
94
  ```
95
- Note that `<src_lang>` must be specified for T2TT
96
-
97
-
98
- ### Inference using the CLI, from the root directory of the repository:
99
-
100
- The model can be specified with e.g., `--model_name multitask_unity_large`:
101
-
102
- S2ST:
103
- ```
104
- python scripts/m4t/predict/predict.py <path_to_input_audio> s2st <tgt_lang> --output_path <path_to_save_audio> --model_name multitask_unity_large
105
- ```
106
-
107
- S2TT:
108
- ```
109
- python scripts/m4t/predict/predict.py <path_to_input_audio> s2tt <tgt_lang>
110
- ```
111
-
112
- T2TT:
113
- ```
114
- python scripts/m4t/predict/predict.py <input_text> t2tt <tgt_lang> --src_lang <src_lang>
115
- ```
116
-
117
- T2ST:
118
- ```
119
- python scripts/m4t/predict/predict.py <input_text> t2st <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio>
120
- ```
121
-
122
- ASR:
123
- ```
124
- python scripts/m4t/predict/predict.py <path_to_input_audio> asr <tgt_lang>
125
- ```
126
 
127
  ## Citation
128
- If you use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite :
 
129
 
130
  ```bibtex
131
  @article{seamlessm4t2023,
132
- title={SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation},
133
  author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a} \footnotemark[3], Onur \,{C}elebi,Maha Elbayad,Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
134
  journal={ArXiv},
135
  year={2023}
136
  }
137
  ```
 
138
  ## License
139
 
140
- seamless_communication is CC-BY-NC 4.0 licensed.
 
5
  license: cc-by-nc-4.0
6
  ---
7
 
8
+ # SeamlessM4T Large
9
+
10
+ SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different
11
+ linguistic communities to communicate effortlessly through speech and text.
12
 
13
  SeamlessM4T covers:
14
  - 📥 101 languages for speech input
15
  - ⌨️ 96 Languages for text input/output
16
  - 🗣️ 35 languages for speech output.
17
 
18
+ This is the "large" variant of the unified model, which enables multiple tasks without relying on multiple separate models:
19
  - Speech-to-speech translation (S2ST)
20
  - Speech-to-text translation (S2TT)
21
  - Text-to-speech translation (T2ST)
 
23
  - Automatic speech recognition (ASR)
24
 
25
  ## SeamlessM4T models
26
+
27
+ The SeamlessM4T models come in two checkpoints of different size:
28
+
29
  | Model Name | #params | checkpoint | metrics |
30
  | - | - | - | - |
31
+ | [SeamlessM4T-Medium](https://huggingface.co/facebook/seamless-m4t-medium) | 1.2B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/resolve/main/multitask_unity_medium.pt) | [metrics]() |
32
+ | [SeamlessM4T-Large](https://huggingface.co/facebook/seamless-m4t-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/resolve/main/multitask_unity_large.pt) | [metrics]() |
33
 
34
+ We provide the extensive evaluation results of SeamlessM4T-Medium and SeamlessM4T-Large reported in the SeamlessM4T paper (as averages) in the `metrics` files above.
35
 
36
  ## Instructions to run inference with SeamlessM4T models
37
 
38
+ The SeamlessM4T models are currently available through the `seamless_communication` package. The `seamless_communication`
39
+ package can be installed by following the instructions outlined here: [Installation](https://github.com/fairinternal/seamless_communication/tree/main#installation).
 
 
 
40
 
41
+ Once installed, a [`Translator`](https://github.com/fairinternal/seamless_communication/blob/590547965b343b590d15847a0aa25a6779fc3753/src/seamless_communication/models/inference/translator.py#L47)
42
+ object can be instantiated to perform all five of the spoken language tasks. The `Translator` is instantiated with three arguments:
43
+ 1. `model_name_or_card`: SeamlessM4T checkpoint. Can be either `multitask_unity_medium` for the medium model, or `multitask_unity_large` for the large model
44
+ 2. `vocoder_name_or_card`: vocoder checkpoint (`vocoder_36langs`)
45
+ 3. `device`: Torch device
46
 
47
  ```python
48
  import torch
 
49
  from seamless_communication.models.inference import Translator
50
 
51
 
52
  # Initialize a Translator object with a multitask model, vocoder on the GPU.
53
+ translator = Translator("multitask_unity_large", vocoder_name_or_card="vocoder_36langs", device=torch.device("cuda:0"))
54
  ```
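The three constructor arguments listed above can be gathered by a small helper. This is only a sketch: `translator_args` and `MODEL_CARDS` are hypothetical names that are not part of `seamless_communication`, and the device is kept as a plain string so the snippet runs without PyTorch installed.

```python
# Hypothetical helper, not part of seamless_communication: it only collects
# the three Translator arguments described in the README.
MODEL_CARDS = {
    "medium": "multitask_unity_medium",
    "large": "multitask_unity_large",
}
VOCODER_CARD = "vocoder_36langs"


def translator_args(size: str, device: str = "cuda:0") -> dict:
    """Return the Translator arguments for the given model size."""
    try:
        model_card = MODEL_CARDS[size]
    except KeyError:
        raise ValueError(
            f"unknown model size {size!r}; expected one of {sorted(MODEL_CARDS)}"
        ) from None
    return {
        "model_name_or_card": model_card,
        "vocoder_name_or_card": VOCODER_CARD,
        # Assumption: the caller wraps this string in torch.device(...) later.
        "device": device,
    }
```

With `args = translator_args("large")`, the call shown above would become `Translator(args["model_name_or_card"], vocoder_name_or_card=args["vocoder_name_or_card"], device=torch.device(args["device"]))`.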
55
 
56
+ Once instantiated, the `predict()` method can be used to run inference as many times as needed on any of the supported tasks.
57
 
58
+ Given an input audio with `<path_to_input_audio>` or an input text `<input_text>` in `<src_lang>`, we can translate
59
+ into `<tgt_lang>` as follows.
60
 
61
  ### S2ST and T2ST:
62
 
 
66
 
67
  # T2ST
68
  translated_text, wav, sr = translator.predict(<input_text>, "t2st", <tgt_lang>, src_lang=<src_lang>)
 
69
  ```
70
+
71
  Note that `<src_lang>` must be specified for T2ST.
72
 
73
  The generated units are synthesized and the output audio file is saved with:
 
97
  translated_text, _, _ = translator.predict(<input_text>, "t2tt", <tgt_lang>, src_lang=<src_lang>)
98
 
99
  ```
100
+ Note that `<src_lang>` must be specified for T2TT.
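The `<src_lang>` requirement for the text-input tasks can be checked before calling `predict()`. The sketch below assumes only the five task strings used in this README; `predict_kwargs` is a hypothetical helper, not a function of `seamless_communication`.

```python
from typing import Optional

# Task strings as they appear in the README examples.
SPEECH_INPUT_TASKS = frozenset({"s2st", "s2tt", "asr"})
TEXT_INPUT_TASKS = frozenset({"t2st", "t2tt"})


def predict_kwargs(task: str, src_lang: Optional[str] = None) -> dict:
    """Validate a task and return the keyword arguments to pass to predict().

    Hypothetical helper: it only encodes the rule that text-input tasks
    (T2ST, T2TT) need an explicit source language.
    """
    task = task.lower()
    if task not in SPEECH_INPUT_TASKS | TEXT_INPUT_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if task in TEXT_INPUT_TASKS and src_lang is None:
        raise ValueError(f"src_lang must be specified for {task.upper()}")
    # Only forward src_lang when the caller actually provided one.
    return {} if src_lang is None else {"src_lang": src_lang}
```

A T2TT call could then be written as `translator.predict(<input_text>, "t2tt", <tgt_lang>, **predict_kwargs("t2tt", <src_lang>))`, which fails early with a clear message if the source language is missing.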
101
 
102
  ## Citation
103
+
104
+ If you plan to use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite:
105
 
106
  ```bibtex
107
  @article{seamlessm4t2023,
108
+ title={"SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation"},
109
  author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a} \footnotemark[3], Onur \,{C}elebi,Maha Elbayad,Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
110
  journal={ArXiv},
111
  year={2023}
112
  }
113
  ```
114
+
115
  ## License
116
 
117
+ The Seamless Communication code and weights are CC-BY-NC 4.0 licensed.