ylacombe committed
Commit 1dc9051
1 Parent(s): ca1c0f1

Update README.md

Files changed (1):
  1. README.md +49 -35

README.md CHANGED
@@ -24,90 +24,104 @@ This is the "medium" variant of the unified model, which enables multiple tasks
  - Text-to-text translation (T2TT)
  - Automatic speech recognition (ASR)

- You can perform all the above tasks from one single model - `SeamlessM4TModel`, but each task also has its own dedicated sub-model.


- ## Usage

  First, load the processor and a checkpoint of the model:

  ```python
- >>> from transformers import AutoProcessor, SeamlessM4TModel

- >>> processor = AutoProcessor.from_pretrained("ylacombe/hf-seamless-m4t-medium")
- >>> model = SeamlessM4TModel.from_pretrained("ylacombe/hf-seamless-m4t-medium")
  ```

  You can seamlessly use this model on text or on audio to generate either translated text or translated audio.

  ### Speech

- You can easily generate translated speech with [`SeamlessM4TModel.generate`]. Here is an example showing how to generate speech from English to Russian.

  ```python
- >>> inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

- >>> audio_array = model.generate(**inputs, tgt_lang="rus")
- >>> audio_array = audio_array[0].cpu().numpy().squeeze()
  ```

  You can also translate directly from a speech waveform. Here is an example from Arabic to Russian:

  ```python
- >>> from datasets import load_dataset

- >>> dataset = load_dataset("arabic_speech_corpus", split="test[0:1]")

- >>> audio_sample = dataset["audio"][0]["array"]
-
- >>> inputs = processor(audios=audio_sample, return_tensors="pt")

- >>> audio_array = model.generate(**inputs, tgt_lang="rus")
- >>> audio_array = audio_array[0].cpu().numpy().squeeze()
  ```

  #### Tips

- [`SeamlessM4TModel`] is the top-level 🤗 Transformers model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
  For example, you can replace the previous snippet with the model dedicated to the S2ST task:

  ```python
- >>> from transformers import SeamlessM4TForSpeechToSpeech
- >>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("ylacombe/hf-seamless-m4t-medium")
  ```


  ### Text

- Similarly, you can generate translated text from text or audio files, this time using the dedicated models.

  ```python
- >>> from transformers import SeamlessM4TForSpeechToText
- >>> model = SeamlessM4TForSpeechToText.from_pretrained("ylacombe/hf-seamless-m4t-medium")
- >>> audio_sample = dataset["audio"][0]["array"]
-
- >>> inputs = processor(audios=audio_sample, return_tensors="pt")
-
- >>> output_tokens = model.generate(**inputs, tgt_lang="fra")
- >>> translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
  ```

  And from text:

  ```python
- >>> from transformers import SeamlessM4TForTextToText
- >>> model = SeamlessM4TForTextToText.from_pretrained("ylacombe/hf-seamless-m4t-medium")
- >>> inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

- >>> output_tokens = model.generate(**inputs, tgt_lang="fra")
- >>> translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
  ```

  #### Tips

  Three final tips:

- 1. [`SeamlessM4TModel`] can generate text and/or speech. Pass `generate_speech=False` to [`SeamlessM4TModel.generate`] to only generate text. You can also pass `return_intermediate_token_ids=True` to get both the text token ids and the generated speech.
  2. You can change the speaker used for speech synthesis with the `spkr_id` argument.
  3. You can use different [generation strategies](./generation_strategies) for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which will successively perform beam-search decoding on the text model and multinomial sampling on the speech model.
-
 
  - Text-to-text translation (T2TT)
  - Automatic speech recognition (ASR)

+ You can perform all the above tasks from one single model, [`SeamlessM4TModel`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel), but each task also has its own dedicated sub-model.


+ ## 🤗 Usage

  First, load the processor and a checkpoint of the model:

  ```python
+ from transformers import AutoProcessor, SeamlessM4TModel

+ processor = AutoProcessor.from_pretrained("ylacombe/hf-seamless-m4t-medium")
+ model = SeamlessM4TModel.from_pretrained("ylacombe/hf-seamless-m4t-medium")
  ```
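
If you have a GPU available, the usual PyTorch device handling applies. This is a generic sketch, not something specific to this checkpoint:

```python
import torch

# move the model to GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# processor outputs support the same method and must be moved too,
# e.g. inputs = inputs.to(device) before calling model.generate
```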

  You can seamlessly use this model on text or on audio to generate either translated text or translated audio.

  ### Speech

+ You can easily generate translated speech with [`SeamlessM4TModel.generate`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate). Here is an example showing how to generate speech from English to Russian.

  ```python
+ inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

+ audio_array = model.generate(**inputs, tgt_lang="rus")
+ audio_array = audio_array[0].cpu().numpy().squeeze()
  ```

  You can also translate directly from a speech waveform. Here is an example from Arabic to Russian:

  ```python
+ from datasets import load_dataset, Audio

+ dataset = load_dataset("arabic_speech_corpus", split="test[0:1]")
+ # the model expects 16 kHz input, so resample the 48 kHz corpus audio first
+ dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
+ audio_sample = dataset["audio"][0]["array"]
+ inputs = processor(audios=audio_sample, return_tensors="pt")

+ audio_array = model.generate(**inputs, tgt_lang="rus")
+ audio_array = audio_array[0].cpu().numpy().squeeze()
+ ```
+
+ You can listen to the speech samples either in an ipynb notebook:
+
+ ```python
+ from IPython.display import Audio

+ sampling_rate = model.config.sampling_rate
+ Audio(audio_array, rate=sampling_rate)
+ ```
+
+ Or save them as a `.wav` file using a third-party library, e.g. `scipy`:
+
+ ```python
+ import scipy
+
+ sampling_rate = model.config.sampling_rate
+ scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sampling_rate, data=audio_array)
  ```

  #### Tips

+ [`SeamlessM4TModel`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) is the top-level 🤗 Transformers model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
  For example, you can replace the previous snippet with the model dedicated to the S2ST task:

  ```python
+ from transformers import SeamlessM4TForSpeechToSpeech
+ model = SeamlessM4TForSpeechToSpeech.from_pretrained("ylacombe/hf-seamless-m4t-medium")
  ```
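
Generation then works exactly as in the snippets above. As a quick sketch, assuming the Arabic `inputs` from the previous section are still in scope and that the dedicated model shares the same `generate` call shape:

```python
# speech-to-speech translation with the dedicated model; same tgt_lang as before
audio_array = model.generate(**inputs, tgt_lang="rus")
audio_array = audio_array[0].cpu().numpy().squeeze()
```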


  ### Text

+ Similarly, you can generate translated text from text or audio files. This time, let's use the dedicated models as an example.

  ```python
+ from transformers import SeamlessM4TForSpeechToText
+ model = SeamlessM4TForSpeechToText.from_pretrained("ylacombe/hf-seamless-m4t-medium")
+ audio_sample = dataset["audio"][0]["array"]
+ inputs = processor(audios=audio_sample, return_tensors="pt")
+
+ output_tokens = model.generate(**inputs, tgt_lang="fra")
+ translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
  ```
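
The ASR task listed at the top goes through the same speech-to-text model: set `tgt_lang` to the language spoken in the audio to transcribe instead of translate. A minimal sketch, where `"arb"` (Modern Standard Arabic) is our assumption for the language code of this corpus:

```python
# transcription rather than translation: target the source language
# "arb" (Modern Standard Arabic) is assumed to match the corpus audio
output_tokens = model.generate(**inputs, tgt_lang="arb")
transcription = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
```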

  And from text:

  ```python
+ from transformers import SeamlessM4TForTextToText
+ model = SeamlessM4TForTextToText.from_pretrained("ylacombe/hf-seamless-m4t-medium")
+ inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

+ output_tokens = model.generate(**inputs, tgt_lang="fra")
+ translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
  ```

  #### Tips

  Three final tips:

+ 1. [`SeamlessM4TModel`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) can generate text and/or speech. Pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate) to only generate text. You can also pass `return_intermediate_token_ids=True` to get both the text token ids and the generated speech (see the sketch after this list).
  2. You can change the speaker used for speech synthesis with the `spkr_id` argument.
  3. You can use different [generation strategies](./generation_strategies) for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which will successively perform beam-search decoding on the text model and multinomial sampling on the speech model.
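
As a closing illustration, here is a minimal sketch that combines these three tips on the unified model. It reloads the same checkpoint as at the top of the Usage section; the `spkr_id` value is an arbitrary choice:

```python
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("ylacombe/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("ylacombe/hf-seamless-m4t-medium")

inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

# Tip 1: text-only generation from the unified model
text_tokens = model.generate(**inputs, tgt_lang="rus", generate_speech=False)

# Tips 2 and 3: choose a speaker and mix generation strategies per modality
audio_array = model.generate(
    **inputs,
    tgt_lang="rus",
    spkr_id=1,              # speaker used for speech synthesis (arbitrary choice)
    text_num_beams=4,       # beam-search decoding on the text model
    speech_do_sample=True,  # multinomial sampling on the speech model
)
audio_array = audio_array[0].cpu().numpy().squeeze()
```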