Commit cb90476 by ylacombe (1 parent: c0ab532)

Update README.md

Files changed (1):
  1. README.md +4 -7

README.md CHANGED
@@ -24,7 +24,7 @@ This is the "large" variant of the unified model, which enables multiple tasks w
 - Text-to-text translation (T2TT)
 - Automatic speech recognition (ASR)
 
-You can perform all the above tasks from one single model - `SeamlessM4TModel`, but each task also has its own dedicated sub-model.
+You can perform all the above tasks from one single model, [`SeamlessM4TModel`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel), but each task also has its own dedicated sub-model.
 
 
 ## 🤗 Usage
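As an aside on the dedicated sub-models mentioned in the hunk above, here is a quick sketch; the class names come from the transformers SeamlessM4T integration, and availability may depend on your transformers version:

```python
# Dedicated per-task models; each omits the components its task does not need,
# which is what makes it lighter than the full SeamlessM4TModel.
from transformers import (
    SeamlessM4TForSpeechToSpeech,  # S2ST
    SeamlessM4TForSpeechToText,    # S2TT / ASR
    SeamlessM4TForTextToSpeech,    # T2ST
    SeamlessM4TForTextToText,      # T2TT
)
```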
@@ -42,7 +42,7 @@ You can seamlessly use this model on text or on audio, to generated either trans
 
 ### Speech
 
-You can easily generate translated speech with [`SeamlessM4TModel.generate`]. Here is an example showing how to generate speech from English to Russian.
+You can easily generate translated speech with [`SeamlessM4TModel.generate`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate). Here is an example showing how to generate speech from English to Russian.
 
 ```python
 inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
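The hunk truncates the snippet, so for context, here is a minimal end-to-end sketch; it assumes the `ylacombe/hf-seamless-m4t-medium` checkpoint used later in this README and that `AutoProcessor` resolves to the SeamlessM4T processor:

```python
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("ylacombe/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("ylacombe/hf-seamless-m4t-medium")

# English text in, Russian speech out; generate() returns audio by default
inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
audio_array = model.generate(**inputs, tgt_lang="rus")
```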
@@ -57,9 +57,7 @@ You can also translate directly from a speech waveform. Here is an example from
 from datasets import load_dataset
 
 dataset = load_dataset("arabic_speech_corpus", split="test[0:1]")
-
 audio_sample = dataset["audio"][0]["array"]
-
 inputs = processor(audios = audio_sample, return_tensors="pt")
 
 audio_array = model.generate(**inputs, tgt_lang="rus")
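One caveat the snippet glosses over: speech checkpoints usually expect a fixed sampling rate, and `arabic_speech_corpus` may be stored at a different one. A hedged sketch using `datasets`' `Audio` casting; the 16 kHz figure is an assumption based on common speech-model conventions, not something this README states:

```python
from datasets import Audio, load_dataset

dataset = load_dataset("arabic_speech_corpus", split="test[0:1]")
# Resample on the fly to 16 kHz (assumed model rate) before featurizing
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
audio_sample = dataset["audio"][0]["array"]

inputs = processor(audios=audio_sample, return_tensors="pt")
audio_array = model.generate(**inputs, tgt_lang="rus")
```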
@@ -86,7 +84,7 @@ scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sampling_rate, data=audio_ar
 
 #### Tips
 
-[`SeamlessM4TModel`] is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
+[`SeamlessM4TModel`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) is the Transformers top-level model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
 For example, you can replace the previous snippet with the model dedicated to the S2ST task:
 
 ```python
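Concretely, a minimal sketch of that swap, reusing the `inputs` and checkpoint from the snippets above:

```python
from transformers import SeamlessM4TForSpeechToSpeech

# Same generate() call, but only the speech-to-speech components are loaded
model = SeamlessM4TForSpeechToSpeech.from_pretrained("ylacombe/hf-seamless-m4t-medium")
audio_array = model.generate(**inputs, tgt_lang="rus")
```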
@@ -103,7 +101,6 @@ Similarly, you can generate translated text from text or audio files, this time
 from transformers import SeamlessM4TForSpeechToText
 model = SeamlessM4TForSpeechToText.from_pretrained("ylacombe/hf-seamless-m4t-medium")
 audio_sample = dataset["audio"][0]["array"]
-
 inputs = processor(audios = audio_sample, return_tensors="pt")
 
 output_tokens = model.generate(**inputs, tgt_lang="fra")
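The text-to-text case is symmetrical; a hedged sketch, assuming `SeamlessM4TForTextToText` mirrors the speech-to-text API shown above:

```python
from transformers import SeamlessM4TForTextToText

model = SeamlessM4TForTextToText.from_pretrained("ylacombe/hf-seamless-m4t-medium")
inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

output_tokens = model.generate(**inputs, tgt_lang="fra")
translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
```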
@@ -125,7 +122,7 @@ translated_text = processor.decode(output_tokens.tolist()[0], skip_special_token
 
 Three last tips:
 
-1. [`SeamlessM4TModel`] can generate text and/or speech. Pass `generate_speech=False` to [`SeamlessM4TModel.generate`] to only generate text. You also have the possibility to pass `return_intermediate_token_ids=True`, to get both text token ids and the generated speech.
+1. [`SeamlessM4TModel`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) can generate text and/or speech. Pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://moon-ci-docs.huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate) to generate only text. You can also pass `return_intermediate_token_ids=True` to get both the text token ids and the generated speech.
 2. You have the possibility to change the speaker used for speech synthesis with the `spkr_id` argument.
 3. You can use different [generation strategies](./generation_strategies) for speech and text generation, e.g `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)` which will successively perform beam-search decoding on the text model, and multinomial sampling on the speech model.
 
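A sketch combining the three tips; argument names are taken verbatim from this README, the `spkr_id` value is arbitrary, and exact signatures should be treated as version-dependent:

```python
# Tip 1: text-only generation from the top-level model
output_tokens = model.generate(**inputs, tgt_lang="rus", generate_speech=False)

# Tips 2 and 3: pick a speaker and mix decoding strategies per sub-model
audio_array = model.generate(
    **inputs,
    tgt_lang="rus",
    spkr_id=1,              # speaker used for speech synthesis (arbitrary choice)
    text_num_beams=4,       # beam-search decoding on the text model
    speech_do_sample=True,  # multinomial sampling on the speech model
)
```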
 
 