[DX] Clearer instructions for SpeechT5 (#23)

- [DX] Clearer instructions for SpeechT5 (f91af75e4b2b74475c44547936778c49279ef50d)
- Update README.md (214783a35e2c4f2f6b2bd7a792b1589fae805363)

Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>

Files changed (1)

README.md +15 -13

@@ -47,14 +47,20 @@ Extensive evaluations show the superiority of the proposed SpeechT5 framework on
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-## How to Get Started With the Model
-You can access the SpeechT5 model via the `Text-to-Speech` pipeline in just a couple lines of code!
-```python
-# Following pip packages need to be installed:
-# !pip install transformers sentencepiece datasets
 from transformers import pipeline
 from datasets import load_dataset
 import soundfile as sf
@@ -62,21 +68,17 @@ import soundfile as sf
 synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
-speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
 # You can replace this embedding with your own as well.
-speech = pipe("Hello what is happening", forward_params={"speaker_embeddings": speaker_embeddings})
 sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
 ```
-For more fine-grained control you can use the processor + generate code to convert text into a mono 16 kHz speech waveform.
 ```python
-# Following pip packages need to be installed:
-# !pip install transformers sentencepiece datasets
 from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
 from datasets import load_dataset
 import torch
@@ -87,7 +89,7 @@ processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
 model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
 vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
-inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
 # load xvector containing speaker's voice characteristics from a dataset
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+## 🤗 Transformers Usage
+You can run SpeechT5 TTS locally with the 🤗 Transformers library.
+1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers), sentencepiece, soundfile and datasets (optional):
+```
+pip install --upgrade pip
+pip install --upgrade transformers sentencepiece datasets[audio]
+```
+2. Run inference via the `Text-to-Speech` (TTS) pipeline. You can access the SpeechT5 model via the TTS pipeline in just a few lines of code!
+```python
 from transformers import pipeline
 from datasets import load_dataset
 import soundfile as sf
+import torch
 synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
+speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
 # You can replace this embedding with your own as well.
+speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"speaker_embeddings": speaker_embedding})
 sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
 ```
+3. Run inference via the Transformers modelling code. For more fine-grained control, you can use the processor + generate code to convert text into a mono 16 kHz speech waveform.
 ```python
 from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
 from datasets import load_dataset
 import torch
 model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
 vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")
 # load xvector containing speaker's voice characteristics from a dataset
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")