Commit a000471
ylacombe and reach-vb committed
1 Parent(s): 39b288d

[DX] Clearer instructions for SpeechT5 (#23)


- [DX] Clearer instructions for SpeechT5 (f91af75e4b2b74475c44547936778c49279ef50d)
- Update README.md (214783a35e2c4f2f6b2bd7a792b1589fae805363)


Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +15 -13
README.md CHANGED
@@ -47,14 +47,20 @@ Extensive evaluations show the superiority of the proposed SpeechT5 framework on
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
-## How to Get Started With the Model
+## 🤗 Transformers Usage
 
-You can access the SpeechT5 model via the `Text-to-Speech` pipeline in just a couple lines of code!
+You can run SpeechT5 TTS locally with the 🤗 Transformers library.
 
-```python
-# Following pip packages need to be installed:
-# !pip install transformers sentencepiece datasets
+1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers), sentencepiece, soundfile and datasets (optional):
+
+```
+pip install --upgrade pip
+pip install --upgrade transformers sentencepiece datasets[audio]
+```
+
+2. Run inference via the `Text-to-Speech` (TTS) pipeline. You can access the SpeechT5 model via the TTS pipeline in just a few lines of code!
 
+```python
 from transformers import pipeline
 from datasets import load_dataset
 import soundfile as sf
@@ -62,21 +68,17 @@ import soundfile as sf
 synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
 
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
-speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
 # You can replace this embedding with your own as well.
 
-speech = pipe("Hello what is happening", forward_params={"speaker_embeddings": speaker_embeddings})
+speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"speaker_embeddings": speaker_embedding})
 
 sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
-
 ```
 
-For more fine-grained control you can use the processor + generate code to convert text into a mono 16 kHz speech waveform.
+3. Run inference via the Transformers modelling code: you can use the processor + generate code to convert text into a mono 16 kHz speech waveform for more fine-grained control.
 
 ```python
-# Following pip packages need to be installed:
-# !pip install transformers sentencepiece datasets
-
 from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
 from datasets import load_dataset
 import torch
@@ -87,7 +89,7 @@ processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
 model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
 vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
 
-inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
+inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")
 
 # load xvector containing speaker's voice characteristics from a dataset
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
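
The diff shows the fine-grained example only up to loading the x-vector dataset, because the remaining lines of that snippet were unchanged and fall outside the hunks. For reference, the sketch below continues from that point; it assumes the `generate_speech` method of `SpeechT5ForTextToSpeech` and reuses the checkpoint ids and speaker index (7306) already named in the snippets above.

```python
# Sketch: the full fine-grained SpeechT5 path, continuing past the diff hunks above.
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

# Load an x-vector containing the speaker's voice characteristics from a dataset.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Generate a mono 16 kHz waveform with the HiFi-GAN vocoder and write it to disk.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)
```

Note that the pipeline snippet in the diff builds the speaker embedding with `torch.tensor`, so `import torch` is needed there alongside `transformers`, `datasets`, and `soundfile`.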