Files changed (4)
  1. LJ050-0075.wav +0 -0
  2. README.md +151 -1
  3. diffwave.ckpt +3 -0
  4. hyperparams.yaml +44 -0
LJ050-0075.wav ADDED
Binary file (86.9 kB).
 
README.md CHANGED
@@ -1,3 +1,153 @@
  ---
- license: apache-2.0
+ language: "en"
+ inference: false
+ tags:
+ - Vocoder
+ - DiffWave
+ - text-to-speech
+ - TTS
+ - speech-synthesis
+ - speechbrain
+ license: "apache-2.0"
+ datasets:
+ - LJSpeech
  ---
+
+ # Vocoder with DiffWave trained on LJSpeech
+
+ This repository provides all the necessary tools for using a [DiffWave](https://arxiv.org/pdf/2009.09761.pdf) vocoder trained on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
+
+ The pre-trained model takes a spectrogram as input and generates a waveform as output. Typically, a vocoder is used after a TTS model that converts input text into a spectrogram.
+
+ The sampling frequency is 22050 Hz.
+
+
+ ## Install SpeechBrain
+
+ ```bash
+ pip install speechbrain
+ ```
+
+ Please note that we encourage you to read our tutorials and learn more about
+ [SpeechBrain](https://speechbrain.github.io).
+
+ ### Using the Vocoder as a reconstructor
+ ```python
+ import torch
+ import torchaudio
+ import speechbrain as sb
+ from speechbrain.pretrained import DiffWaveVocoder
+ from speechbrain.lobes.models.HifiGAN import mel_spectogram
+
+ diffwave = DiffWaveVocoder.from_hparams(source="speechbrain/tts-diffwave-ljspeech", savedir="tmpdir")
+
+ # Load a reference waveform and add a batch dimension
+ audio = sb.dataio.dataio.read_audio("speechbrain/tts-diffwave-ljspeech/LJ050-0075.wav")
+ audio = torch.FloatTensor(audio)
+ audio = audio.unsqueeze(0)
+
+ # Extract the mel spectrogram the vocoder expects
+ mel = mel_spectogram(
+     sample_rate=22050,
+     hop_length=256,
+     win_length=1024,
+     n_fft=1024,
+     n_mels=80,
+     f_min=0,
+     f_max=8000,
+     power=1.0,
+     normalized=False,
+     norm="slaney",
+     mel_scale="slaney",
+     compression=True,
+     audio=audio,
+ )
+
+ # Run the vocoder (spectrogram-to-waveform). Fast sampling can be achieved by
+ # passing a user-defined variance schedule: according to the paper, high-quality
+ # audio can be generated with only 6 steps (instead of the full 50).
+ waveforms = diffwave.decode_batch(
+     mel,
+     hop_len=256,  # upsampling factor; must match the hop_length used to extract the mel spectrogram
+     fast_sampling=True,  # fast sampling is highly recommended
+     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],  # customized noise schedule
+ )
+
+ torchaudio.save('reconstructed.wav', waveforms.squeeze(1), 22050)
+ ```
+
+ ### Using the Vocoder with TTS
+ ```python
+ import torchaudio
+ from speechbrain.pretrained import FastSpeech2
+ from speechbrain.pretrained import DiffWaveVocoder
+
+ # Initialize TTS (FastSpeech2) and Vocoder (DiffWave)
+ fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir="tmpdir_tts")
+ diffwave = DiffWaveVocoder.from_hparams(source="speechbrain/tts-diffwave-ljspeech", savedir="tmpdir_vocoder")
+
+ input_text = "This is a test run with FastSpeech and DiffWave."
+
+ # Run the TTS (text-to-spectrogram)
+ mel_output, durations, pitch, energy = fastspeech2.encode_text(
+     [input_text],
+     pace=1.0,  # scale up/down the speed
+     pitch_rate=1.0,  # scale up/down the pitch
+     energy_rate=1.0,  # scale up/down the energy
+ )
+
+ # Run the vocoder (spectrogram-to-waveform). Fast sampling can be achieved by
+ # passing a user-defined variance schedule: according to the paper, high-quality
+ # audio can be generated with only 6 steps (instead of the full 50).
+ waveforms = diffwave.decode_batch(
+     mel_output,
+     hop_len=256,  # upsampling factor; must match the hop_length used to extract the mel spectrogram
+     fast_sampling=True,  # fast sampling is highly recommended
+     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],  # customized noise schedule
+ )
+
+ # Save the waveform
+ torchaudio.save('example_TTS.wav', waveforms.squeeze(1), 22050)
+ ```
+
+ ### Inference on GPU
+ To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
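+
+ A minimal sketch, reusing the vocoder source from above:
+
+ ```python
+ from speechbrain.pretrained import DiffWaveVocoder
+
+ # Load the pretrained vocoder directly onto the GPU
+ diffwave = DiffWaveVocoder.from_hparams(
+     source="speechbrain/tts-diffwave-ljspeech",
+     savedir="tmpdir_vocoder",
+     run_opts={"device": "cuda"},
+ )
+ ```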
+
+ ### Training
+ The model was trained with SpeechBrain.
+ To train it from scratch, follow these steps:
+ 1. Clone SpeechBrain:
+ ```bash
+ git clone https://github.com/speechbrain/speechbrain/
+ ```
+ 2. Install it:
+ ```bash
+ cd speechbrain
+ pip install -r requirements.txt
+ pip install -e .
+ ```
+ 3. Run training:
+ ```bash
+ cd recipes/LJSpeech/TTS/vocoder/diffwave/
+ python train.py hparams/train.yaml --data_folder /path/to/LJspeech
+ ```
+ You can find our training results (models, logs, etc.) [here](https://www.dropbox.com/sh/tbhpn1xirtaix68/AACvYaVDiUGAKURf2o-fvgMoa?dl=0).
+
+
+ ### Limitations
+ The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
+
+ # **About SpeechBrain**
+ - Website: https://speechbrain.github.io/
+ - Code: https://github.com/speechbrain/speechbrain/
+ - HuggingFace: https://huggingface.co/speechbrain/
+
+
+ # **Citing SpeechBrain**
+ Please cite SpeechBrain if you use it for your research or business.
+
+ ```bibtex
+ @misc{speechbrain,
+   title={{SpeechBrain}: A General-Purpose Speech Toolkit},
+   author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
+   year={2021},
+   eprint={2106.04624},
+   archivePrefix={arXiv},
+   primaryClass={eess.AS},
+   note={arXiv:2106.04624}
+ }
+ ```
diffwave.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b434ac56f45486ec899e4ef50fb1bab09443bbe4ccc844acae35486e89541fc2
+ size 10582085
hyperparams.yaml ADDED
@@ -0,0 +1,44 @@
+ # ################################################
+ # Basic parameters for a diffwave vocoder
+ #
+ # Author:
+ # * Yingzhi Wang 2022
+ # ################################################
+
+ train_timesteps: 50
+ beta_start: 0.0001
+ beta_end: 0.05
+
+ residual_layers: 30
+ residual_channels: 64
+ dilation_cycle_length: 10
+
+ unconditional: False
+
+ spec_n_mels: 80
+ spec_hop_length: 256
+
+ diffwave: !new:speechbrain.lobes.models.DiffWave.DiffWave
+     input_channels: !ref <spec_n_mels>
+     residual_layers: !ref <residual_layers>
+     residual_channels: !ref <residual_channels>
+     dilation_cycle_length: !ref <dilation_cycle_length>
+     total_steps: !ref <train_timesteps>
+     unconditional: !ref <unconditional>
+
+ noise: !new:speechbrain.nnet.diffusion.GaussianNoise
+
+ diffusion: !new:speechbrain.lobes.models.DiffWave.DiffWaveDiffusion
+     model: !ref <diffwave>
+     beta_start: !ref <beta_start>
+     beta_end: !ref <beta_end>
+     timesteps: !ref <train_timesteps>
+     noise: !ref <noise>
+
+ modules:
+     diffwave: !ref <diffwave>
+     diffusion: !ref <diffusion>
+
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+     loadables:
+         diffwave: !ref <diffwave>
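
The hyperparams file above wires the DiffWave modules together with HyperPyYAML; `DiffWaveVocoder.from_hparams` in the README parses it and restores the checkpoint weights for you. A minimal sketch of instantiating it directly, assuming the `hyperpyyaml` package that SpeechBrain depends on:

```python
from hyperpyyaml import load_hyperpyyaml

# Parse hyperparams.yaml; each !new: tag instantiates the referenced class
with open("hyperparams.yaml") as f:
    hparams = load_hyperpyyaml(f)

diffwave = hparams["diffwave"]    # the DiffWave network (untrained weights here)
diffusion = hparams["diffusion"]  # the diffusion wrapper used for sampling

# The pretrainer entry is what maps the diffwave.ckpt weights onto the model
```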