Sang-Hoon Lee committed
Commit aca1ebd
1 Parent(s): 0164e4a

Update README.md

Files changed (1)
  1. README.md +12 -269
README.md CHANGED
@@ -1,269 +1,12 @@
- # HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis
- The official implementation of HierSpeech2 | [Paper]() | [Demo page](https://sh-lee-prml.github.io/HierSpeechpp-demo/) | [Checkpoint](https://drive.google.com/drive/folders/1-L_90BlCkbPyKWWHTUjt5Fsu3kz0du0w?usp=sharing) |
-
- **Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee**
-
- Department of Artificial Intelligence, Korea University, Seoul, Korea
-
- ## Abstract
- ![image](https://github.com/sh-lee-prml/HierSpeechpp/assets/56749640/732bc183-bf11-4f32-84a9-e9eab8190e1a)
- <details>
- <summary> [Abs.] Sorry for the overly long abstract 😅 </summary>
-
-
- Recently, large language model (LLM)-based speech synthesis has shown significant performance in zero-shot speech synthesis. However, such models require large-scale data and suffer from the same limitations as previous autoregressive speech models, such as slow inference speed and lack of robustness. Following the powerful end-to-end text-to-speech framework of VITS (now considered classical), this paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). In our previous works (HierSpeech and HierVST), we verified that hierarchical speech synthesis frameworks can significantly improve the robustness and expressiveness of synthetic speech by adopting a hierarchical variational autoencoder and leveraging self-supervised speech representations as additional linguistic information to bridge the information gap between text and speech. In this work, we once again significantly improve the naturalness and speaker similarity of synthetic speech, even in zero-shot scenarios. We first introduce a multi-audio acoustic encoder for an enhanced acoustic posterior, and adopt a hierarchical adaptive waveform generator with conditional/unconditional generation. Second, we additionally utilize F0 information and introduce a source-filter theory-based multi-path semantic encoder for speaker-agnostic and speaker-related semantic representations. We also leverage a hierarchical variational autoencoder to connect these representations, and present BiT-Flow, a bidirectional normalizing flow Transformer network with AdaLN-Zero for better speaker adaptation and reduced train-inference mismatch. Without any text transcripts, we utilize only speech data to train the speech synthesizer for data flexibility. For text-to-speech, we introduce a text-to-vec (TTV) framework that generates a self-supervised speech representation and an F0 representation from a text representation and a prosody prompt. The speech synthesizer of HierSpeech++ then generates speech from the generated vector, F0, and voice prompt. In addition, we propose a highly efficient speech super-resolution framework that upsamples waveform audio from 16 kHz to 48 kHz; this facilitates training the speech synthesizer, since easily available low-resolution (16 kHz) speech data can be used for scaling up. The experimental results demonstrate that a hierarchical variational autoencoder can be a strong zero-shot speech synthesizer, beating LLM-based and diffusion-based models on TTS and VC tasks. Furthermore, we verify its data efficiency: our model trained on a small dataset still shows better naturalness and similarity than other models trained on large-scale datasets. Moreover, we achieve the first human-level quality in zero-shot speech synthesis.
- </details>
-
- This repository contains:
-
- - 🪐 A PyTorch implementation of HierSpeech++ (TTV, Hierarchical Speech Synthesizer, SpeechSR)
- - ⚡️ Pre-trained HierSpeech++ models trained on LibriTTS (Train-460, Train-960, and additional datasets)
-
- <!--
- - 💥 A Colab notebook for running pre-trained HierSpeech++ models (Soon..)
- 🛸 A HierSpeech++ training script (Will be released soon)
- -->
- ## Our Previous Works
- - [1] HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis
- - [2] HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
-
- This paper is an extended version of the above papers.
- ## Todo
- ### Hierarchical Speech Synthesizer
- - [x] HierSpeechpp-Backbone
- <!--
- - [ ] HierSpeech-Lite (Fast and Efficient Zero-shot Speech Synthesizer)
- - [ ] HierSinger (Zero-shot Singing Voice Synthesizer)
- - [ ] HierSpeech2-24k-Large-Full (For a High-resolution and High-quality Speech Synthesizer)
- - [ ] HierSpeech2-48k-Large-Full (For an Industrial-level High-resolution and High-quality Speech Synthesizer)
- -->
- ### Text-to-Vec (TTV)
- - [x] TTV-v1 (LibriTTS-train-960)
- - [ ] TTV-v2 (We are currently training a multi-lingual TTV model)
- <!--
- - [ ] Hierarchical Text-to-Vec (For Much More Expressive Text-to-Speech)
- -->
- ### Speech Super-resolution (16k --> 24k or 48k)
- - [x] SpeechSR-24k
- - [x] SpeechSR-48k
- ### Training code (Will be released after paper acceptance)
- - [ ] TTV
- - [ ] Hierarchical Speech Synthesizer
- - [ ] SpeechSR
- ## Getting Started
-
- ### Pre-requisites
- 0. PyTorch >= 1.13 and torchaudio >= 0.13
- 1. Install requirements
- ```
- pip install -r requirements.txt
- ```
- 2. Install Phonemizer
- ```
- pip install phonemizer
- sudo apt-get install espeak-ng
- ```
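- To check that Phonemizer can reach the espeak-ng backend, a quick sanity check (not part of the repository's scripts) should print a phoneme string:
- ```
- from phonemizer import phonemize
-
- # Prints something like "həloʊ wɜːld" if espeak-ng is installed correctly.
- print(phonemize("hello world", language="en-us", backend="espeak"))
- ```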
-
- ## Checkpoint [[Download]](https://drive.google.com/drive/folders/1-L_90BlCkbPyKWWHTUjt5Fsu3kz0du0w?usp=sharing)
- ### Hierarchical Speech Synthesizer
- | Model |Sampling Rate|Params|Dataset|Hour|Speaker|Checkpoint|
- |------|:---:|:---:|:---:|:---:|:---:|:---:|
- | HierSpeech2|16 kHz|97M| LibriTTS (train-460) |245|1,151|[[Download]](https://drive.google.com/drive/folders/14FTu0ZWux0zAD7ev4O1l6lKslQcdmebL?usp=sharing)|
- | HierSpeech2|16 kHz|97M| LibriTTS (train-960) |555|2,311|[[Download]](https://drive.google.com/drive/folders/1sFQP-8iS8z9ofCkE7szXNM_JEy4nKg41?usp=drive_link)|
- | HierSpeech2|16 kHz|97M| LibriTTS (train-960), Libri-light (Small, Medium), Expresso, MMS(Kor), NIKL(Kor)|2,796| 7,299 |[[Download]](https://drive.google.com/drive/folders/14jaDUBgrjVA7bCODJqAEirDwRlvJe272?usp=drive_link)|
-
- <!--
- | HierSpeech2-Lite|16 kHz|-| LibriTTS (train-960) |-|
- | HierSpeech2-Lite|16 kHz|-| LibriTTS (train-960), NIKL, AudioBook-Korean |-|
- | HierSpeech2-Large-CL|16 kHz|200M| LibriTTS (train-960), Libri-Light, NIKL, AudioBook-Korean, Japanese, Chinese, CSS, MLS |-|
- -->
-
- ### TTV
- | Model |Language|Params|Dataset|Hour|Speaker|Checkpoint|
- |------|:---:|:---:|:---:|:---:|:---:|:---:|
- | TTV |Eng|107M| LibriTTS (train-960) |555|2,311|[[Download]](https://drive.google.com/drive/folders/1QiFFdPhqhiLFo8VXc0x7cFHKXArx7Xza?usp=drive_link)|
-
-
- <!--
- | TTV |Kor|100M| NIKL |114|118|-|
- | TTV |Eng|50M| LibriTTS (train-960) |555|2,311|-|
- | TTV-Large |Eng|100M| LibriTTS (train-960) |555|2,311|-|
- | TTV-Lite |Eng|10M| LibriTTS (train-960) |555|2,311|-|
- | TTV |Kor|50M| NIKL |114|118|-|
- -->
- ### SpeechSR
- | Model |Sampling Rate|Params|Dataset |Checkpoint|
- |------|:---:|:---:|:---:|:---:|
- | SpeechSR-24k |16 kHz --> 24 kHz|0.13M| LibriTTS (train-960), MMS (Kor) |[speechsr24k](https://github.com/sh-lee-prml/HierSpeechpp/blob/main/speechsr24k/G_340000.pth)|
- | SpeechSR-48k |16 kHz --> 48 kHz|0.13M| MMS (Kor), Expresso (Eng), VCTK (Eng)|[speechsr48k](https://github.com/sh-lee-prml/HierSpeechpp/blob/main/speechsr48k/G_100000.pth)|
-
-
- ## Text-to-Speech
- ```
- sh inference.sh
-
- # --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
- # --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
- # --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
- # --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v2_ckpt.pth" \ Large_v2 epoch 110 (08. Nov. 2023)
-
- CUDA_VISIBLE_DEVICES=0 python3 inference.py \
- --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v2_ckpt.pth" \
- --ckpt_text2w2v "logs/ttv_libritts_v1/ttv_lt960_ckpt.pth" \
- --output_dir "tts_results_eng_kor_v2" \
- --noise_scale_vc "0.333" \
- --noise_scale_ttv "0.333" \
- --denoise_ratio "0"
-
- ```
- - For better robustness, we recommend a noise_scale of 0.333
- - For better expressiveness, we recommend a noise_scale of 0.667
- - Find your best parameters for your style prompt 😵
- ### Noise Control
- ```
- # without denoiser
- --denoise_ratio "0"
- # with denoiser
- --denoise_ratio "1"
- # Mixup (recommended: 0.6~0.8)
- --denoise_ratio "0.8"
- ```
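- Conceptually, `noise_scale` controls how far the sampled latent deviates from the prior mean (smaller is more robust, larger is more expressive), and `denoise_ratio` mixes the style extracted from the original prompt with the style extracted from the denoised prompt. A minimal PyTorch sketch of that idea (the function names below are illustrative, not the repository's API):
- ```
- import torch
-
- def sample_latent(mu, logvar, noise_scale=0.333):
-     # Reparameterized sampling: a smaller noise_scale stays close to the mean
-     # (more robust), a larger one adds more variation (more expressive).
-     return mu + noise_scale * torch.exp(0.5 * logvar) * torch.randn_like(mu)
-
- def mix_prompt_style(style_orig, style_denoised, denoise_ratio=0.8):
-     # Hypothetical mixup of the two style embeddings: 0 keeps only the
-     # original prompt, 1 uses only the denoised prompt.
-     return (1.0 - denoise_ratio) * style_orig + denoise_ratio * style_denoised
- ```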
-
- ## Voice Conversion
- - This method only utilizes the hierarchical speech synthesizer for voice conversion.
- ```
- sh inference_vc.sh
-
- # --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
- # --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
- # --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
- # --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v2_ckpt.pth" \ Large_v2 epoch 110 (08. Nov. 2023)
-
- CUDA_VISIBLE_DEVICES=0 python3 inference_vc.py \
- --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v2_ckpt.pth" \
- --output_dir "vc_results_eng_kor_v2" \
- --noise_scale_vc "0.333" \
- --noise_scale_ttv "0.333" \
- --denoise_ratio "0"
- ```
- - For better robustness, we recommend a noise_scale of 0.333
- - For better expressiveness, we recommend a noise_scale of 0.667
- - Find your best parameters for your style prompt 😵
- - Voice conversion is vulnerable to a noisy target prompt, so we recommend using the denoiser with noisy prompts.
- - For noisy source speech, YAAPT may extract a wrong F0, resulting in quality degradation.
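- If VC quality drops on noisy sources, it can help to inspect the extracted F0 before blaming the model. A rough check using librosa's pYIN (illustrative only; the repository's own pipeline uses YAAPT):
- ```
- import librosa
-
- # Load the (possibly noisy) source at the model's sampling rate and estimate F0.
- # Long unvoiced gaps or wild octave jumps usually indicate tracking errors.
- wav, sr = librosa.load("source.wav", sr=16000)
- f0, voiced_flag, voiced_prob = librosa.pyin(wav, fmin=65, fmax=400, sr=sr)
- ```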
-
-
- ## Speech Super-resolution
- - SpeechSR-24k and SpeechSR-48k are provided in the TTS pipeline. If you want to use SpeechSR only, please refer to inference_speechsr.py.
- - To change the output resolution, add one of the following:
- ```
- --output_sr "48000" # Default
- --output_sr "24000" # 24 kHz output
- --output_sr "16000" # without super-resolution
- ```
- ## Speech Denoising for Noise-free Speech Synthesis (Only used in the Speaker Encoder during Inference)
- - For the denoised style prompt, we utilize a denoiser, [MP-SENet](https://github.com/yxlu-0102/MP-SENet).
- - When using a long reference audio, this model runs into out-of-memory issues, so we plan to train a memory-efficient speech denoiser in the future.
- - If you hit this problem, we recommend using a clean reference audio, denoising the audio before the TTS pipeline, or running the denoiser on the CPU (but this will be slow 😥).
-
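- Another workaround is to denoise a long prompt in fixed-size chunks so the denoiser never sees the whole waveform at once. A rough sketch, assuming the denoiser is wrapped in a callable (`denoise_fn` is a placeholder, not the repository's API):
- ```
- import torch
-
- def denoise_long_audio(audio, denoise_fn, chunk_sec=10.0, sr=16000):
-     # Denoise chunk_sec-second slices one at a time and concatenate the results,
-     # trading a little quality at chunk boundaries for bounded memory use.
-     chunk = int(chunk_sec * sr)
-     parts = [denoise_fn(audio[..., i:i + chunk])
-              for i in range(0, audio.shape[-1], chunk)]
-     return torch.cat(parts, dim=-1)
- ```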
- ## TTV-v2
- - TTV-v1 is a simple model, only very slightly modified from VITS. Although this simple TTV can synthesize speech with high quality and high speaker similarity, we think there is room for improvement in terms of expressiveness, such as prosody modeling.
- - For TTV-v2, we modify some components and the training process (model size: 107M --> 278M):
- 1. Intermediate hidden size: 256 --> 384
- 2. Loss masking for the wav2vec reconstruction loss (we previously left out masking the loss over zero-padded sequences 😥; see the sketch after this list)
- 3. For long-sentence generation, we fine-tune the model on the full LibriTTS train set without data filtering (learning rate decreased to 2e-5, batch size of 8 per GPU)
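- A minimal sketch of masking a reconstruction loss so that zero-padded frames do not contribute to the gradient (tensor names and shapes are assumptions for illustration, not the actual training code):
- ```
- import torch
- import torch.nn.functional as F
-
- def masked_recon_loss(pred, target, lengths):
-     # pred/target: [batch, frames, dim]; lengths: number of valid frames per item.
-     frames = torch.arange(pred.size(1), device=pred.device)
-     mask = (frames[None, :] < lengths[:, None]).unsqueeze(-1)   # [batch, frames, 1]
-     loss = F.l1_loss(pred, target, reduction="none") * mask
-     # Average over valid frames and feature dimensions only.
-     return loss.sum() / (mask.sum().clamp(min=1) * pred.size(-1))
- ```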
-
- ## GAN VS Diffusion
- <details>
- <summary> [Read More] </summary>
- We do not think it is possible to say which is better yet. Each type of model has many advantages, so you can utilize whichever suits your own purposes, and both lines of research should be actively pursued simultaneously.
-
- ### GAN (Specifically, GAN-based End-to-End Speech Synthesis Models)
- - (pros) Fast Inference Speed
- - (pros) High-quality Audio
- - (cons) Slow Training Speed (Over 7~20 Days)
- - (cons) Lower Voice Style Transfer Performance than Diffusion Models
- - (cons) Perceptually High-quality but Over-smoothed Audio because of the Information Bottleneck from Sampling a Low-dimensional Latent Variable
-
- ### Diffusion (Diffusion-based Mel-spectrogram Generation Models)
- - (pros) Fast Training Speed (within 3 Days)
- - (pros) High-quality Voice Style Transfer
- - (cons) Slow Inference Speed
- - (cons) Lower Audio Quality than End-to-End Speech Synthesis Models
-
- ### (In this work) Our Approaches for GAN-based End-to-End Speech Synthesis Models
- - Improving Voice Style Transfer Performance in End-to-End Speech Synthesis Models for OOD (Zero-shot Voice Style Transfer for Novel Speakers)
- - Improving the Audio Quality beyond Perceptual Quality for Much More High-fidelity Audio Generation
-
- ### (Our other works) Diffusion-based Mel-spectrogram Generation Models
- - DDDM-VC: Disentangled Denoising Diffusion Models for High-quality and High-diversity Speech Synthesis
- - Diff-HierVC: Hierarchical Diffusion-based Speech Synthesis Model with Diffusion-based Pitch Modeling
-
- ### Our Goals
- - Integrating these models for High-quality, High-diversity, and High-fidelity Speech Synthesis
- </details>
-
- ## LLM-based Models
- We hope to compare against LLM-based models as zero-shot TTS baselines. However, there is no publicly available official implementation of an LLM-based TTS model. Unfortunately, the unofficial models perform poorly in zero-shot TTS, so we hope the authors will release their models for fair comparison, for reproducibility, and for our speech community. TBH, I could not stand the inference speed, which is almost 1,000 times slower than end-to-end models: it took 5 days to synthesize the full sentences of the LibriTTS test subsets, and even then the audio quality was poor. I hope they will release their official source code soon.
-
- In my very personal opinion, VITS is still the best TTS model I have ever seen. However, I acknowledge that LLM-based models have far more powerful potential for creative generation from large-scale datasets, just not yet.
-
- ## Limitations of our work
- - Slow training speed and relatively large model size (compared with VITS) --> Future work: a lightweight and fast training pipeline, and a much larger model...
- - Cannot generate realistic background sound --> Future work: adding an audio generation part by disentangling speech and sound.
- - Cannot generate speech from overly long sentences because of our training setting. We expect that increasing the max length could improve model performance; however, we do not have GPUs with 80 GB 😢
- ```
- # Data filtering for limited computational resources
- wav_min = 32
- wav_max = 600 # 12s
- text_min = 1
- text_max = 200
- ```
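- A sketch of how such thresholds would be applied when building the training file list (assuming `wav_min`/`wav_max` count feature frames at roughly 50 frames per second of 16 kHz audio, so 600 frames is about 12 s; the helper below is illustrative, not the repository's loader):
- ```
- def keep_sample(num_frames, num_tokens,
-                 wav_min=32, wav_max=600, text_min=1, text_max=200):
-     # Drop utterances that are too short or too long for the training budget.
-     return (wav_min <= num_frames <= wav_max) and (text_min <= num_tokens <= text_max)
- ```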
- TTV v2 may reduce this issue significantly...!
-
- ## Results [[Download]](https://drive.google.com/drive/folders/1xCrZQy9s5MT38RMQxKAtkoWUgxT5qYYW?usp=sharing)
- We have attached all samples from LibriTTS test-clean and test-other.
-
- ## Reference
- <details>
- <summary> [Read More] </summary>
-
- ### Our Previous Works
- - HierSpeech/HierSpeech-U for the Hierarchical Speech Synthesis Framework: https://openreview.net/forum?id=awdyRVnfQKX
- - HierVST for the Baseline Speech Backbone: https://www.isca-speech.org/archive/interspeech_2023/lee23i_interspeech.html
- - DDDM-VC: https://dddm-vc.github.io/
- - Diff-HierVC: https://diff-hiervc.github.io/
-
- ### Baseline Models
- - VITS: https://github.com/jaywalnut310/vits
- - NaturalSpeech
- - NANSY for Audio Perturbation: https://github.com/revsic/torch-nansy
- - Speech Resynthesis: https://github.com/facebookresearch/speech-resynthesis
-
- ### Waveform Generators for High-quality Audio Generation
- - HiFi-GAN: https://github.com/jik876/hifi-gan
- - BigVGAN for the High-quality Generator: https://arxiv.org/abs/2206.04658
- - UnivNet: https://github.com/mindslab-ai/univnet
- - EnCodec: https://github.com/facebookresearch/encodec
-
- ### Self-supervised Speech Models
- - Wav2Vec 2.0: https://arxiv.org/abs/2006.11477
- - XLS-R: https://huggingface.co/facebook/wav2vec2-xls-r-300m
- - MMS: https://huggingface.co/facebook/mms-300m
-
- ### Other Large Language Model-based Speech Synthesis Models
- - VALL-E & VALL-E-X
- - SPEAR-TTS
- - NaturalSpeech 2
- - Make-a-Voice
- - MEGA-TTS & MEGA-TTS 2
- - UniAudio
-
- ### AdaLN-Zero
- - DiT: https://github.com/facebookresearch/DiT
-
- Thanks for all of these nice works.
- </details>
-
- ## LICENSE
- - Code in this repo: MIT License
- - Model Weights: CC-BY-NC-4.0 license
 
+ ---
+ title: Test
+ emoji: ⚡
+ colorFrom: gray
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 4.4.0
+ app_file: app.py
+ pinned: false
+ license: cc-by-nc-4.0
+ ---
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference