xihan123 committed
Commit e96a1ea • 1 Parent(s): 9038456

Upload README.md

Files changed (1)
  1. README.md +316 -1

README.md CHANGED
@@ -1,3 +1,318 @@
 ---
 license: afl-3.0
- ---

+ <div align="center">
+ <h1> Variational Inference with Adversarial Learning for End-to-End Singing Voice Conversion Based on VITS </h1>
+
+ [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/maxmax20160403/sovits5.0)
+ [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PY1E4bDAeHbAD4r99D_oYXB46fG8nIA5?usp=sharing)
+ <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PlayVoice/so-vits-svc-5.0">
+ <img alt="GitHub forks" src="https://img.shields.io/github/forks/PlayVoice/so-vits-svc-5.0">
+ <img alt="GitHub issues" src="https://img.shields.io/github/issues/PlayVoice/so-vits-svc-5.0">
+ <img alt="GitHub" src="https://img.shields.io/github/license/PlayVoice/so-vits-svc-5.0">
+
+ </div>
+
+ - 💗This project is aimed at beginners in deep learning: basic familiarity with Python and PyTorch is a prerequisite for using it;
+ - 💗This project aims to help deep-learning beginners move beyond purely theoretical study and master the basics of deep learning through hands-on practice;
+ - 💗This project does not support real-time voice conversion (supporting it would require replacing whisper);
+ - 💗This project will not develop one-click packages for other purposes;
+
+ ![sovits_framework](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/402cf58d-6d03-4d0b-9d6a-94f079898672)
+
+ - A GPU with 6 GB of memory is enough for training
+
+ - Supports multiple speakers
+
+ - Create unique speakers through speaker mixing
+
+ - Audio with light accompaniment can still be converted
+
+ - F0 can be edited in Excel
+
+ ## Model properties
+
+ https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/hifigan_release
+
+ - [sovits5.0_main_1500.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/download/hifigan_release/sovits5.0_main_1500.pth) contains the generator and discriminator (176 MB) and can be used as a pre-trained model
+ - speaker files are in the configs/singers directory; they can be used for inference tests, especially for checking timbre leakage
+ - speakers 22, 30, 47, and 51 are highly recognizable; their training audio samples are in the configs/singers_sample directory
+
+ | Feature | From | Status | Function | Remarks |
+ | --- | --- | --- | --- | --- |
+ | whisper | OpenAI | ✅ | strong noise immunity | - |
+ | bigvgan | NVIDIA | ✅ | anti-aliasing and Snake activation | Uses a little more GPU memory; removed from the main branch, so you need to switch to the [bigvgan](https://github.com/PlayVoice/so-vits-svc-5.0/tree/bigvgan) branch. Formants are clearer and sound quality is noticeably improved |
+ | natural speech | Microsoft | ✅ | reduces mispronunciation | - |
+ | neural source-filter | NII | ✅ | solves the problem of F0 discontinuity in audio | - |
+ | speaker encoder | Google | ✅ | timbre encoding and clustering | - |
+ | GRL for speaker | Ubisoft | ✅ | prevents the encoder from leaking timbre | - |
+ | one shot vits | Samsung | ✅ | voice cloning | - |
+ | SCLN | Microsoft | ✅ | improves cloning | - |
+ | PPG perturbation | this project | ✅ | improves noise immunity and timbre removal | - |
+ | VAE perturbation | this project | ✅ | improves sound quality | - |
+
+ 💗Because of the data perturbation, this project takes longer to train than comparable projects.
+
+ ## Dataset preparation
+
+ Necessary pre-processing:
+ - 1 accompaniment separation
+ - 2 band extension
+ - 3 sound quality improvement
+ - 4 cut audio into clips shorter than 30 seconds, as whisper requires💗 (a minimal splitting sketch is shown below)
+
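+ The splitting itself is not part of this project's scripts; below is a minimal sketch using ffmpeg's segment muxer. The input path `raw_recordings/` and the 25-second segment length are assumptions to adapt to your own data.
+
+ ```shell
+ # Hypothetical layout: raw_recordings/speaker0/*.wav holds long recordings.
+ mkdir -p dataset_raw/speaker0
+ for f in raw_recordings/speaker0/*.wav; do
+   name=$(basename "$f" .wav)
+   # Split into ~25 s chunks (under whisper's 30 s limit) without re-encoding.
+   ffmpeg -i "$f" -f segment -segment_time 25 -c copy \
+     "dataset_raw/speaker0/${name}_%03d.wav"
+ done
+ ```
+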
+ Then put the dataset into the dataset_raw directory according to the following file structure:
+ ```shell
+ dataset_raw
+ ├───speaker0
+ │   ├───000001.wav
+ │   ├───...
+ │   └───000xxx.wav
+ └───speaker1
+     ├───000001.wav
+     ├───...
+     └───000xxx.wav
+ ```
+
+ ## Install dependencies
+
+ - 1 software dependencies
+
+ > sudo apt update && sudo apt install ffmpeg
+
+ > pip install -r requirements.txt
+
+ - 2 download the Timbre Encoder: [Speaker-Encoder by @mueller91](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3), and put `best_model.pth.tar` into `speaker_pretrain/`
+
+ - 3 download the whisper model [multi-language medium model](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt); make sure to download `medium.pt` and put it into `whisper_pretrain/` (a download sketch follows below)
+
+ - 4 whisper is built into this project; do not install it separately, or it will conflict and raise errors
+
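+ The whisper download can be scripted; a minimal sketch using the URL above (wget is assumed to be installed):
+
+ ```shell
+ # Create the checkpoint directories this README expects.
+ mkdir -p whisper_pretrain speaker_pretrain
+ # Fetch the whisper medium checkpoint into whisper_pretrain/medium.pt.
+ wget -O whisper_pretrain/medium.pt \
+   https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt
+ ```
+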
+ ## Data preprocessing
+ - 1, set the working directory:
+
+ > export PYTHONPATH=$PWD
+
+ - 2, re-sampling
+
+ generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k:
+
+ > python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
+
+ generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k:
+
+ > python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
+
+ - 3, use the 16 kHz audio to extract pitch; f0_ceil=900 needs to be adjusted according to the highest pitch of your data
+ > python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch
+
+ or use the following for low-quality audio
+
+ > python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
+
+ - 4, use the 16 kHz audio to extract the ppg
+ > python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
+
+ - 5, use the 16 kHz audio to extract the timbre code
+ > python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
+
+ - 6, average the timbre codes of each speaker for inference; the average can also replace the per-utterance timbre when generating the training index, so that the speaker is trained with a unified timbre (a sketch of the computation follows below)
+ > python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
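+
+ A hedged sketch of what step 6 computes, as an assumption based on the description above rather than this project's actual code:
+
+ ```python
+ import glob
+
+ import numpy as np
+
+ # Average all per-utterance timbre codes of one speaker into a single vector.
+ codes = [np.load(p) for p in glob.glob("data_svc/speaker/speaker0/*.spk.npy")]
+ np.save("data_svc/singer/speaker0.spk.npy", np.mean(codes, axis=0))
+ ```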
+
+ - 7, use the 32 kHz audio to extract the linear spectrum
+ > python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
+
+ - 8, use the 32 kHz audio to generate the training index
+ > python prepare/preprocess_train.py
+
+ - 9, debug the training files
+ > python prepare/preprocess_zzz.py
+
+ ```shell
+ data_svc/
+ ├── waves-16k
+ │   ├── speaker0
+ │   │   ├── 000001.wav
+ │   │   └── 000xxx.wav
+ │   └── speaker1
+ │       ├── 000001.wav
+ │       └── 000xxx.wav
+ ├── waves-32k
+ │   ├── speaker0
+ │   │   ├── 000001.wav
+ │   │   └── 000xxx.wav
+ │   └── speaker1
+ │       ├── 000001.wav
+ │       └── 000xxx.wav
+ ├── pitch
+ │   ├── speaker0
+ │   │   ├── 000001.pit.npy
+ │   │   └── 000xxx.pit.npy
+ │   └── speaker1
+ │       ├── 000001.pit.npy
+ │       └── 000xxx.pit.npy
+ ├── whisper
+ │   ├── speaker0
+ │   │   ├── 000001.ppg.npy
+ │   │   └── 000xxx.ppg.npy
+ │   └── speaker1
+ │       ├── 000001.ppg.npy
+ │       └── 000xxx.ppg.npy
+ ├── speaker
+ │   ├── speaker0
+ │   │   ├── 000001.spk.npy
+ │   │   └── 000xxx.spk.npy
+ │   └── speaker1
+ │       ├── 000001.spk.npy
+ │       └── 000xxx.spk.npy
+ └── singer
+     ├── speaker0.spk.npy
+     └── speaker1.spk.npy
+ ```
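+
+ A quick way to verify the preprocessing output is to load one utterance's features and check their shapes; a minimal sketch with paths following the layout above:
+
+ ```python
+ import numpy as np
+
+ # Load the extracted features for one utterance of speaker0.
+ pit = np.load("data_svc/pitch/speaker0/000001.pit.npy")    # F0 contour
+ ppg = np.load("data_svc/whisper/speaker0/000001.ppg.npy")  # whisper content encoding
+ spk = np.load("data_svc/speaker/speaker0/000001.spk.npy")  # timbre code
+ print(pit.shape, ppg.shape, spk.shape)
+ ```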
+
+ ## Train
+ - 0, if fine-tuning from the pre-trained model, you need to download it first: sovits5.0_main_1500.pth
+
+ > set pretrain: "./sovits5.0_main_1500.pth" in configs/base.yaml, and lower the learning rate appropriately, e.g. 1e-5
+
+ - 1, set the working directory
+
+ > export PYTHONPATH=$PWD
+
+ - 2, start training
+
+ > python svc_trainer.py -c configs/base.yaml -n sovits5.0
+
+ - 3, resume training
+
+ > python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth
+
+ - 4, view the logs
+
+ > tensorboard --logdir logs/
+
+ ![sovits5 0_base](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/1628e775-5888-4eac-b173-a28dca978faa)
+
+ ## Inference
+
+ - 1, set the working directory
+
+ > export PYTHONPATH=$PWD
+
+ - 2, export the inference model: text encoder, Flow network, and Decoder network
+
+ > python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
+
+ - 3, use whisper to extract the content encoding separately, rather than inside one-click inference, to reduce GPU memory usage
+
+ > python whisper/inference.py -w test.wav -p test.ppg.npy
+
+ this generates test.ppg.npy; if no ppg file is specified in the next step, it is generated automatically
+
+ - 4, extract the F0 parameters to CSV text format, open the CSV file in Excel, and manually correct wrong F0 values with reference to Audition or SonicVisualiser
+
+ > python pitch/inference.py -w test.wav -p test.csv
+
+ - 5, specify the parameters and run inference
+
+ > python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv
+
+ when --ppg is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically;
+
+ when --pit is specified, the manually tuned F0 parameters are loaded; if not specified, they are extracted automatically;
+
+ the output file is generated in the current directory: svc_out.wav
+
+ | args | --config | --model | --spk | --wave | --ppg | --pit | --shift |
+ | --- | --- | --- | --- | --- | --- | --- | --- |
+ | name | config path | model path | speaker | wave input | wave ppg | wave pitch | pitch shift |
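+
+ Putting steps 1-5 together, a minimal end-to-end inference run might look as follows; the checkpoint name `sovits5.0_0100.pth` is a placeholder for your own:
+
+ ```shell
+ export PYTHONPATH=$PWD
+ # Export the inference model from a training checkpoint (placeholder name).
+ python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/sovits5.0_0100.pth
+ # Pre-extract the content encoding and F0 so they can be inspected or reused.
+ python whisper/inference.py -w test.wav -p test.ppg.npy
+ python pitch/inference.py -w test.wav -p test.csv
+ # Convert test.wav to the timbre of singer0001; the output is svc_out.wav.
+ python svc_inference.py --config configs/base.yaml --model sovits5.0.pth \
+   --spk ./configs/singers/singer0001.npy --wave test.wav \
+   --ppg test.ppg.npy --pit test.csv
+ ```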
+
+ ## Create singer
+ named by pure coincidence: average -> ave -> eva; eve (eva) represents conception and reproduction
+
+ > python svc_eva.py
+
+ ```python
+ eva_conf = {
+     './configs/singers/singer0022.npy': 0,
+     './configs/singers/singer0030.npy': 0,
+     './configs/singers/singer0047.npy': 0.5,
+     './configs/singers/singer0051.npy': 0.5,
+ }
+ ```
+
+ the generated singer file is: eva.spk.npy
+
+ 💗Both the Flow and the Decoder take a timbre as input, and you can even feed different timbre parameters to the two modules to create more unique timbres. A sketch of the mixing idea follows below.
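+
+ A hedged sketch of the weighted mixing the configuration above suggests; this is an assumption about what svc_eva.py computes, not its actual code:
+
+ ```python
+ import numpy as np
+
+ eva_conf = {
+     './configs/singers/singer0047.npy': 0.5,
+     './configs/singers/singer0051.npy': 0.5,
+ }
+ # Blend speaker embeddings as a weighted sum; the weights here sum to 1.
+ mix = sum(weight * np.load(path) for path, weight in eva_conf.items())
+ np.save("eva.spk.npy", mix)
+ ```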
+
+ ## Datasets
+
+ | Name | URL |
+ | --- | --- |
+ | KiSing | http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/ |
+ | PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
+ | opencpop | https://wenet.org.cn/opencpop/download/ |
+ | Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
+ | M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
+ | CSD | https://zenodo.org/record/4785016#.YxqrTbaOMU4 |
+ | KSS | https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset |
+ | JVS-MuSiC | https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music |
+ | PJS | https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus |
+ | JSUT Song | https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song |
+ | MUSDB18 | https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems |
+ | DSD100 | https://sigsep.github.io/datasets/dsd100.html |
+ | Aishell-3 | http://www.aishelltech.com/aishell_3 |
+ | VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |
+
+ ## Code sources and references
+
+ https://github.com/facebookresearch/speech-resynthesis [paper](https://arxiv.org/abs/2104.00355)
+
+ https://github.com/jaywalnut310/vits [paper](https://arxiv.org/abs/2106.06103)
+
+ https://github.com/openai/whisper/ [paper](https://arxiv.org/abs/2212.04356)
+
+ https://github.com/NVIDIA/BigVGAN [paper](https://arxiv.org/abs/2206.04658)
+
+ https://github.com/mindslab-ai/univnet [paper](https://arxiv.org/abs/2106.07889)
+
+ https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
+
+ https://github.com/brentspell/hifi-gan-bwe
+
+ https://github.com/mozilla/TTS
+
+ https://github.com/OlaWod/FreeVC [paper](https://arxiv.org/abs/2210.15418)
+
+ [SNAC: Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech](https://github.com/hcy71o/SNAC)
+
+ [Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585)
+
+ [AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/pdf/2103.00993.pdf)
+
+ [Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis](https://github.com/ubisoft/ubisoft-laforge-daft-exprt)
+
+ [Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings](https://arxiv.org/abs/2305.05401)
+
+ [Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion](https://arxiv.org/pdf/2305.09167.pdf)
+
+ [Speaker normalization (GRL) for self-supervised speech emotion recognition](https://arxiv.org/abs/2202.01252)
+
+ ## Method of Preventing Timbre Leakage Based on Data Perturbation
+
+ https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
+
+ https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
+
+ https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
+
+ https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
+
+ https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
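+
+ For illustration, a minimal sketch of the praat-style perturbation these repositories implement, using the parselmouth package; parselmouth is not a dependency of this project, and the parameter values are illustrative only:
+
+ ```python
+ import parselmouth
+ from parselmouth.praat import call
+
+ snd = parselmouth.Sound("test.wav")
+ # Shift formants by 10% while keeping pitch median and duration unchanged,
+ # perturbing timbre cues without destroying the linguistic content.
+ perturbed = call(snd, "Change gender", 75, 600, 1.1, 0.0, 1.0, 1.0)
+ perturbed.save("test_perturbed.wav", "WAV")
+ ```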
+
+ ## Contributors
+
+ <a href="https://github.com/PlayVoice/so-vits-svc/graphs/contributors">
+   <img src="https://contrib.rocks/image?repo=PlayVoice/so-vits-svc" />
+ </a>
+
 ---
 license: afl-3.0
+ ---