---
license: afl-3.0
---

<div align="center">
<h1> Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS </h1>

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/maxmax20160403/sovits5.0)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PY1E4bDAeHbAD4r99D_oYXB46fG8nIA5?usp=sharing)
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub issues" src="https://img.shields.io/github/issues/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub" src="https://img.shields.io/github/license/PlayVoice/so-vits-svc-5.0">

</div>

- 💗 This project is aimed at beginners in deep learning; basic familiarity with Python and PyTorch is a prerequisite.
- 💗 This project aims to help deep learning beginners move beyond dry theory and pick up the fundamentals through hands-on practice.
- 💗 This project does not support real-time voice conversion (supporting it would require replacing whisper).
- 💗 This project will not develop one-click packages for other purposes!

![sovits_framework](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/402cf58d-6d03-4d0b-9d6a-94f079898672)

- a GPU with 6 GB of memory is enough for training

- supports multiple speakers

- create unique speakers through speaker mixing

- audio with light accompaniment can also be converted

- F0 can be edited using Excel

## Model properties

https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/hifigan_release

- [sovits5.0_main_1500.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/download/hifigan_release/sovits5.0_main_1500.pth) contains generator + discriminator (176M) and can be used as a pre-trained model
- speaker files are in the configs/singers directory; they can be used for inference tests, especially for checking timbre leakage
- speakers 22, 30, 47, and 51 are highly recognizable; their training audio samples are in the configs/singers_sample directory

| Feature | From | Status | Function | Remarks |
| --- | --- | --- | --- | --- |
| whisper | OpenAI | ✅ | strong noise immunity | - |
| bigvgan | NVIDIA | ✅ | anti-aliasing and snake activation | uses a little more GPU memory; removed from the main branch, so you need to switch to the [bigvgan](https://github.com/PlayVoice/so-vits-svc-5.0/tree/bigvgan) branch; formants are clearer and sound quality is noticeably improved |
| natural speech | Microsoft | ✅ | reduces mispronunciation | - |
| neural source-filter | NII | ✅ | solves F0 discontinuity | - |
| speaker encoder | Google | ✅ | timbre encoding and clustering | - |
| GRL for speaker | Ubisoft | ✅ | prevents the encoder from leaking timbre | - |
| one shot vits | Samsung | ✅ | voice cloning | - |
| SCLN | Microsoft | ✅ | improves cloning | - |
| PPG perturbation | this project | ✅ | improves noise immunity and de-timbre | - |
| VAE perturbation | this project | ✅ | improves sound quality | - |

💗 Because of the data perturbation, training takes longer than in comparable projects.

## Dataset preparation

Necessary pre-processing:
- 1 accompaniment separation
- 2 band extension
- 3 sound quality improvement
- 4 cutting the audio into clips of less than 30 seconds for whisper 💗 (a quick duration check is sketched after the tree below)

Then put the dataset into the dataset_raw directory according to the following file structure:
```shell
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
```
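
Before preprocessing, it can help to confirm that every clip respects the 30-second whisper limit. The following is a minimal sketch (not part of this repo's tooling), assuming WAV files laid out exactly as above and the third-party `soundfile` package:

```python
# Hypothetical sanity check: flag clips too long for whisper (>= 30 s).
from pathlib import Path

import soundfile as sf  # pip install soundfile

MAX_SECONDS = 30.0  # whisper limit mentioned above

for wav in sorted(Path("dataset_raw").glob("*/*.wav")):
    info = sf.info(str(wav))
    duration = info.frames / info.samplerate
    if duration >= MAX_SECONDS:
        print(f"TOO LONG ({duration:.1f}s): {wav}")
```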

## Install dependencies

- 1 software dependencies

> apt update && sudo apt install ffmpeg

> pip install -r requirements.txt

- 2 download the Timbre Encoder: [Speaker-Encoder by @mueller91](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3) and put `best_model.pth.tar` into `speaker_pretrain/`

- 3 download the whisper [multilingual medium model](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt); make sure to download `medium.pt` and put it into `whisper_pretrain/` (a quick existence check is sketched after this list)

- 4 whisper is built in; do not install it separately, as that will cause conflicts and errors
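
A minimal sketch (paths taken from the two download steps above) to verify that both pre-trained files landed in the right place:

```python
# Hypothetical pre-flight check that the downloaded models are in place.
from pathlib import Path

for path in ["speaker_pretrain/best_model.pth.tar", "whisper_pretrain/medium.pt"]:
    print(f"{path}: {'OK' if Path(path).exists() else 'MISSING'}")
```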

## Data preprocessing

- 1) set working directory:

> export PYTHONPATH=$PWD

- 2) re-sampling

generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k

> python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000

generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k

> python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000

- 3) use the 16k audio to extract pitch; f0_ceil=900 should be adjusted to the highest pitch of your data

> python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch

or use the following for low-quality audio

> python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch

- 4) use the 16k audio to extract ppg

> python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

- 5) use the 16k audio to extract the timbre code

> python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker

- 6) extract the average timbre code for inference; it can also replace the per-utterance timbre code when generating the training index, so that each speaker trains with one unified timbre

> python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer

- 7) use the 32k audio to extract the linear spectrogram

> python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs

- 8) use the 32k audio to generate the training index

> python prepare/preprocess_train.py

- 9) training file debugging (a minimal spot check of the generated files is sketched after the tree below)

> python prepare/preprocess_zzz.py

```shell
data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
```
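
The sketch below is a hypothetical spot check of the generated features for one utterance (the repo's own debugging entry point is `prepare/preprocess_zzz.py`); the speaker and utterance names are placeholders matching the tree above:

```python
# Load one utterance's features and print their shapes as a sanity check.
import numpy as np

utt = "speaker0/000001"
arrays = {
    "pitch":   np.load(f"data_svc/pitch/{utt}.pit.npy"),    # per-frame F0
    "ppg":     np.load(f"data_svc/whisper/{utt}.ppg.npy"),  # whisper content features
    "speaker": np.load(f"data_svc/speaker/{utt}.spk.npy"),  # utterance timbre code
    "singer":  np.load("data_svc/singer/speaker0.spk.npy"), # averaged timbre code
}
for name, arr in arrays.items():
    print(f"{name}: shape={arr.shape} dtype={arr.dtype}")
```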

## Train

- 0) if fine-tuning from the pre-trained model, download it first: sovits5.0_main_1500.pth (a scripted version of this config edit is sketched at the end of this section)

> set pretrain: "./sovits5.0_main_1500.pth" in configs/base.yaml, and lower the learning rate appropriately, e.g. to 1e-5

- 1) set working directory

> export PYTHONPATH=$PWD

- 2) start training

> python svc_trainer.py -c configs/base.yaml -n sovits5.0

- 3) resume training

> python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth

- 4) view logs

> tensorboard --logdir logs/

![sovits5 0_base](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/1628e775-5888-4eac-b173-a28dca978faa)
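
For step 0, the config edit can also be scripted. This is a hedged sketch, not repo tooling: it assumes a top-level `pretrain` key in configs/base.yaml, as the quoted instruction suggests; if the key sits under a sub-section, adjust accordingly (the learning-rate field is left to a manual edit since its exact location is not specified here):

```python
# Hypothetical helper to point configs/base.yaml at the pre-trained model;
# editing the file by hand is equivalent. Requires PyYAML.
import yaml

with open("configs/base.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["pretrain"] = "./sovits5.0_main_1500.pth"  # assumed top-level key

with open("configs/base.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```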

## Inference

- 1) set working directory

> export PYTHONPATH=$PWD

- 2) export the inference model: text encoder, Flow network, Decoder network

> python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt

- 3) use whisper to extract the content encoding; this is kept out of the one-click inference to reduce GPU memory usage

> python whisper/inference.py -w test.wav -p test.ppg.npy

this generates test.ppg.npy; if no ppg file is specified in the next step, it is generated automatically

- 4) extract the F0 parameters to CSV text format, open the CSV file in Excel, and manually correct wrong F0 values with the help of Audition or SonicVisualiser (a scripted alternative is sketched after the table below)

> python pitch/inference.py -w test.wav -p test.csv

- 5) specify the parameters and run inference

> python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv

when --ppg is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically;

when --pit is specified, the manually tuned F0 parameters are loaded; if it is not specified, they are extracted automatically;

the output is written to the current directory as svc_out.wav

| args | --config | --model | --spk | --wave | --ppg | --pit | --shift |
| --- | --- | --- | --- | --- | --- | --- | --- |
| name | config path | model path | speaker | wave input | wave ppg | wave pitch | pitch shift |
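
As a scripted alternative to editing the CSV in Excel (step 4), the sketch below applies two common fixes in bulk. It is hypothetical and hedged: verify the actual layout of test.csv produced by `pitch/inference.py` before relying on it; a single column of per-frame F0 values in Hz is assumed here, and the threshold and shift are example values:

```python
# Hypothetical bulk F0 touch-up for test.csv.
import numpy as np

f0 = np.loadtxt("test.csv", delimiter=",")  # assumed: one F0 value (Hz) per frame
f0[f0 < 50.0] = 0.0                         # treat implausibly low frames as unvoiced
voiced = f0 > 0
f0[voiced] *= 2.0 ** (2 / 12.0)             # example: raise voiced frames two semitones
np.savetxt("test.csv", f0, delimiter=",", fmt="%.3f")
```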

## Create singer

named by pure coincidence: average -> ave -> eva; eve (eva) stands for conception and reproduction

> python svc_eva.py

```python
eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
```

the generated singer file is eva.spk.npy

💗 both Flow and Decoder take a timbre input, and you can even feed different timbre parameters to the two modules to create even more unique timbres; the mixing itself boils down to the weighted average sketched below.
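
A hypothetical re-implementation of the mixing idea (not the repo's `svc_eva.py` itself), assuming each `.spk.npy` file holds a single embedding vector and the weights come from `eva_conf` above:

```python
# Speaker mixing as a weighted average of speaker embedding vectors.
import numpy as np

eva_conf = {
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}

mix = sum(weight * np.load(path) for path, weight in eva_conf.items())
np.save("eva.spk.npy", np.asarray(mix, dtype=np.float32))
```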

## Dataset

| Name | URL |
| --- | --- |
| KiSing | http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/ |
| PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
| opencpop | https://wenet.org.cn/opencpop/download/ |
| Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
| M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
| CSD | https://zenodo.org/record/4785016#.YxqrTbaOMU4 |
| KSS | https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset |
| JVS Music | https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music |
| PJS | https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus |
| JSUT Song | https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song |
| MUSDB18 | https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems |
| DSD100 | https://sigsep.github.io/datasets/dsd100.html |
| Aishell-3 | http://www.aishelltech.com/aishell_3 |
| VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |

## Code sources and references

https://github.com/facebookresearch/speech-resynthesis [paper](https://arxiv.org/abs/2104.00355)

https://github.com/jaywalnut310/vits [paper](https://arxiv.org/abs/2106.06103)

https://github.com/openai/whisper/ [paper](https://arxiv.org/abs/2212.04356)

https://github.com/NVIDIA/BigVGAN [paper](https://arxiv.org/abs/2206.04658)

https://github.com/mindslab-ai/univnet [paper](https://arxiv.org/abs/2106.07889)

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/brentspell/hifi-gan-bwe

https://github.com/mozilla/TTS

https://github.com/OlaWod/FreeVC [paper](https://arxiv.org/abs/2210.15418)

[SNAC: Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech](https://github.com/hcy71o/SNAC)

[Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585)

[AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/pdf/2103.00993.pdf)

[Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis](https://github.com/ubisoft/ubisoft-laforge-daft-exprt)

[Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings](https://arxiv.org/abs/2305.05401)

[Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion](https://arxiv.org/pdf/2305.09167.pdf)

[Speaker normalization (GRL) for self-supervised speech emotion recognition](https://arxiv.org/abs/2202.01252)

## Method of Preventing Timbre Leakage Based on Data Perturbation

https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py

https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py

https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py

https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py

https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
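
As an illustration of the perturbation idea behind these references (a sketch, not this repo's implementation), the snippet below applies a NANSY-style formant shift with praat-parselmouth; the shift ratio of 1.2 is an arbitrary example:

```python
# Formant perturbation via Praat's "Change gender" command.
import parselmouth
from parselmouth.praat import call
import soundfile as sf

snd = parselmouth.Sound("test.wav")
# Arguments: pitch floor, pitch ceiling, formant shift ratio,
# new pitch median (0 = unchanged), pitch range factor, duration factor.
shifted = call(snd, "Change gender", 75, 600, 1.2, 0, 1.0, 1.0)
sf.write("test_perturbed.wav", shifted.values.T, int(shifted.sampling_frequency))
```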

## Contributors

<a href="https://github.com/PlayVoice/so-vits-svc/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=PlayVoice/so-vits-svc" />
</a>