<div align="center">
<h1> Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS </h1>
    
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/maxmax20160403/sovits5.0)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PY1E4bDAeHbAD4r99D_oYXB46fG8nIA5?usp=sharing)
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub issues" src="https://img.shields.io/github/issues/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub" src="https://img.shields.io/github/license/PlayVoice/so-vits-svc-5.0">
 
</div>

- πŸ’—This project is aimed at deep learning beginners; basic familiarity with Python and PyTorch is a prerequisite.
- πŸ’—This project aims to help deep learning beginners move beyond dry, purely theoretical study and master the basics of deep learning by combining theory with practice.
- πŸ’—This project does not support real-time voice conversion (supporting it would require replacing whisper).
- πŸ’—This project will not develop one-click packages for other purposes.

![sovits_framework](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/402cf58d-6d03-4d0b-9d6a-94f079898672)

- a GPU with 6 GB of memory is enough for training

- supports multiple speakers

- create unique speakers through speaker mixing

- audio with light accompaniment can still be converted

- F0 can be edited using Excel

## Model properties

https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/hifigan_release

- [sovits5.0_main_1500.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/download/hifigan_release/sovits5.0_main_1500.pth) The model includes generator + discriminator (176M) and can be used as a pre-training model
- speaker files are in the configs/singers directory and can be used for inference tests, especially for checking timbre leakage
- speakers 22, 30, 47, and 51 are highly recognizable; their training audio samples are in the configs/singers_sample directory

| Feature | From | Status | Function | Remarks |
| --- | --- | --- | --- | --- |
| whisper | OpenAI | βœ… | strong noise immunity | - |
| bigvgan | NVIDIA | βœ… | anti-aliasing and snake activation | Uses a little more GPU memory and has been removed from the main branch; switch to the [bigvgan](https://github.com/PlayVoice/so-vits-svc-5.0/tree/bigvgan) branch for clearer formants and noticeably better sound quality |
| natural speech | Microsoft | βœ… | reduces mispronunciation | - |
| neural source-filter | NII | βœ… | solves the problem of F0 discontinuity | - |
| speaker encoder | Google | βœ… | timbre encoding and clustering | - |
| GRL for speaker | Ubisoft | βœ… | prevents the encoder from leaking timbre | - |
| one shot vits | Samsung | βœ… | voice cloning | - |
| SCLN | Microsoft | βœ… | improves cloning | - |
| PPG perturbation | this project | βœ… | improves noise immunity and removes timbre | - |
| VAE perturbation | this project | βœ… | improves sound quality | - |

πŸ’—Because of the data perturbation, this project takes longer to train than comparable projects.

## Dataset preparation

Necessary pre-processing:
- 1 accompaniment separation
- 2 band extension
- 3 sound quality improvement
- 4 cutting the audio into clips of less than 30 seconds for whisperπŸ’—

Then place the dataset in the dataset_raw directory according to the following file structure (a sketch for checking clip lengths follows the tree):
```shell
dataset_raw
β”œβ”€β”€β”€speaker0
β”‚   β”œβ”€β”€β”€000001.wav
β”‚   β”œβ”€β”€β”€...
β”‚   └───000xxx.wav
└───speaker1
    β”œβ”€β”€β”€000001.wav
    β”œβ”€β”€β”€...
    └───000xxx.wav
```
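
A minimal sketch for verifying the 30-second limit before preprocessing, assuming `soundfile` is installed and clips are laid out under `dataset_raw` as shown above:

```python
import pathlib

import soundfile as sf

MAX_SECONDS = 30  # whisper expects clips shorter than 30 seconds

# Walk dataset_raw/<speaker>/*.wav and flag clips that are too long.
for wav in sorted(pathlib.Path("dataset_raw").rglob("*.wav")):
    duration = sf.info(wav).duration  # length in seconds, read from the header
    if duration >= MAX_SECONDS:
        print(f"too long ({duration:.1f}s): {wav}")
```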

## Install dependencies

- 1 software dependencies

  > apt update && sudo apt install ffmpeg

  > pip install -r requirements.txt

- 2 download the Timbre Encoder: [Speaker-Encoder by @mueller91](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3), and put `best_model.pth.tar` into `speaker_pretrain/`

- 3 download the whisper [multilingual medium model](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt); make sure to download `medium.pt` and put it into `whisper_pretrain/`

- 4 whisper is built into this project; do not install it separately, or the two copies will conflict and raise an error (a quick existence check for the downloaded files follows)
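
A minimal sketch to fail fast if a pretrained file is missing; the paths are exactly the ones named in the steps above:

```python
import pathlib

# Paths from the download steps above.
required = [
    pathlib.Path("speaker_pretrain/best_model.pth.tar"),
    pathlib.Path("whisper_pretrain/medium.pt"),
]
for path in required:
    assert path.is_file(), f"missing pretrained file: {path}"
print("all pretrained files are in place")
```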

## Data preprocessing
- 1, set the working directory:

    > export PYTHONPATH=$PWD

- 2, re-sample

    generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k

    > python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000

    generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k

    > python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000

- 3, use the 16 kHz audio to extract pitch; f0_ceil=900 needs to be adjusted to the highest pitch in your data
    > python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch

    or use the following for low-quality audio

    > python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch

- 4, use the 16 kHz audio to extract the PPG
    > python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

- 5, use the 16 kHz audio to extract the timbre code
    > python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker

- 6, extract the average timbre code for inference; it can also replace the per-utterance timbre when generating the training index, serving as the speaker's unified timbre for training (a sketch of this averaging follows the list)
    > python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer

- 7, use the 32 kHz audio to extract the linear spectrogram
    > python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs

- 8, use the 32 kHz audio to generate the training index
    > python prepare/preprocess_train.py

- 9, debug the training files
    > python prepare/preprocess_zzz.py
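
A minimal sketch of what the averaging in step 6 amounts to, assuming each utterance embedding is a 1-D vector stored as `.spk.npy` (the actual logic lives in `prepare/preprocess_speaker_ave.py`):

```python
import pathlib

import numpy as np

src = pathlib.Path("data_svc/speaker")  # per-utterance timbre codes
dst = pathlib.Path("data_svc/singer")   # one averaged timbre code per speaker
dst.mkdir(parents=True, exist_ok=True)

for speaker_dir in sorted(p for p in src.iterdir() if p.is_dir()):
    embeddings = [np.load(f) for f in sorted(speaker_dir.glob("*.spk.npy"))]
    average = np.mean(np.stack(embeddings), axis=0)
    np.save(dst / f"{speaker_dir.name}.spk.npy", average)
```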

```shell
data_svc/
└── waves-16k
β”‚    └── speaker0
β”‚    β”‚      β”œβ”€β”€ 000001.wav
β”‚    β”‚      └── 000xxx.wav
β”‚    └── speaker1
β”‚           β”œβ”€β”€ 000001.wav
β”‚           └── 000xxx.wav
└── waves-32k
β”‚    └── speaker0
β”‚    β”‚      β”œβ”€β”€ 000001.wav
β”‚    β”‚      └── 000xxx.wav
β”‚    └── speaker1
β”‚           β”œβ”€β”€ 000001.wav
β”‚           └── 000xxx.wav
└── pitch
β”‚    └── speaker0
β”‚    β”‚      β”œβ”€β”€ 000001.pit.npy
β”‚    β”‚      └── 000xxx.pit.npy
β”‚    └── speaker1
β”‚           β”œβ”€β”€ 000001.pit.npy
β”‚           └── 000xxx.pit.npy
└── whisper
β”‚    └── speaker0
β”‚    β”‚      β”œβ”€β”€ 000001.ppg.npy
β”‚    β”‚      └── 000xxx.ppg.npy
β”‚    └── speaker1
β”‚           β”œβ”€β”€ 000001.ppg.npy
β”‚           └── 000xxx.ppg.npy
└── speaker
β”‚    └── speaker0
β”‚    β”‚      β”œβ”€β”€ 000001.spk.npy
β”‚    β”‚      └── 000xxx.spk.npy
β”‚    └── speaker1
β”‚           β”œβ”€β”€ 000001.spk.npy
β”‚           └── 000xxx.spk.npy
└── singer
    β”œβ”€β”€ speaker0.spk.npy
    └── speaker1.spk.npy
```
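
A minimal sketch for sanity-checking that every wave has matching feature files after preprocessing; the folder names and extensions are taken from the tree above:

```python
import pathlib

root = pathlib.Path("data_svc")
features = {"pitch": ".pit.npy", "whisper": ".ppg.npy", "speaker": ".spk.npy"}

# For each 32 kHz wave, verify that every per-utterance feature file exists.
for wav in sorted((root / "waves-32k").rglob("*.wav")):
    speaker, stem = wav.parent.name, wav.stem
    for folder, ext in features.items():
        feature = root / folder / speaker / (stem + ext)
        if not feature.exists():
            print(f"missing {feature}")
```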

## Train
- 0, if fine-tuning from the pre-trained model, download it first: sovits5.0_main_1500.pth

    > set pretrain: "./sovits5.0_main_1500.pth" in configs/base.yaml, and lower the learning rate appropriately, e.g. to 1e-5

- 1, set the working directory

    > export PYTHONPATH=$PWD

- 2, start training

    > python svc_trainer.py -c configs/base.yaml -n sovits5.0

- 3, resume training

    > python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth

- 4, view the logs

    > tensorboard --logdir logs/

![sovits5 0_base](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/1628e775-5888-4eac-b173-a28dca978faa)
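
A minimal sketch for inspecting a checkpoint before resuming; the filename below is hypothetical (the real ones land under `chkpt/sovits5.0/`), and it assumes the checkpoint is a plain `torch.save` dictionary:

```python
import torch

# Hypothetical checkpoint name; substitute a real file from chkpt/sovits5.0/.
ckpt = torch.load("chkpt/sovits5.0/sovits5.0_0100.pth", map_location="cpu")

# Print the top-level keys to see what the checkpoint contains.
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        print(key, type(value).__name__)
```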

## Inference

- 1, set the working directory

    > export PYTHONPATH=$PWD

- 2, export the inference model: text encoder, flow network, and decoder network

    > python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt

- 3, use whisper to extract the content encoding; this runs as a separate step (outside one-click inference) to reduce GPU memory usage

    > python whisper/inference.py -w test.wav -p test.ppg.npy

    this generates test.ppg.npy; if no ppg file is specified in the next step, it will be generated automatically

- 4, extract the F0 parameters in CSV text format, open the CSV file in Excel, and manually correct wrong F0 values by referring to Audition or SonicVisualiser (a sketch of programmatic cleanup follows the list)

    > python pitch/inference.py -w test.wav -p test.csv

- 5, specify the parameters and run inference

    > python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv

    when --ppg is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it will be extracted automatically;

    when --pit is specified, the manually tuned F0 parameters are loaded; if it is not specified, they will be extracted automatically;

    the output is written to the current directory as svc_out.wav

    | arg | --config | --model | --spk | --wave | --ppg | --pit | --shift |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | meaning | config path | model path | speaker file | input wave | wave PPG | wave pitch | pitch shift |
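
A minimal sketch of programmatic F0 cleanup as an alternative to hand-editing in Excel; it assumes the CSV holds one F0 value (in Hz) per frame in its last column, which may not match the exact layout written by `pitch/inference.py`:

```python
import numpy as np

# Load the F0 track (assumed layout: one frame per row, F0 in the last column).
f0 = np.loadtxt("test.csv", delimiter=",", ndmin=2)

pitch = f0[:, -1]
pitch[pitch > 900.0] = 0.0  # zero out implausible spikes (hypothetical ceiling)
f0[:, -1] = pitch

np.savetxt("test.csv", f0, delimiter=",", fmt="%.3f")
```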

## Create singer
named by pure coincidence: average -> ave -> eva; Eve (EVA) represents conception and reproduction

> python svc_eva.py

```python
eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
```

the generated singer file is eva.spk.npy

πŸ’—both the flow and the decoder take a timbre input, and you can even feed different timbre parameters to the two modules to create more unique timbres; a sketch of the mixing itself follows.
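
A minimal sketch of what the mixing amounts to: a weighted sum of speaker embeddings, using the same weights as `eva_conf` above (the actual logic lives in `svc_eva.py`):

```python
import numpy as np

eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}

# Weighted sum of the speaker embeddings; the weights here sum to 1.
mixed = sum(weight * np.load(path) for path, weight in eva_conf.items())
np.save("eva.spk.npy", mixed)
```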

## Datasets

| Name | URL |
| --- | --- |
|KiSing         |http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/|
|PopCS          |https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md|
|opencpop       |https://wenet.org.cn/opencpop/download/|
|Multi-Singer   |https://github.com/Multi-Singer/Multi-Singer.github.io|
|M4Singer       |https://github.com/M4Singer/M4Singer/blob/master/apply_form.md|
|CSD            |https://zenodo.org/record/4785016#.YxqrTbaOMU4|
|KSS            |https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset|
|JVS Music      |https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music|
|PJS            |https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus|
|JSUT Song      |https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song|
|MUSDB18        |https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems|
|DSD100         |https://sigsep.github.io/datasets/dsd100.html|
|Aishell-3      |http://www.aishelltech.com/aishell_3|
|VCTK           |https://datashare.ed.ac.uk/handle/10283/2651|

## Code sources and references

https://github.com/facebookresearch/speech-resynthesis [paper](https://arxiv.org/abs/2104.00355)

https://github.com/jaywalnut310/vits [paper](https://arxiv.org/abs/2106.06103)

https://github.com/openai/whisper/ [paper](https://arxiv.org/abs/2212.04356)

https://github.com/NVIDIA/BigVGAN [paper](https://arxiv.org/abs/2206.04658)

https://github.com/mindslab-ai/univnet [paper](https://arxiv.org/abs/2106.07889)

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/brentspell/hifi-gan-bwe

https://github.com/mozilla/TTS

https://github.com/OlaWod/FreeVC [paper](https://arxiv.org/abs/2210.15418)

[SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech](https://github.com/hcy71o/SNAC)

[Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585)

[AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/pdf/2103.00993.pdf)

[Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis](https://github.com/ubisoft/ubisoft-laforge-daft-exprt)

[Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings](https://arxiv.org/abs/2305.05401)

[Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion](https://arxiv.org/pdf/2305.09167.pdf)

[Speaker normalization (GRL) for self-supervised speech emotion recognition](https://arxiv.org/abs/2202.01252)

## Method of Preventing Timbre Leakage Based on Data Perturbation

https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py

https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py

https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py

https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py

https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py

## Contributors

<a href="https://github.com/PlayVoice/so-vits-svc/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=PlayVoice/so-vits-svc" />
</a>

---
license: afl-3.0
---