|
<div align="center"> |
|
<h1> Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS </h1> |
|
|
|
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/maxmax20160403/sovits5.0) |
|
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PY1E4bDAeHbAD4r99D_oYXB46fG8nIA5?usp=sharing) |
|
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PlayVoice/so-vits-svc-5.0"> |
|
<img alt="GitHub forks" src="https://img.shields.io/github/forks/PlayVoice/so-vits-svc-5.0"> |
|
<img alt="GitHub issues" src="https://img.shields.io/github/issues/PlayVoice/so-vits-svc-5.0"> |
|
<img alt="GitHub" src="https://img.shields.io/github/license/PlayVoice/so-vits-svc-5.0"> |
|
|
|
</div> |
|
|
|
- This project is aimed at deep learning beginners; basic familiarity with Python and PyTorch is a prerequisite for using it;

- This project aims to help deep learning beginners move past dry, purely theoretical study and master the basics of deep learning through hands-on practice;

- This project does not support real-time voice conversion (supporting it would require replacing whisper);

- This project will not develop one-click packages for other purposes;
|
|
|
![sovits_framework](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/402cf58d-6d03-4d0b-9d6a-94f079898672) |
|
|
|
- a GPU with 6 GB of memory is enough for training
|
|
|
- support for multiple speakers |
|
|
|
- create unique speakers through speaker mixing |
|
|
|
- audio with light accompaniment can still be converted
|
|
|
- F0 can be edited using Excel |
|
|
|
## Model properties |
|
|
|
https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/hifigan_release |
|
|
|
- [sovits5.0_main_1500.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/download/hifigan_release/sovits5.0_main_1500.pth) contains both the generator and the discriminator (176M) and can be used as a pre-trained model
|
- speaker files are in the configs/singers directory; they can be used for inference tests, especially for checking timbre leakage
|
- speakers 22, 30, 47, and 51 are highly recognizable, and the training audio samples are in the configs/singers_sample directory |
|
|
|
| Feature | From | Status | Function | Remarks |
| --- | --- | --- | --- | --- |
| whisper | OpenAI | ✅ | strong noise immunity | - |
| bigvgan | NVIDIA | ✅ | anti-aliasing and snake activation | uses a little more GPU memory; removed from the main branch, so you need to switch to the [bigvgan](https://github.com/PlayVoice/so-vits-svc-5.0/tree/bigvgan) branch; formants are clearer and sound quality is noticeably improved |
| natural speech | Microsoft | ✅ | reduce mispronunciation | - |
| neural source-filter | NII | ✅ | solve the problem of F0 discontinuity in audio | - |
| speaker encoder | Google | ✅ | timbre encoding and clustering | - |
| GRL for speaker | Ubisoft | ✅ | prevent the encoder from leaking timbre | - |
| one shot vits | Samsung | ✅ | voice cloning | - |
| SCLN | Microsoft | ✅ | improve cloning | - |
| PPG perturbation | this project | ✅ | improved noise immunity and timbre removal | - |
| VAE perturbation | this project | ✅ | improve sound quality | - |
|
|
|
due to the use of data perturbation, training takes longer than in comparable projects.
|
|
|
## Dataset preparation |
|
|
|
Necessary pre-processing:

- 1. accompaniment separation

- 2. bandwidth extension

- 3. sound quality improvement

- 4. cut the audio into segments shorter than 30 seconds, as whisper requires (a slicing sketch follows below)
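A minimal sketch of the slicing step, assuming `librosa` and `soundfile` are available (neither tool is mandated by this project; any splitter that produces wav segments under 30 seconds works):

```python
# Hypothetical helper: split long recordings into chunks shorter than
# 30 seconds so whisper can process them. Not part of this repo.
import os
import librosa
import soundfile as sf

def split_wav(path, out_dir, max_sec=30):
    audio, sr = librosa.load(path, sr=None, mono=True)  # keep original rate
    os.makedirs(out_dir, exist_ok=True)
    chunk = max_sec * sr
    base = os.path.splitext(os.path.basename(path))[0]
    for i in range(0, len(audio), chunk):
        piece = audio[i:i + chunk]
        sf.write(os.path.join(out_dir, f"{base}_{i // chunk:03d}.wav"), piece, sr)

split_wav("long_take.wav", "dataset_raw/speaker0")
```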
|
|
|
Then put the dataset into the dataset_raw directory according to the following file structure:
|
```shell |
|
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
|
``` |
|
|
|
## Install dependencies |
|
|
|
- 1. software dependencies
|
|
|
> sudo apt update && sudo apt install ffmpeg
|
|
|
> pip install -r requirements.txt |
|
|
|
- 2. download the Timbre Encoder: [Speaker-Encoder by @mueller91](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3), put `best_model.pth.tar` into `speaker_pretrain/`
|
|
|
- 3. download the whisper model [multiple language medium model](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt); make sure to download `medium.pt` and put it into `whisper_pretrain/`
|
|
|
- 4. whisper is bundled with this project; do not install it separately, or it will conflict and raise errors
|
|
|
## Data preprocessing |
|
- 1. set working directory:
|
|
|
> export PYTHONPATH=$PWD |
|
|
|
- 2. re-sampling
|
|
|
generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k
|
|
|
> python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000 |
|
|
|
generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k
|
|
|
> python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000 |
|
|
|
- 3. use 16k audio to extract pitch; f0_ceil=900 must be adjusted to the highest pitch of your data (a sketch for checking this follows below)
|
> python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch |
|
|
|
or use crepe instead for low-quality audio:
|
|
|
> python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch |
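To choose a sensible f0_ceil, you can scan the pitch files this step produces. A minimal sketch, assuming each `.pit.npy` array stores per-frame F0 in Hz with 0 for unvoiced frames (verify against your own output):

```python
# Sketch: report the highest F0 found in the extracted pitch files,
# as a sanity check for f0_ceil. Assumes Hz values, 0 = unvoiced.
from pathlib import Path
import numpy as np

peak = 0.0
for f in Path("data_svc/pitch").rglob("*.pit.npy"):
    peak = max(peak, float(np.load(f).max()))
print(f"highest F0 in dataset: {peak:.1f} Hz")  # keep f0_ceil above this
```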
|
|
|
- 4. use 16k audio to extract ppg
|
> python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper |
|
|
|
- 5. use 16k audio to extract the timbre code
|
> python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker |
|
|
|
- 6. extract the average of the timbre code for inference; it can also replace the per-utterance timbre when generating the training index, serving as the speaker's unified timbre for training
|
> python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer |
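For intuition, the averaging in this step amounts to a mean over the per-utterance embeddings. A minimal sketch, assuming each `.spk.npy` file holds one fixed-size embedding vector; the authoritative logic lives in `prepare/preprocess_speaker_ave.py`:

```python
# Sketch of timbre-code averaging for one speaker. Assumes every
# .spk.npy file is a single embedding vector of the same shape.
from pathlib import Path
import numpy as np

spk_dir = Path("data_svc/speaker/speaker0")
embeds = [np.load(f) for f in sorted(spk_dir.glob("*.spk.npy"))]
ave = np.mean(np.stack(embeds), axis=0)
np.save("data_svc/singer/speaker0.spk.npy", ave)
```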
|
|
|
- 7. use 32k audio to extract the linear spectrogram
|
> python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs |
|
|
|
- 8. use 32k audio to generate the training index
|
> python prepare/preprocess_train.py |
|
|
|
- 9. training file debugging
|
> python prepare/preprocess_zzz.py |
|
|
|
```shell |
|
data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
|
``` |
|
|
|
## Train |
|
- 0. if fine-tuning from the pre-trained model, download it first: sovits5.0_main_1500.pth
|
|
|
> set pretrain: "./sovits5.0_main_1500.pth" in configs/base.yaml, and lower the learning rate appropriately, e.g. 1e-5
|
|
|
- 1. set working directory
|
|
|
> export PYTHONPATH=$PWD |
|
|
|
- 2. start training
|
|
|
> python svc_trainer.py -c configs/base.yaml -n sovits5.0 |
|
|
|
- 3. resume training
|
|
|
> python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth |
|
|
|
- 4. view logs
|
|
|
> tensorboard --logdir logs/ |
|
|
|
![sovits5 0_base](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/1628e775-5888-4eac-b173-a28dca978faa) |
|
|
|
## Inference |
|
|
|
- 1. set working directory
|
|
|
> export PYTHONPATH=$PWD |
|
|
|
- 2. export the inference model: text encoder, Flow network, and Decoder network
|
|
|
> python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt |
|
|
|
- 3. use whisper to extract the content encoding; doing this outside the one-click inference reduces GPU memory usage
|
|
|
> python whisper/inference.py -w test.wav -p test.ppg.npy |
|
|
|
this generates test.ppg.npy; if no ppg file is specified in the next step, it will be extracted automatically
|
|
|
- 4. extract the F0 parameters to csv text format; open the csv file in Excel and manually correct wrong F0 values, checking against Audition or SonicVisualiser (a programmatic sketch follows below)
|
|
|
> python pitch/inference.py -w test.wav -p test.csv |
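If you would rather correct F0 in code than in Excel, the sketch below halves frames that look like upward octave errors. It assumes the csv holds one F0 value in Hz per row with 0 for unvoiced frames; check the exact layout of the file `pitch/inference.py` writes before using it:

```python
# Sketch: auto-fix obvious octave jumps in the exported F0 track.
# Assumes one F0 value (Hz) per row, 0 = unvoiced.
import numpy as np

f0 = np.loadtxt("test.csv", delimiter=",")
voiced = f0 > 0
median = np.median(f0[voiced])
octave_up = voiced & (f0 > 1.8 * median)  # likely a full octave too high
f0[octave_up] /= 2.0
np.savetxt("test.csv", f0, delimiter=",", fmt="%.2f")
```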
|
|
|
- 5. specify parameters and run inference
|
|
|
> python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv |
|
|
|
when --ppg is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically;
|
|
|
when --pit is specified, the manually tuned F0 parameters are loaded; if not specified, they are extracted automatically;
|
|
|
the output is written to the current directory: svc_out.wav
|
|
|
| arg | --config | --model | --spk | --wave | --ppg | --pit | --shift |
| --- | --- | --- | --- | --- | --- | --- | --- |
| meaning | config path | model path | speaker file | input wave | ppg of the wave | pitch of the wave | pitch shift |
|
|
|
## Create singer

named by pure coincidence: average -> ave -> eva; Eve (eva) represents conception and reproduction
|
|
|
> python svc_eva.py |
|
|
|
```python |
|
eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
|
``` |
|
|
|
the generated singer file is: eva.spk.npy
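Conceptually, the weights in `eva_conf` blend the selected timbre embeddings into a single vector. A minimal sketch of that weighted sum, assuming each singer file holds one embedding and the weights sum to 1:

```python
# Sketch of the speaker mixing behind svc_eva.py: a weighted sum of
# timbre embeddings, weights taken from eva_conf above.
import numpy as np

eva_conf = {
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
eva = sum(w * np.load(p) for p, w in eva_conf.items() if w > 0)
np.save("eva.spk.npy", eva)
```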
|
|
|
both the Flow and the Decoder take timbre as input; you can even feed different timbre parameters to the two modules to create more unique timbres.
|
|
|
## Datasets
|
|
|
| Name | URL |
| --- | --- |
|KiSing |http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/|
|PopCS |https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md|
|opencpop |https://wenet.org.cn/opencpop/download/|
|Multi-Singer |https://github.com/Multi-Singer/Multi-Singer.github.io|
|M4Singer |https://github.com/M4Singer/M4Singer/blob/master/apply_form.md|
|CSD |https://zenodo.org/record/4785016#.YxqrTbaOMU4|
|KSS |https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset|
|JVS-MuSiC |https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music|
|PJS |https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus|
|JSUT Song |https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song|
|MUSDB18 |https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems|
|DSD100 |https://sigsep.github.io/datasets/dsd100.html|
|Aishell-3 |http://www.aishelltech.com/aishell_3|
|VCTK |https://datashare.ed.ac.uk/handle/10283/2651|
|
|
|
## Code sources and references |
|
|
|
https://github.com/facebookresearch/speech-resynthesis [paper](https://arxiv.org/abs/2104.00355) |
|
|
|
https://github.com/jaywalnut310/vits [paper](https://arxiv.org/abs/2106.06103) |
|
|
|
https://github.com/openai/whisper/ [paper](https://arxiv.org/abs/2212.04356) |
|
|
|
https://github.com/NVIDIA/BigVGAN [paper](https://arxiv.org/abs/2206.04658) |
|
|
|
https://github.com/mindslab-ai/univnet [paper](https://arxiv.org/abs/2106.07889) |
|
|
|
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf |
|
|
|
https://github.com/brentspell/hifi-gan-bwe |
|
|
|
https://github.com/mozilla/TTS |
|
|
|
https://github.com/OlaWod/FreeVC [paper](https://arxiv.org/abs/2210.15418) |
|
|
|
[SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech](https://github.com/hcy71o/SNAC) |
|
|
|
[Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585) |
|
|
|
[AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/pdf/2103.00993.pdf) |
|
|
|
[Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis](https://github.com/ubisoft/ubisoft-laforge-daft-exprt) |
|
|
|
[Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings](https://arxiv.org/abs/2305.05401) |
|
|
|
[Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion](https://arxiv.org/pdf/2305.09167.pdf) |
|
|
|
[Speaker normalization (GRL) for self-supervised speech emotion recognition](https://arxiv.org/abs/2202.01252) |
|
|
|
## Method of Preventing Timbre Leakage Based on Data Perturbation |
|
|
|
https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py |
|
|
|
https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py |
|
|
|
https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py |
|
|
|
https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py |
|
|
|
https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py |
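For reference, a minimal sketch of NANSY-style pitch/formant perturbation via Praat, in the spirit of the links above; it requires the `praat-parselmouth` package and is an illustration only, not the exact perturbation this project applies:

```python
# Sketch: perturb formants and pitch range while keeping content, using
# Praat's "Change gender" via parselmouth. Illustrative only; parameter
# ranges are assumptions, not this project's exact settings.
import numpy as np
import parselmouth

def perturb(wav: np.ndarray, sr: int) -> np.ndarray:
    snd = parselmouth.Sound(wav, sampling_frequency=sr)
    formant_shift = np.random.uniform(0.85, 1.15)  # de-identify timbre
    pitch_range = np.random.uniform(0.7, 1.4)      # flatten/expand intonation
    out = parselmouth.praat.call(
        snd, "Change gender",
        75, 600,        # pitch analysis floor/ceiling in Hz
        formant_shift,
        0,              # 0 keeps the original pitch median
        pitch_range,
        1.0,            # keep duration unchanged
    )
    return out.values.squeeze()
```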
|
|
|
## Contributors |
|
|
|
<a href="https://github.com/PlayVoice/so-vits-svc/graphs/contributors"> |
|
<img src="https://contrib.rocks/image?repo=PlayVoice/so-vits-svc" /> |
|
</a> |
|
|
|
## License

AFL-3.0