license: mit

Amphion Singing Voice Conversion Pretrained Models

Quick Start

We provide a DiffWaveNetSVC pretrained checkpoint for you to try out. Specifically, it is trained on real-world vocalist data (total duration: 6.16 hours) from the following 15 professional singers:

| Singer | Language | Training Duration (mins) |
| --- | --- | --- |
| David Tao 陶喆 | Chinese | 45.51 |
| Eason Chan 陈奕迅 | Chinese | 43.36 |
| Feng Wang 汪峰 | Chinese | 41.08 |
| Jian Li 李健 | Chinese | 38.90 |
| John Mayer | English | 30.83 |
| Adele | English | 27.23 |
| Ying Na 那英 | Chinese | 27.02 |
| Yijie Shi 石倚洁 | Chinese | 24.93 |
| Jacky Cheung 张学友 | Chinese | 18.31 |
| Taylor Swift | English | 18.31 |
| Faye Wong 王菲 | English | 16.78 |
| Michael Jackson | English | 15.13 |
| Tsai Chin 蔡琴 | Chinese | 10.12 |
| Bruno Mars | English | 6.29 |
| Beyonce | English | 6.06 |
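
As a quick sanity check, the per-singer durations can be summed to confirm the stated total. This is only a minimal sketch using standard `awk`; the values are copied from the table, and the variable names are illustrative.

```shell
# Per-singer training durations in minutes, copied from the table.
durations="45.51 43.36 41.08 38.90 30.83 27.23 27.02 24.93 18.31 18.31 16.78 15.13 10.12 6.29 6.06"

# Sum the values and convert minutes to hours (two decimal places).
# $durations is left unquoted on purpose so that each value becomes one line.
total_hours=$(printf '%s\n' $durations | LC_ALL=C awk '{ s += $1 } END { printf "%.2f", s / 60 }')

echo "Total: ${total_hours} hours"   # prints: Total: 6.16 hours
```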

To make these singers sing the songs you want to listen to, just run the following commands:

Step1: Download the acoustic model checkpoint

git lfs install
git clone https://huggingface.co/amphion/singing_voice_conversion

Step2: Download the vocoder checkpoint

git clone https://huggingface.co/amphion/BigVGAN_singing_bigdata
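
Both repositories store their checkpoints with Git LFS. If `git lfs install` was skipped or failed, a clone can leave small text pointer stubs in place of the real files. The sketch below is one way to detect that, relying on the fact that an LFS pointer file starts with a `version https://git-lfs.github.com/spec/v1` line. The helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical and not part of Amphion or Git.

```shell
# Return 0 if the file is a Git LFS pointer stub rather than real content.
# Pointer stubs are tiny text files whose first line names the LFS spec.
is_lfs_pointer() {
    [ -f "$1" ] && head -c 200 "$1" | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Print every pointer stub found under directory $1, skipping Git internals.
# Only small files are inspected, since real checkpoints are far larger.
find_lfs_pointers() {
    find "$1" -type f -size -2k ! -path '*/.git/*' | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            echo "$f"
        fi
    done
}
```

Running `find_lfs_pointers singing_voice_conversion` and `find_lfs_pointers BigVGAN_singing_bigdata` should print nothing. If either lists files, running `git lfs pull` inside that repository should fetch the real content.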

Step3: Clone Amphion's source code from GitHub

git clone https://github.com/open-mmlab/Amphion.git

Step4: Download the ContentVec checkpoint

You can download the ContentVec checkpoint from this repo. For this pretrained model, we used checkpoint_best_legacy_500.pt, which is the legacy ContentVec with 500 classes.

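Step5: Specify the checkpoint paths

Use symbolic links to point Amphion at the downloaded checkpoints. The relative `../` paths below assume that all three repositories were cloned side by side in the same parent directory: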

cd Amphion
mkdir -p ckpts/svc
ln -s "$(realpath ../singing_voice_conversion/vocalist_l1_contentvec+whisper)" ckpts/svc/vocalist_l1_contentvec+whisper
ln -s "$(realpath ../BigVGAN_singing_bigdata/bigvgan_singing)" pretrained/bigvgan_singing

You also need to move the checkpoint_best_legacy_500.pt file that you downloaded in Step4 into Amphion/pretrained/contentvec.
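
The move and a final check of the resulting layout could be scripted roughly as follows. This is a sketch under assumptions: it must be run from the Amphion root, and `install_contentvec` and `check_svc_layout` are hypothetical helper names introduced here for illustration. The three paths checked are the ones named in this step.

```shell
# Move the ContentVec checkpoint into place. Run from the Amphion root.
# $1 is wherever checkpoint_best_legacy_500.pt was downloaded
# (an assumption; substitute your own location).
install_contentvec() {
    mkdir -p pretrained/contentvec &&
        mv "$1" pretrained/contentvec/checkpoint_best_legacy_500.pt
}

# Report whether each path set up in this step exists.
# `-e` follows symlinks, so a dangling link is reported as MISSING.
check_svc_layout() {
    status=0
    for p in \
        "ckpts/svc/vocalist_l1_contentvec+whisper" \
        "pretrained/bigvgan_singing" \
        "pretrained/contentvec/checkpoint_best_legacy_500.pt"; do
        if [ -e "$p" ]; then
            echo "OK      $p"
        else
            echo "MISSING $p"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_svc_layout` after the links are created returns a non-zero status if any link is dangling, which usually means a relative path was resolved from the wrong directory.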

Step6: Conversion

You can follow this recipe to run the conversion. For example, to make Taylor Swift sing the songs in [Your Audios Folder], run:

sh egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" \
    --config "ckpts/svc/vocalist_l1_contentvec+whisper/args.json" \
    --infer_expt_dir "ckpts/svc/vocalist_l1_contentvec+whisper" \
    --infer_output_dir "ckpts/svc/vocalist_l1_contentvec+whisper/result" \
    --infer_source_audio_dir [Your Audios Folder] \
    --infer_vocoder_dir "pretrained/bigvgan_singing" \
    --infer_target_speaker "vocalist_l1_TaylorSwift" \
    --infer_key_shift "autoshift"

Note: The supported infer_target_speaker values are listed here.
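
If you convert several folders or target speakers, it may help to generate the command from two parameters instead of editing it by hand. The sketch below only prints the command and executes nothing. `svc_command` is a hypothetical helper, `/path/to/audios` is a placeholder, and every flag is taken unchanged from the command above.

```shell
# Print the conversion command for a source folder ($1) and a target speaker ($2).
# Nothing is executed, so the output can be reviewed before it is run.
svc_command() {
    src_dir=$1
    speaker=$2
    expt="ckpts/svc/vocalist_l1_contentvec+whisper"
    printf 'sh egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0"'
    printf ' --config "%s/args.json"' "$expt"
    printf ' --infer_expt_dir "%s"' "$expt"
    printf ' --infer_output_dir "%s/result"' "$expt"
    printf ' --infer_source_audio_dir "%s"' "$src_dir"
    printf ' --infer_vocoder_dir "pretrained/bigvgan_singing"'
    printf ' --infer_target_speaker "%s"' "$speaker"
    printf ' --infer_key_shift "autoshift"\n'
}

# Example: print the command for the Taylor Swift voice.
svc_command "/path/to/audios" "vocalist_l1_TaylorSwift"
```

The printed line can be pasted into a shell opened at the Amphion root once the paths look right.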

Citations

@article{zhang2023leveraging,
  title={Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion},
  author={Zhang, Xueyao and Gu, Yicheng and Chen, Haopeng and Fang, Zihao and Zou, Lexiao and Xue, Liumeng and Wu, Zhizheng},
  journal={Machine Learning for Audio Workshop, NeurIPS 2023},
  year={2023}
}