license: mit

Amphion Singing Voice Conversion Pretrained Models

Quick Start

We provide a pretrained DiffWaveNetSVC checkpoint for you to try. It was trained on real-world vocalist data (total duration: 6.16 hours) from the following 15 professional singers:

Singer	Language	Training Duration (mins)
David Tao 陶喆	Chinese	45.51
Eason Chan 陈奕迅	Chinese	43.36
Feng Wang 汪峰	Chinese	41.08
Jian Li 李健	Chinese	38.90
John Mayer	English	30.83
Adele	English	27.23
Ying Na 那英	Chinese	27.02
Yijie Shi 石倚洁	Chinese	24.93
Jacky Cheung 张学友	Chinese	18.31
Taylor Swift	English	18.31
Faye Wong 王菲	English	16.78
Michael Jackson	English	15.13
Tsai Chin 蔡琴	Chinese	10.12
Bruno Mars	English	6.29
Beyonce	English	6.06
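
As a quick sanity check, the per-singer durations in the table can be summed and converted to hours. This is only a sketch that copies the numbers from the table; it assumes `awk` and `tr` are available.

```shell
# Per-singer training durations in minutes, copied from the table above.
durations="45.51 43.36 41.08 38.90 30.83 27.23 27.02 24.93 18.31 18.31 16.78 15.13 10.12 6.29 6.06"

# Sum the minutes, then convert to hours with two decimals.
total_hours="$(echo "$durations" | tr ' ' '\n' | awk '{ s += $1 } END { printf "%.2f", s / 60 }')"
echo "total: $total_hours hours"   # prints "total: 6.16 hours"
```

The sum is 369.86 minutes, which matches the stated total of 6.16 hours.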

To make these singers sing the songs you want to listen to, just run the following commands:

Step1: Download the acoustic model checkpoint

git lfs install
git clone https://huggingface.co/amphion/singing_voice_conversion

Step2: Download the vocoder checkpoint

git clone https://huggingface.co/amphion/BigVGAN_singing_bigdata

Step3: Clone Amphion's source code from GitHub

git clone https://github.com/open-mmlab/Amphion.git

Step4: Download ContentVec Checkpoint

You can download the ContentVec checkpoint from this repo. This pretrained model uses checkpoint_best_legacy_500.pt, which is the legacy ContentVec with 500 classes.

Step5: Specify the checkpoint paths

Use symbolic links to point to the downloaded checkpoints:

cd Amphion
mkdir -p ckpts/svc
ln -s "$(realpath ../singing_voice_conversion/vocalist_l1_contentvec+whisper)" ckpts/svc/vocalist_l1_contentvec+whisper
ln -s "$(realpath ../BigVGAN_singing_bigdata/bigvgan_singing)" pretrained/bigvgan_singing

Also, move the checkpoint_best_legacy_500.pt file you downloaded in Step4 into Amphion/pretrained/contentvec.
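
The move described above, plus a check that the layout from this step is complete, could be sketched as follows. Both function names are hypothetical helpers, not part of Amphion, and the list of expected paths is an assumption drawn only from the paths used in Step5 and Step6.

```shell
# Hypothetical helpers (not part of Amphion); paths are taken from Step5 and Step6.

# Move a downloaded ContentVec checkpoint into <amphion_root>/pretrained/contentvec.
install_contentvec() {
    ckpt_file="$1"
    amphion_root="${2:-.}"
    mkdir -p "$amphion_root/pretrained/contentvec"
    mv "$ckpt_file" "$amphion_root/pretrained/contentvec/"
}

# Print every expected path that is missing; return non-zero if any is missing.
# A dangling symbolic link also counts as missing, because -e follows links.
check_amphion_layout() {
    amphion_root="${1:-.}"
    missing=0
    for rel in \
        "ckpts/svc/vocalist_l1_contentvec+whisper/args.json" \
        "pretrained/bigvgan_singing" \
        "pretrained/contentvec/checkpoint_best_legacy_500.pt"
    do
        if [ ! -e "$amphion_root/$rel" ]; then
            echo "missing: $rel"
            missing=1
        fi
    done
    return "$missing"
}
```

From inside the Amphion directory, you might call `install_contentvec` with the location of your downloaded file and `.` as the root, then run `check_amphion_layout .` before moving on to the conversion step.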

Step6: Conversion

You can follow this recipe to run the conversion. For example, to make Taylor Swift sing the songs in [Your Audios Folder], run:

sh egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" \
    --config "ckpts/svc/vocalist_l1_contentvec+whisper/args.json" \
    --infer_expt_dir "ckpts/svc/vocalist_l1_contentvec+whisper" \
    --infer_output_dir "ckpts/svc/vocalist_l1_contentvec+whisper/result" \
    --infer_source_audio_dir [Your Audios Folder] \
    --infer_vocoder_dir "pretrained/bigvgan_singing" \
    --infer_target_speaker "vocalist_l1_TaylorSwift" \
    --infer_key_shift "autoshift"

Note: The supported infer_target_speaker values can be seen here.
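
If you want to try several target speakers or source folders, the command above can be wrapped in a small function. This is a minimal sketch: `run_svc` and the `DRY_RUN` variable are hypothetical names, not part of Amphion, and the function only reuses the flags and paths shown in the command above. Run it from inside the Amphion directory.

```shell
# Hypothetical wrapper (not part of Amphion): run the Step6 command for one
# target speaker and one source folder. With DRY_RUN=1 it only prints the command.
run_svc() {
    speaker="$1"
    source_dir="$2"
    expt_dir="ckpts/svc/vocalist_l1_contentvec+whisper"

    # Rebuild the positional parameters as the full command line.
    set -- sh egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" \
        --config "$expt_dir/args.json" \
        --infer_expt_dir "$expt_dir" \
        --infer_output_dir "$expt_dir/result" \
        --infer_source_audio_dir "$source_dir" \
        --infer_vocoder_dir "pretrained/bigvgan_singing" \
        --infer_target_speaker "$speaker" \
        --infer_key_shift "autoshift"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print the command without running it; the folder path is a placeholder.
DRY_RUN=1 run_svc "vocalist_l1_TaylorSwift" "/path/to/your/audios"
```

Printing the command first makes it easy to confirm the speaker name and folder before starting a GPU job. Every run writes to the same output directory as the original command.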

Citations

@article{zhang2023leveraging,
  title={Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion},
  author={Zhang, Xueyao and Gu, Yicheng and Chen, Haopeng and Fang, Zihao and Zou, Lexiao and Xue, Liumeng and Wu, Zhizheng},
  journal={Machine Learning for Audio Workshop, NeurIPS 2023},
  year={2023}
}