license: mit
Amphion Singing Voice Conversion Pretrained Models
Quick Start
We provide a pretrained DiffWaveNetSVC checkpoint for you to try. It was trained on real-world vocalist data (total duration: 6.16 hours) from the following 15 professional singers:
Singer | Language | Training Duration (mins) |
---|---|---|
David Tao 陶喆 | Chinese | 45.51 |
Eason Chan 陈奕迅 | Chinese | 43.36 |
Feng Wang 汪峰 | Chinese | 41.08 |
Jian Li 李健 | Chinese | 38.90 |
John Mayer | English | 30.83 |
Adele | English | 27.23 |
Ying Na 那英 | Chinese | 27.02 |
Yijie Shi 石倚洁 | Chinese | 24.93 |
Jacky Cheung 张学友 | Chinese | 18.31 |
Taylor Swift | English | 18.31 |
Faye Wong 王菲 | English | 16.78 |
Michael Jackson | English | 15.13 |
Tsai Chin 蔡琴 | Chinese | 10.12 |
Bruno Mars | English | 6.29 |
Beyonce | English | 6.06 |
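As a quick sanity check, the stated total of 6.16 hours can be reproduced by summing the per-singer minutes in the table. The sketch below hard-codes the 15 durations from the table and uses awk for the arithmetic; the function name `total_duration` is only illustrative and is not part of Amphion.

```shell
# Sum the per-singer training durations (in minutes) copied from the table above
# and report the total in minutes and hours.
total_duration() {
    printf '%s\n' 45.51 43.36 41.08 38.90 30.83 27.23 27.02 24.93 \
                  18.31 18.31 16.78 15.13 10.12 6.29 6.06 |
    awk '{ sum += $1 } END { printf "%.2f minutes = %.2f hours\n", sum, sum / 60 }'
}

total_duration
```

This prints `369.86 minutes = 6.16 hours`, which matches the total quoted above.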
To make these singers sing the songs you want to listen to, just run the following commands:
Step1: Download the acoustic model checkpoint
git lfs install
git clone https://huggingface.co/amphion/singing_voice_conversion
Step2: Download the vocoder checkpoint
git clone https://huggingface.co/amphion/BigVGAN_singing_bigdata
Step3: Clone Amphion's source code from GitHub
git clone https://github.com/open-mmlab/Amphion.git
Step4: Download ContentVec Checkpoint
You can download the ContentVec checkpoint from this repo. For this pretrained model, we used checkpoint_best_legacy_500.pt, which is the legacy ContentVec with 500 classes.
Step5: Specify the checkpoints' path
Use symbolic links to point Amphion at the downloaded checkpoints:
cd Amphion
mkdir -p ckpts/svc
ln -s "$(realpath ../singing_voice_conversion/vocalist_l1_contentvec+whisper)" ckpts/svc/vocalist_l1_contentvec+whisper
ln -s "$(realpath ../BigVGAN_singing_bigdata/bigvgan_singing)" pretrained/bigvgan_singing
Also, move the checkpoint_best_legacy_500.pt file you downloaded in Step4 into Amphion/pretrained/contentvec.
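Before running the conversion, it can help to confirm that the three locations set up in the steps above actually resolve. The following is a minimal sketch, not part of Amphion: `check_layout` is a hypothetical helper name, and the paths it checks are the ones used in Step5.

```shell
# check_layout ROOT
# Prints one line per expected path under ROOT (the Amphion checkout) and
# returns the number of missing paths, so a return value of 0 means all are present.
check_layout() {
    root="$1"
    missing=0
    for path in \
        "ckpts/svc/vocalist_l1_contentvec+whisper" \
        "pretrained/bigvgan_singing" \
        "pretrained/contentvec/checkpoint_best_legacy_500.pt"
    do
        # -e follows symlinks, so a dangling link is reported as missing
        if [ -e "$root/$path" ]; then
            echo "ok       $path"
        else
            echo "missing  $path"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Usage, from inside the Amphion directory:
#   check_layout .
```

A `missing` line for one of the symlinked directories usually means the link target was moved or the clone in Step1 or Step2 did not finish.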
Step6: Conversion
You can follow this recipe to run the conversion. For example, to make Taylor Swift sing the songs in [Your Audios Folder], run:
sh egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" \
--config "ckpts/svc/vocalist_l1_contentvec+whisper/args.json" \
--infer_expt_dir "ckpts/svc/vocalist_l1_contentvec+whisper" \
--infer_output_dir "ckpts/svc/vocalist_l1_contentvec+whisper/result" \
--infer_source_audio_dir [Your Audios Folder] \
--infer_vocoder_dir "pretrained/bigvgan_singing" \
--infer_target_speaker "vocalist_l1_TaylorSwift" \
--infer_key_shift "autoshift"
Note: The supported infer_target_speaker values can be seen here.
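If you plan to convert several folders or try several target singers, the command above can be wrapped in a small shell function. This is only a sketch: `svc_convert` and the `DRY_RUN` variable are hypothetical names introduced here, while every flag is copied unchanged from the command in Step6.

```shell
# svc_convert SOURCE_DIR TARGET_SPEAKER
# Runs the Step6 conversion command with the source folder and target speaker
# passed as arguments. Set DRY_RUN=1 to print the command instead of running it.
svc_convert() {
    source_dir="$1"
    target_speaker="$2"
    expt_dir="ckpts/svc/vocalist_l1_contentvec+whisper"

    # Rebuild the positional parameters as the full command line
    set -- sh egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" \
        --config "$expt_dir/args.json" \
        --infer_expt_dir "$expt_dir" \
        --infer_output_dir "$expt_dir/result" \
        --infer_source_audio_dir "$source_dir" \
        --infer_vocoder_dir "pretrained/bigvgan_singing" \
        --infer_target_speaker "$target_speaker" \
        --infer_key_shift "autoshift"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage, from inside the Amphion directory:
#   DRY_RUN=1 svc_convert "[Your Audios Folder]" "vocalist_l1_TaylorSwift"
```

Running with `DRY_RUN=1` first lets you inspect the exact command before starting a GPU job.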
Citations
@article{zhang2023leveraging,
title={Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion},
author={Zhang, Xueyao and Gu, Yicheng and Chen, Haopeng and Fang, Zihao and Zou, Lexiao and Xue, Liumeng and Wu, Zhizheng},
journal={Machine Learning for Audio Workshop, NeurIPS 2023},
year={2023}
}