Spaces:
Running
on
A10G
Running
on
A10G
# Amphion Singing Voice Conversion (SVC) Recipe | |
## Quick Start | |
We provide a **[beginner recipe](MultipleContentsSVC)** to demonstrate how to train a cutting edge SVC model. Specifically, it is also an official implementation of the paper "[Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion](https://arxiv.org/abs/2310.11160)" (NeurIPS 2023 Workshop on Machine Learning for Audio). Some demos can be seen [here](https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html). | |
## Supported Model Architectures | |
The main idea of SVC is to first disentangle the speaker-agnostic representations from the source audio, and then inject the desired speaker information to synthesize the target, which usually utilizes an acoustic decoder and a subsequent waveform synthesizer (vocoder): | |
<br> | |
<div align="center"> | |
<img src="../../imgs/svc/pipeline.png" width="70%"> | |
</div> | |
<br> | |
Until now, Amphion SVC has supported the following features and models: | |
- **Speaker-agnostic Representations**: | |
- Content Features: Sourcing from [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). | |
- Prosody Features: F0 and energy. | |
- **Speaker Embeddings**: | |
- Speaker Look-Up Table. | |
- Reference Encoder (👨💻 developing): It can be used for zero-shot SVC. | |
- **Acoustic Decoders**: | |
- Diffusion-based models: | |
- **[DiffWaveNetSVC](MultipleContentsSVC)**: The encoder is based on Bidirectional Non-Causal Dilated CNN, which is similar to [WaveNet](https://arxiv.org/pdf/1609.03499.pdf), [DiffWave](https://openreview.net/forum?id=a-xFK8Ymz5J), and [DiffSVC](https://ieeexplore.ieee.org/document/9688219). | |
- **[DiffComoSVC](DiffComoSVC)** (👨💻 developing): The diffusion framework is based on [Consistency Model](https://proceedings.mlr.press/v202/song23a.html). It can significantly accelerate the inference process of the diffusion model. | |
- Transformer-based models: | |
- **[TransformerSVC](TransformerSVC)**: Encoder-only and Non-autoregressive Transformer Architecture. | |
- VAE- and Flow-based models: | |
- **[VitsSVC]()** (👨💻 developing): It is designed as a [VITS](https://arxiv.org/abs/2106.06103)-like model whose textual input is replaced by the content features, which is similar to [so-vits-svc](https://github.com/svc-develop-team/so-vits-svc). | |
- **Waveform Synthesizers (Vocoders)**: | |
- The supported vocoders can be seen in [Amphion Vocoder Recipe](../vocoder/README.md). | |