
Amphion Singing Voice Conversion (SVC) Recipe

Quick Start

We provide a beginner recipe to demonstrate how to train a cutting-edge SVC model. It is also the official implementation of the paper "Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion" (NeurIPS 2023 Workshop on Machine Learning for Audio). Some demos can be seen here.

Supported Model Architectures

The main idea of SVC is to first disentangle the speaker-agnostic representations from the source audio and then inject the desired speaker information to synthesize the target. Synthesis usually relies on an acoustic decoder followed by a waveform synthesizer (vocoder):



Amphion SVC currently supports the following features and models:

  • Speaker-agnostic Representations:
  • Speaker Embeddings:
    • Speaker Look-Up Table.
    • Reference Encoder (👨‍💻 developing): It can be used for zero-shot SVC.
  • Acoustic Decoders:
    • Diffusion-based models:
    • Transformer-based models:
      • TransformerSVC: Encoder-only and Non-autoregressive Transformer Architecture.
    • VAE- and Flow-based models:
      • VitsSVC (👨‍💻 developing): A VITS-like model whose textual input is replaced by the content features, similar to so-vits-svc.
  • Waveform Synthesizers (Vocoders):
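
To make the data flow described above more concrete, here is a minimal toy sketch in plain Python. It is not Amphion's code or API: every name in it (`SpeakerLookUpTable`, `inject_speaker`, `toy_acoustic_decoder`, `toy_vocoder`) is hypothetical, the embeddings are random rather than learned, and the decoder and vocoder are placeholders, so only the shapes and the order of the steps are meaningful. Injecting the speaker by concatenation is also an assumption made for illustration; real systems may add or otherwise fuse the embedding.

```python
import random


class SpeakerLookUpTable:
    """Toy speaker look-up table: one fixed-size embedding per known speaker ID.

    In a real system these embeddings are learned; here they are random.
    """

    def __init__(self, speaker_ids, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.table = {
            spk: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for spk in speaker_ids
        }

    def __call__(self, speaker_id):
        return self.table[speaker_id]


def inject_speaker(content_frames, speaker_embedding):
    """Condition each speaker-agnostic frame on the target speaker by
    concatenating the same (time-invariant) speaker embedding to it."""
    return [frame + speaker_embedding for frame in content_frames]


def toy_acoustic_decoder(conditioned_frames, n_mels):
    """Placeholder acoustic decoder: a real one is a trained network.
    Each output frame repeats the mean of its input frame n_mels times."""
    return [[sum(f) / len(f)] * n_mels for f in conditioned_frames]


def toy_vocoder(mel_frames, hop_size):
    """Placeholder vocoder: emits hop_size samples per acoustic frame."""
    return [frame[0] for frame in mel_frames for _ in range(hop_size)]


# 5 frames of 4-dimensional stand-in "content features" from the source audio.
content = [[0.1 * t] * 4 for t in range(5)]

lut = SpeakerLookUpTable(["singer_a", "singer_b"], dim=3)
conditioned = inject_speaker(content, lut("singer_b"))  # 5 frames x (4 + 3)
mel = toy_acoustic_decoder(conditioned, n_mels=8)       # 5 frames x 8
wav = toy_vocoder(mel, hop_size=2)                      # 5 * 2 samples

print(len(conditioned), len(conditioned[0]), len(mel[0]), len(wav))
```

A look-up table can only serve speakers seen during training, which is why a reference encoder that computes an embedding from reference audio is the component associated with zero-shot SVC in the list above.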