What is Retrieval-based Voice Conversion WebUI?

Community Article Published August 18, 2024

Retrieval-based Voice Conversion WebUI is an open-source framework designed to make voice conversion simple and efficient. Built on the VITS model, it provides an easy-to-use interface for both inference and training, making it accessible even to those with limited experience in machine learning or audio processing. The WebUI supports a range of features, including voice conversion, real-time voice changing, and the ability to train models using small datasets.

UI preview

Training and inference WebUI: go-web.bat or infer-web.py
Real-time voice changing GUI: go-realtime-gui.bat

Key Features:

  • Tone Leakage Reduction: Uses top-1 retrieval to replace each source feature with the closest training-set feature, so less of the source speaker's tone carries over into the output.
  • Easy and Fast Training: Can be done even on low-end graphics cards.
  • Model Fusion: Change timbres by merging checkpoints.
  • UVR5 Integration: Quickly separates vocals from instruments.
  • AMD/Intel Acceleration: Supports GPU acceleration on AMD and Intel graphics cards, not only NVIDIA hardware.
  • Real-Time Voice Changing: End-to-end latency can be as low as 90 ms, depending on hardware and driver support.
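
The top-1 retrieval idea in the first bullet can be illustrated with a small, self-contained sketch. This is not the project's code: the function names (`squared_distance`, `top1_replace`) and the toy 2-D vectors are assumptions made for illustration, and a real system would search learned speech features with a proper vector index rather than a brute-force loop.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top1_replace(source: List[Vector], training: List[Vector]) -> List[Vector]:
    """Replace each source frame with its single nearest training-set frame."""
    if not training:
        raise ValueError("training feature set must not be empty")
    return [min(training, key=lambda t: squared_distance(frame, t)) for frame in source]


# Toy 2-D features (hypothetical values): two source frames, three training frames.
source_frames = [[0.9, 0.1], [0.1, 1.2]]
training_frames = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

converted = top1_replace(source_frames, training_frames)
print(converted)  # [[1.0, 0.0], [0.0, 1.0]]
```

Every output frame comes from the training set, which is why features specific to the source speaker do not pass through unchanged.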

Getting Started with Inference and Training

1. Set Up the Environment

To start using the Retrieval-based Voice Conversion WebUI, you’ll first need to prepare your environment. The framework requires Python 3.8 or higher.
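
Before installing anything, it can help to confirm that the interpreter meets this requirement. The snippet below is a minimal sketch using only the standard library; the helper name `python_is_supported` is made up for this example.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the text above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if python_is_supported():
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python 3.8 or higher is required, found:", sys.version.split()[0])
```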

Install Core Dependencies:

For NVIDIA GPUs:

pip install torch torchvision torchaudio

For AMD GPUs on Linux:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
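
After either install command, a quick check like the following can show whether PyTorch sees a GPU. It is a sketch rather than part of the project: the function name `describe_torch_backend` is an assumption, and it relies on the fact that ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` API.

```python
def describe_torch_backend() -> str:
    """Summarize the installed PyTorch build and the GPU it can see, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}: no GPU detected, CPU only"

    # torch.version.hip is set on ROCm builds and is None on CUDA builds.
    backend = "ROCm" if getattr(torch.version, "hip", None) else "CUDA"
    device = torch.cuda.get_device_name(0)
    return f"PyTorch {torch.__version__}: {backend} GPU '{device}'"


print(describe_torch_backend())
```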

Install Other Dependencies:

Install using Poetry (the first command installs Poetry itself):

curl -sSL https://install.python-poetry.org | python3 -
poetry install

Or using pip:

pip install -r requirements.txt

2. Download Pre-trained Models

The WebUI requires several pre-trained models to function properly. You can download these automatically using a provided script:

python tools/download_models.py

Alternatively, download the models manually from Hugging Face.

3. Install FFmpeg

FFmpeg is necessary for handling audio files. Installation steps vary depending on your operating system:

  • Ubuntu/Debian:
    sudo apt install ffmpeg
    
  • macOS:
    brew install ffmpeg
    
  • Windows: Download ffmpeg.exe and ffprobe.exe from Hugging Face and place them in the project's root folder.
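
To confirm that the steps above worked, a short script can look for both tools. This is a sketch with assumed helper names (`find_tool`, `check_ffmpeg`); it checks the system PATH first and then the folder passed as `project_root`, which defaults to the current directory.

```python
import shutil
from pathlib import Path
from typing import Dict, Optional


def find_tool(name: str, project_root: Path) -> Optional[str]:
    """Look for a tool on PATH first, then directly inside project_root."""
    found = shutil.which(name)
    if found:
        return found
    for candidate in (project_root / name, project_root / f"{name}.exe"):
        if candidate.is_file():
            return str(candidate)
    return None


def check_ffmpeg(project_root: Path = Path(".")) -> Dict[str, Optional[str]]:
    """Return the location of ffmpeg and ffprobe, or None for each one missing."""
    return {tool: find_tool(tool, project_root) for tool in ("ffmpeg", "ffprobe")}


for tool, location in check_ffmpeg().items():
    print(f"{tool}: {location or 'NOT FOUND'}")
```

Run it from the project's root folder so that locally placed `ffmpeg.exe` and `ffprobe.exe` files are found.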

4. Start the WebUI

Once your environment is set up and the necessary models are downloaded, you can start the WebUI.

For general usage:

python infer-web.py

On Windows, you can also start the WebUI by double-clicking go-web.bat.

Training a New Model

Training your own voice conversion model with Retrieval-based Voice Conversion WebUI is straightforward and can be done with as little as 10 minutes of low-noise speech data.

  1. Prepare Your Dataset: Collect low-noise speech recordings of the target voice and preprocess them.
  2. Start the Training Interface: Launch the WebUI as described above and navigate to the training section.
  3. Set Training Parameters: Configure the model parameters and training options based on your dataset.
  4. Begin Training: Start the training process. The WebUI will guide you through each step, providing feedback on the model's progress.
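
As a rough aid for step 1, the sketch below adds up the length of the WAV files in a folder and compares the total with the 10 minutes mentioned above. It is an illustration only: the function names are assumptions, it uses Python's standard `wave` module (which reads uncompressed PCM WAV files only), and it says nothing about noise level or recording quality.

```python
import wave
from pathlib import Path

MIN_SECONDS = 10 * 60  # "as little as 10 minutes", as stated in the text above


def wav_duration_seconds(path: Path) -> float:
    """Duration of one PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_duration_seconds(dataset_dir: Path) -> float:
    """Total duration of all .wav files directly inside dataset_dir."""
    return sum(wav_duration_seconds(p) for p in sorted(dataset_dir.glob("*.wav")))


def report(dataset_dir: Path) -> str:
    """One-line summary of how much audio the folder contains."""
    total = dataset_duration_seconds(dataset_dir)
    status = "enough" if total >= MIN_SECONDS else "less than the suggested 10 minutes"
    return f"{total / 60:.1f} min of audio in {dataset_dir}: {status}"


# Example call, with a hypothetical folder name:
# print(report(Path("my_dataset")))
```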

Conclusion

Retrieval-based Voice Conversion WebUI provides a powerful and flexible tool for voice conversion and real-time voice modification. Whether you're a researcher, developer, or enthusiast, this framework offers a robust set of features to explore voice conversion technology.

For more detailed instructions and updates, visit the official GitHub repository.