Howdy

These are a few test models I made using (and for use with) DDSP-SVC.

I am not experienced with this software or technology, but hope to provide samples which facilitate adoption and interest in this project and associated technologies.

All models are trained on 44.1 kHz samples from English speakers, though thanks to DDSP they generally work fairly well in a variety of other languages.

Training follows the suggestions and best practices of the DDSP-SVC project, with initial learning rates between 0.00010 and 0.00020.

If using DDSP-SVC's gui_diff.py, keep in mind that pitch adjustment is probably required if your voice is deeper than the character's.

For any questions, comments, or suggestions, please use the Community section here.

Models

PrimReaper - (Stereo) Trained on YouTube content from popular YouTuber "The Prim Reaper"
Panam - (Mono) Trained on extracted dialogue audio from the character "Panam" in Cyberpunk 2077
V-F - (Mono) Trained on extracted dialogue audio from the Female "V" character in Cyberpunk 2077
Nora - (Mono) Trained on Fallout 4 dialogue audio from the game character "Nora"

Usage

To use these, place the model file (model_XXXXXX.pt) and configuration file (config.yaml) in a directory.

Each model file should be in its own directory, together with its accompanying config.yaml, or your results may be off, weird, or broken.
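
As a rough illustration of the layout described above, here is a small Python sketch that checks whether a directory holds exactly one model file plus its config.yaml. The function name check_model_dir is my own invention for this example and is not part of DDSP-SVC; the only assumption it makes is the model_XXXXXX.pt / config.yaml naming mentioned above.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems with a model directory (empty list means it looks fine)."""
    model_dir = Path(model_dir)
    problems = []

    # Model files are assumed to follow the model_XXXXXX.pt naming used above.
    checkpoints = sorted(model_dir.glob("model_*.pt"))
    if not checkpoints:
        problems.append("no model_*.pt file found")
    elif len(checkpoints) > 1:
        problems.append("more than one model_*.pt file; keep one model per directory")

    # Each model needs its own config.yaml sitting next to it.
    if not (model_dir / "config.yaml").is_file():
        problems.append("config.yaml is missing")

    return problems
```

Calling it on a directory of your choosing (for example a hypothetical models/Panam) returns an empty list when the layout matches the advice above, and a list of human-readable problems otherwise.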

Settings

For realtime inference, my settings are generally as follows:

Normal Settings

Speaker ID: Always "1"
Response Threshold: -45 (This is mic specific)
Pitch: 10 - 15 depending on model
Sampling rate: Always 44100 for my models
Mix Speaker: All models are single-speaker, so this is not checked
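
To give a feel for what a Pitch value of 10 - 15 means, here is a tiny sketch that assumes the setting is expressed in semitones (I believe that is how DDSP-SVC treats it, but treat it as an assumption). The helper name semitones_to_ratio is just for illustration.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift in semitones (equal temperament)."""
    # 12 semitones make one octave, and one octave doubles the frequency.
    return 2.0 ** (semitones / 12.0)


# A shift of 12 semitones doubles the pitch; 10 and 15 land on either side of that.
for shift in (10, 12, 15):
    print(shift, round(semitones_to_ratio(shift), 3))
```

So under that assumption, a setting of 10 to 15 raises a voice by somewhat less than an octave up to somewhat more than an octave, which fits a deeper voice driving a higher-pitched character.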

Performance Settings

Segmentation Size: 0.45
Cross fade duration: 0.07
Historical blocks used: 8
f0Extractor: rmvpe
Phase vocoder: Depends on the model. I enable it if the model output feels robotic/stuttery, and disable it if it sounds "buttery"
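
For a sense of scale, the sketch below converts the two time-based values above into sample counts at 44100 Hz. It assumes Segmentation Size and Cross fade duration are given in seconds, which is my understanding of the GUI but is not stated above; seconds_to_samples is a made-up helper name.

```python
SAMPLE_RATE = 44100  # matches the sampling rate of these models


def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a duration in seconds to a whole number of audio samples."""
    return round(seconds * sample_rate)


segment_samples = seconds_to_samples(0.45)    # Segmentation Size
crossfade_samples = seconds_to_samples(0.07)  # Cross fade duration
print(segment_samples, crossfade_samples)
```

Under that assumption, each processed segment is a little under half a second of audio, so larger values trade responsiveness for more context per segment.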

Diffusion Settings

K-steps: 200
Speedup: 10
Diffusion method: ddim or pndm, depending on model
Encode silence: Depends on the model, but usually "on" for the best quality
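
As far as I understand it, K-steps and Speedup work together: the sampler skips through the K-steps in strides of the Speedup value, so the number of steps actually computed is roughly K-steps divided by Speedup. The sketch below only illustrates that arithmetic under that assumption; effective_steps is not a DDSP-SVC function.

```python
def effective_steps(k_steps, speedup):
    """Approximate number of sampler steps actually run, assuming speedup is a stride."""
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    # Never report fewer than one step, even if speedup exceeds k_steps.
    return max(1, k_steps // speedup)


print(effective_steps(200, 10))
```

If that reading is right, raising Speedup lowers the cost per segment at some risk to quality, while lowering it does the opposite.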