---
license: creativeml-openrail-m
language:
- en
pipeline_tag: audio-to-audio
tags:
- voice-to-voice
- ddsp-svc
---
# Howdy
These are a few test models I made using (and for use with) [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC).
I am not experienced with this software or technology, but hope to provide samples which facilitate adoption and interest in this project and associated technologies.
All models are based on 44.1 kHz samples from English speakers, though thanks to [DDSP](https://magenta.tensorflow.org/ddsp), they generally work fairly well with a variety of other languages.
Training is done following the suggestions and best practices according to the DDSP-SVC project, with initial learning rates ranging between 0.00010 and 0.00020.
If using DDSP-SVC's **gui_diff.py**, keep in mind that pitch adjustment is probably required if your voice is deeper than the character.
For any questions, comments, or suggestions, please use the Community section here.
## Models
- PrimReaper - (Stereo) Trained on YouTube content from popular YouTuber "The Prim Reaper"
- Panam - (Mono) Trained on dialogue audio extracted from Cyberpunk 2077 for the character "Panam"
- V-F - (Mono) Trained on dialogue audio extracted from Cyberpunk 2077 for the female "V" character
- Nora - (Mono) Trained on Fallout 4 dialogue audio from the game character "Nora"
## Usage
To use these, place the model file (model_XXXXXX.pt) and configuration file (config.yaml) in a directory.
**Each model file should live in its own directory together with its accompanying config.yaml, or your results may be off, weird, or outright broken.**
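
As a rough sanity check of the layout described above, here is a minimal Python sketch. It is not part of DDSP-SVC; the helper name `check_model_dir` and the example directory names are hypothetical. It assumes "one model per directory" means exactly one `model_*.pt` file next to a `config.yaml`.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of layout problems for one model directory (empty if none)."""
    d = Path(model_dir)
    if not d.is_dir():
        return ["not a directory"]

    problems = []

    # Assumption: exactly one checkpoint named model_*.pt per directory.
    checkpoints = sorted(d.glob("model_*.pt"))
    if not checkpoints:
        problems.append("no model_*.pt file found")
    elif len(checkpoints) > 1:
        names = ", ".join(p.name for p in checkpoints)
        problems.append("more than one model file: " + names)

    # The config must sit next to the checkpoint it was trained with.
    if not (d / "config.yaml").is_file():
        problems.append("config.yaml is missing")

    return problems


if __name__ == "__main__":
    # Hypothetical directory names; adjust to wherever you saved the models.
    for name in ["models/PrimReaper", "models/Panam", "models/V-F", "models/Nora"]:
        issues = check_model_dir(name)
        print(name, "OK" if not issues else issues)
```

An empty list means the directory matches the layout; anything else names what to fix before loading the model.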
## Settings
For realtime inference, my settings are generally as follows:
**Normal Settings**
- Speaker ID: Always "1"
- Response Threshold: -45 (This is mic-specific)
- Pitch: 10 - 15 depending on model
- Sampling rate: Always 44100 for my models
- Mix Speaker: All models are single-speaker, so this is **not** checked
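
To get a feel for how large a Pitch value of 10 to 15 is, the sketch below converts a shift into a frequency ratio. It assumes the Pitch setting is measured in semitones (an assumption worth checking against the DDSP-SVC documentation) and uses the standard equal-temperament relation, ratio = 2^(n/12). The function name is hypothetical.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift of `semitones` (equal temperament)."""
    return 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    # Assumption: the GUI's Pitch value is a semitone offset.
    for shift in (10, 12, 15):
        print(shift, "semitones ->", round(semitones_to_ratio(shift), 3), "x frequency")
```

Under that assumption, a shift of 12 doubles the fundamental frequency, so the 10 to 15 range used here raises a voice by roughly an octave.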
**Performance Settings**
- Segmentation Size: 0.45
- Cross fade duration: 0.07
- Historical blocks used: 8
- f0Extractor: rmvpe
- Phase vocoder: Depends on the model. I enable it if the output sounds robotic or stuttery, and disable it if the output sounds "buttery"
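
For a sense of scale, the sketch below converts the segmentation size and cross fade duration into sample counts at 44100 Hz. It assumes both values are given in seconds, which is my reading of the GUI and not something stated above; the function name is hypothetical.

```python
def seconds_to_samples(seconds, sample_rate):
    """Number of audio samples covered by `seconds` at `sample_rate` Hz."""
    return round(seconds * sample_rate)


if __name__ == "__main__":
    sample_rate = 44100  # all models here use 44.1 kHz
    # Assumption: both settings are expressed in seconds.
    print("segment:", seconds_to_samples(0.45, sample_rate), "samples")
    print("cross fade:", seconds_to_samples(0.07, sample_rate), "samples")
```

If the assumption holds, each processed block covers a bit under half a second of audio, which sets a floor on realtime latency before any inference time is added.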
**Diffusion Settings**
- K-steps: 200
- Speedup: 10
- Diffusion method: ddim or pndm, depending on model
- Encode silence: Depends on the model, but usually "on" for the best quality