---
license: creativeml-openrail-m
language:
- en
pipeline_tag: audio-to-audio
tags:
- voice-to-voice
- ddsp-svc
---
# Howdy

These are a few test models I made using (and for use with) [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC).

I am not experienced with this software or technology, but I hope these samples encourage adoption of, and interest in, this project and its associated technologies.

All models are based on 44.1 kHz samples from English speakers, though thanks to [DDSP](https://magenta.tensorflow.org/ddsp), they're generally fairly decent when used with a variety of other languages.

Training follows the suggestions and best practices of the DDSP-SVC project, with initial learning rates ranging between 0.00010 and 0.00020.
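
If you want to start from a similar learning rate, here is a minimal sketch of editing it before training. It assumes a DDSP-SVC-style config.yaml that keeps the value under a `train: lr:` key; that key path and the config location are assumptions, so check the config file in your own DDSP-SVC checkout.

```python
# Minimal sketch: set the initial learning rate in a DDSP-SVC-style config
# before training. The "train" -> "lr" key path and the config path are
# assumptions; adjust them to match your actual config.yaml.
import yaml

CONFIG_PATH = "configs/combsub.yaml"  # hypothetical location

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("train", {})["lr"] = 0.00015  # between 1e-4 and 2e-4, as above

with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(cfg, f)

print("initial lr:", cfg["train"]["lr"])
```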
If using DDSP-SVC's **gui_diff.py**, keep in mind that a pitch adjustment is probably required if your voice is deeper than the character's.
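
For a rough starting value for that pitch adjustment, you can convert the ratio between the target voice's average fundamental frequency and your own into semitones. The frequencies in this sketch are made-up examples, not measurements from these models.

```python
# Rough starting point for the pitch setting: convert the ratio of the target
# voice's average f0 to your own into semitones.
import math

def semitone_shift(source_f0_hz: float, target_f0_hz: float) -> float:
    """Semitones needed to move the source pitch toward the target pitch."""
    return 12 * math.log2(target_f0_hz / source_f0_hz)

# e.g. a ~110 Hz speaking voice targeting a ~220 Hz character voice
print(round(semitone_shift(110.0, 220.0)))  # 12 semitones, i.e. one octave up
```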
For any/all questions/comments/suggestions, please use the Community section here.

## Models

- PrimReaper - (Stereo) Trained on YouTube content from the popular YouTuber "The Prim Reaper"
- Panam - (Mono) Trained on dialogue audio extracted from the Cyberpunk 2077 character "Panam"
- V-F - (Mono) Trained on dialogue audio extracted from the female "V" character in Cyberpunk 2077
- Nora - (Mono) Trained on Fallout 4 dialogue audio from the game character "Nora"

## Usage

To use these, place the model file (model_XXXXXX.pt) and its configuration file (config.yaml) together in a directory.

**It's rather important to mention that each model file should be in its own directory with its accompanying config.yaml, or your results may be off/weird/broken.**
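
As a quick sanity check on that layout, the sketch below walks a directory of models and confirms that each one sits in its own folder next to a config.yaml. The `models/` root and the folder names it finds are just examples, not anything DDSP-SVC requires.

```python
# Sanity-check the expected layout: one folder per model, each containing a
# single model_*.pt checkpoint plus the config.yaml it was trained with.
from pathlib import Path

MODELS_ROOT = Path("models")  # example root, e.g. models/panam/, models/nora/

for model_dir in sorted(p for p in MODELS_ROOT.iterdir() if p.is_dir()):
    checkpoints = list(model_dir.glob("model_*.pt"))
    has_config = (model_dir / "config.yaml").is_file()
    if len(checkpoints) == 1 and has_config:
        print(f"{model_dir.name}: OK ({checkpoints[0].name})")
    else:
        print(f"{model_dir.name}: expected one model_*.pt and a config.yaml")
```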
## Settings

For realtime inference, my settings are generally as follows:

**Normal Settings**

- Speaker ID: Always "1"
- Response Threshold: -45 (this is mic specific; see the conversion sketch after this list)
- Pitch: 10 - 15, depending on the model
- Sampling rate: Always 44100 for my models
- Mix Speaker: All models are single-speaker, so this is **not** checked

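
For context on what that threshold value means, here's the standard dBFS-to-linear-amplitude conversion. How gui_diff.py applies its gate internally is not something I've checked, so treat this purely as orientation.

```python
# Convert the Response Threshold (in dBFS) into a linear full-scale amplitude.
# This is just the standard dB relationship; how gui_diff.py uses the value
# internally is an assumption I have not verified.
threshold_db = -45.0

linear_amplitude = 10 ** (threshold_db / 20)
print(f"{threshold_db} dBFS ~= {linear_amplitude:.4f} of full scale")  # ~0.0056
```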
**Performance Settings**

- Segmentation Size: 0.45 (see the rough latency sketch after this list)
- Cross fade duration: 0.07
- Historical blocks used: 8
- f0Extractor: rmvpe
- Phase vocoder: Depends on the model; I enable it if the output feels robotic/stuttery and disable it if it sounds "buttery"

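
For a rough sense of how much delay these values add, here's a back-of-the-envelope estimate. Treating the perceived delay as roughly one segment plus the crossfade window plus inference time is my own assumption, not a formula from DDSP-SVC.

```python
# Back-of-the-envelope latency estimate for the realtime settings above.
# The "segment + crossfade + inference" model of the delay is an assumption
# on my part, not something documented by DDSP-SVC.
segment_s = 0.45      # Segmentation Size
crossfade_s = 0.07    # Cross fade duration
inference_s = 0.10    # made-up figure; depends entirely on your hardware

approx_delay_ms = (segment_s + crossfade_s + inference_s) * 1000
print(f"~{approx_delay_ms:.0f} ms of added delay")  # ~620 ms with these numbers
```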
**Diffusion Settings**

- K-steps: 200
- Speedup: 10 (see the note after this list)
- Diffusion method: ddim or pndm, depending on the model
- Encode silence: Depends on the model, but usually "on" for the best quality

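
If Speedup acts as an acceleration factor on the diffusion sampler, which is my understanding of the GUI rather than something I've verified in the source, the number of denoising passes actually run per chunk works out as follows.

```python
# If "Speedup" is an acceleration factor applied to the diffusion sampler
# (an assumption about the GUI's behaviour, not verified against the source),
# the effective number of denoising steps per chunk is roughly:
k_steps = 200
speedup = 10
print(k_steps // speedup)  # -> 20 steps with the settings above
```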