<div align="center">
<img alt="LOGO" src="https://cdn.jsdelivr.net/gh/fishaudio/fish-diffusion@main/images/logo_512x512.png" width="256" height="256" />
# Fish Diffusion
<div>
<a href="https://github.com/fishaudio/fish-diffusion/actions/workflows/ci.yml">
<img alt="Build Status" src="https://img.shields.io/github/actions/workflow/status/fishaudio/fish-diffusion/ci.yml?style=flat-square&logo=GitHub">
</a>
<a href="https://hub.docker.com/r/lengyue233/fish-diffusion">
<img alt="Docker Hub" src="https://img.shields.io/docker/cloud/build/lengyue233/fish-diffusion?style=flat-square&logo=Docker&logoColor=white">
</a>
<a href="https://discord.gg/wbYSRBrW2E">
<img alt="Discord" src="https://img.shields.io/discord/1044927142900809739?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square">
</a>
</div>
</div>
------
An easy-to-understand TTS / SVS / SVC training framework.
> Check our [Wiki](https://fishaudio.github.io/fish-diffusion/) to get started!
[Chinese Documentation (δΈ­ζ–‡ζ–‡ζ‘£)](README.md)
## Summary
This repository uses a diffusion model to solve various voice generation tasks. Compared with the original diff-svc repository, it has the following advantages:
+ Multi-speaker support
+ A simpler, easier-to-understand code structure in which all modules are decoupled
+ Support for the [44.1kHz Diff Singer community vocoder](https://openvpi.github.io/vocoders/)
+ Multi-machine, multi-device training and half-precision training, which speed up training and reduce memory usage
## Preparing the environment
The following commands need to be executed in a conda environment with Python 3.10.
```bash
# Install PyTorch related core dependencies, skip if installed
# Reference: https://pytorch.org/get-started/locally/
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Install Poetry dependency management tool, skip if installed
# Reference: https://python-poetry.org/docs/#installation
curl -sSL https://install.python-poetry.org | python3 -
# Install the project dependencies
poetry install
```
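After installation, you can quickly confirm that PyTorch sees your GPU before proceeding. A minimal check, assuming a CUDA-capable machine and that Poetry created a virtual environment:
```bash
# Enter the Poetry-managed virtual environment (skip if you installed into the conda env directly)
poetry shell
# Verify that PyTorch is installed and CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```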
## Vocoder preparation
Fish Diffusion requires the [OpenVPI 44.1kHz NSF-HiFiGAN](https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1) vocoder to generate audio.
### Automatic download
```bash
python tools/download_nsf_hifigan.py
```
When downloading the model with this script, you can pass the `--agree-license` parameter to accept the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
```bash
python tools/download_nsf_hifigan.py --agree-license
```
### Manual download
Download and unzip `nsf_hifigan_20221211.zip` from the [44.1kHz vocoder release](https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1).
Copy the `nsf_hifigan` folder to the `checkpoints` directory (create it if it does not exist).
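The manual steps can also be scripted. A minimal sketch, assuming the standard GitHub release download URL for the archive:
```bash
# Download the vocoder archive (URL assumed from the release tag)
wget https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
# Unzip and copy the nsf_hifigan folder into checkpoints (created if missing)
unzip nsf_hifigan_20221211.zip
mkdir -p checkpoints
cp -r nsf_hifigan checkpoints/
```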
## Dataset preparation
You only need to put the dataset into the `dataset` directory with the following file structure:
```shell
dataset
β”œβ”€β”€β”€train
β”‚ β”œβ”€β”€β”€xxx1-xxx1.wav
β”‚ β”œβ”€β”€β”€...
β”‚ β”œβ”€β”€β”€Lxx-0xx8.wav
β”‚ └───speaker0 (subdirectories are also supported)
β”‚ └───xxx1-xxx1.wav
└───valid
β”œβ”€β”€β”€xx2-0xxx2.wav
β”œβ”€β”€β”€...
└───xxx7-xxx007.wav
```
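For example, the layout can be set up as follows; `my_recordings` is a hypothetical placeholder for your own audio folder:
```bash
# Create the expected directory layout
mkdir -p dataset/train dataset/valid
# Copy your own recordings in (my_recordings is hypothetical; use your actual paths)
cp my_recordings/train/*.wav dataset/train/
cp my_recordings/valid/*.wav dataset/valid/
```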
```bash
# Extract all data features, such as pitch, text features, mel features, etc.
python tools/preprocessing/extract_features.py --config configs/svc_hubert_soft.py --path dataset --clean
```
## Baseline training
> The project is under active development; please back up your config file.
```bash
# Single-machine single-GPU / multi-GPU training
python train.py --config configs/svc_hubert_soft.py
# Resume training
python train.py --config configs/svc_hubert_soft.py --resume [checkpoint]
# Fine-tune the pre-trained model
# Note: you should switch the learning rate scheduler in the config file to warmup_cosine_finetune
python train.py --config configs/svc_hubert_soft.py --pretrained [checkpoint]
```
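Training progress can be monitored with TensorBoard. A minimal sketch, assuming your config uses a TensorBoard logger writing to a `logs` directory (the actual logger and directory depend on your config file):
```bash
# Point TensorBoard at the assumed log directory
tensorboard --logdir logs
```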
## Inference
```bash
# Inference from the shell; use --help to view more parameters
python inference.py --config [config] \
--checkpoint [checkpoint] \
--input [input audio] \
--output [output audio]
# Gradio web inference; other parameters are passed to Gradio as defaults
python inference/gradio_inference.py --config [config] \
--checkpoint [checkpoint] \
--gradio
```
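As a concrete example, a filled-in invocation might look like the following; all paths here are hypothetical placeholders for your own files:
```bash
python inference.py --config configs/svc_hubert_soft.py \
    --checkpoint checkpoints/my_model.ckpt \
    --input raw/source.wav \
    --output results/converted.wav
```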
## Convert a DiffSVC model to Fish Diffusion
```bash
python tools/diff_svc_converter.py --config configs/svc_hubert_soft_diff_svc.py \
--input-path [DiffSVC ckpt] \
--output-path [Fish Diffusion ckpt]
```
## Contributing
If you have any questions, please submit an issue or pull request.
You should run `tools/lint.sh` before submitting a pull request.
Documentation with live reload can be generated with:
```bash
sphinx-autobuild docs docs/_build/html
```
## Credits
+ [diff-svc original](https://github.com/prophesier/diff-svc)
+ [diff-svc optimized](https://github.com/innnky/diff-svc/)
+ [DiffSinger](https://github.com/openvpi/DiffSinger/)
+ [SpeechSplit](https://github.com/auspicious3000/SpeechSplit)
## Thanks to all contributors for their efforts
<a href="https://github.com/fishaudio/fish-diffusion/graphs/contributors" target="_blank">
<img src="https://contrib.rocks/image?repo=fishaudio/fish-diffusion" />
</a>