
Neural Source-Filter BigVGAN

Just For Fun

[Figure: nsf_bigvgan_mel]

Dataset preparation

Put the dataset into the data_raw directory, using the following file structure:

data_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
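
Before preprocessing, it can help to confirm that the layout matches the structure above. The following is a minimal sketch, not part of this repository: the function name `summarize_dataset` is hypothetical, and it only assumes one folder per speaker with `.wav` files directly inside.

```python
from pathlib import Path


def summarize_dataset(root="data_raw"):
    """Return {speaker_name: number_of_wav_files} for each speaker folder."""
    counts = {}
    for speaker_dir in sorted(Path(root).iterdir()):
        # Only directories count as speakers; stray files at the top level are ignored.
        if speaker_dir.is_dir():
            counts[speaker_dir.name] = len(list(speaker_dir.glob("*.wav")))
    return counts
```

Calling `summarize_dataset("data_raw")` returns a dictionary of wav counts per speaker. A speaker with a count of 0 usually means the files sit in a nested subfolder or use a different extension.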

Install dependencies

  • 1, install software dependencies

    pip install -r requirements.txt

  • 2, download the release model and test it

    python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --wave test.wav

Data preprocessing

  • 1, resample to 32 kHz

    python prepare/preprocess_a.py -w ./data_raw -o ./data_bigvgan/waves-32k

  • 2, extract pitch

    python prepare/preprocess_f0.py -w data_bigvgan/waves-32k/ -p data_bigvgan/pitch

  • 3, extract mel, shape [100, length]

    python prepare/preprocess_spec.py -w data_bigvgan/waves-32k/ -s data_bigvgan/mel

  • 4, generate training index

    python prepare/preprocess_train.py
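
The steps above depend on each other (pitch and mel extraction read the resampled waves), so they need to run in order. The sketch below is one way to chain them, assuming it is launched from the repository root. The names `STEPS` and `run_preprocessing` are hypothetical; the commands themselves are copied from the steps above.

```python
import subprocess
import sys

# Script arguments copied from the steps above, in the order they must run.
STEPS = [
    ["prepare/preprocess_a.py", "-w", "./data_raw", "-o", "./data_bigvgan/waves-32k"],
    ["prepare/preprocess_f0.py", "-w", "data_bigvgan/waves-32k/", "-p", "data_bigvgan/pitch"],
    ["prepare/preprocess_spec.py", "-w", "data_bigvgan/waves-32k/", "-s", "data_bigvgan/mel"],
    ["prepare/preprocess_train.py"],
]


def run_preprocessing(steps=STEPS, python=sys.executable):
    """Run each step with the given interpreter; stop at the first failure."""
    for args in steps:
        cmd = [python, *args]
        print("running:", " ".join(cmd))
        # check=True raises subprocess.CalledProcessError on a non-zero exit code,
        # so later steps never run on incomplete output.
        subprocess.run(cmd, check=True)
```

Calling `run_preprocessing()` with no arguments runs all four steps with the current Python interpreter.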

data_bigvgan/
│
└── waves-32k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── pitch
│    └── speaker0
│    │      ├── 000001.pit.npy
│    │      └── 000xxx.pit.npy
│    └── speaker1
│           ├── 000001.pit.npy
│           └── 000xxx.pit.npy
└── mel
     └── speaker0
     │      ├── 000001.mel.pt
     │      └── 000xxx.mel.pt
     └── speaker1
            ├── 000001.mel.pt
            └── 000xxx.mel.pt
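
After preprocessing, every wav under waves-32k should have a matching pitch file and mel file, as the tree above shows. The sketch below checks that. It is not part of this repository: the function name `find_missing_features` is hypothetical, and it assumes only the folder names and file suffixes listed in the tree.

```python
from pathlib import Path


def find_missing_features(root="data_bigvgan"):
    """List the pitch/mel files that are missing for wavs under waves-32k."""
    root = Path(root)
    missing = []
    for wav in sorted((root / "waves-32k").glob("*/*.wav")):
        speaker, stem = wav.parent.name, wav.stem
        # 000001.wav is expected to produce 000001.pit.npy and 000001.mel.pt.
        for folder, suffix in (("pitch", ".pit.npy"), ("mel", ".mel.pt")):
            expected = root / folder / speaker / (stem + suffix)
            if not expected.exists():
                missing.append(expected)
    return missing
```

An empty list means the three folders are consistent. Any returned path points at a feature file that one of the extraction steps did not write.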

Train

  • 1, start training

    python nsf_bigvgan_trainer.py -c configs/nsf_bigvgan.yaml -n nsf_bigvgan

  • 2, resume training

    python nsf_bigvgan_trainer.py -c configs/nsf_bigvgan.yaml -n nsf_bigvgan -p chkpt/nsf_bigvgan/***.pth

  • 3, view log

    tensorboard --logdir logs/

Inference

  • 1, export inference model

    python nsf_bigvgan_export.py --config configs/nsf_bigvgan.yaml --checkpoint_path chkpt/nsf_bigvgan/***.pt

  • 2, extract mel

    python spec/inference.py -w test.wav -m test.mel.pt

  • 3, extract F0

    python pitch/inference.py -w test.wav -p test.csv

  • 4, infer

    python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --wave test.wav

    or

    python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --mel test.mel.pt --pit test.csv

Augmentation of mel

Acoustic models tend to produce over-smoothed mel spectrograms. To make the vocoder robust to such input, we apply a Gaussian blur to the mel when training the vocoder:

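# Gaussian blur filter (kernel size 5, sigma 2, one channel)
model_b = get_gaussian_kernel(kernel_size=5, sigma=2, channels=1).to(device)
# Blur the mel: add a channel dimension, filter, then drop it again
mel_b = mel[:, None, :, :]
mel_b = model_b(mel_b)
mel_b = torch.squeeze(mel_b, 1)
# Mix the original mel back in with a random weight in [0, 0.5)
mel_r = torch.rand(1).to(device) * 0.5
mel_b = (1 - mel_r) * mel_b + mel_r * mel
# Generator step: synthesize audio from the blurred mel and the pitch
optim_g.zero_grad()
fake_audio = model_g(mel_b, pit)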

[Figure: mel_gaussian_blur]
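
The body of `get_gaussian_kernel` is not shown in this README, so the following is only an illustration of what a filter with `kernel_size=5` and `sigma=2` typically contains, written in plain Python to stay dependency-free. It assumes a standard normalized 2D Gaussian; the names `gaussian_kernel_2d` and `mix` are hypothetical. In the training code, such weights would sit in a fixed convolution applied to the mel.

```python
import math


def gaussian_kernel_2d(kernel_size=5, sigma=2.0):
    """Return a kernel_size x kernel_size Gaussian kernel whose weights sum to 1."""
    center = (kernel_size - 1) / 2
    weights = [
        [math.exp(-((x - center) ** 2 + (y - center) ** 2) / (2 * sigma ** 2))
         for x in range(kernel_size)]
        for y in range(kernel_size)
    ]
    # Normalize so that blurring does not change the overall mel level.
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]


def mix(blurred, original, r):
    """Blend as in the snippet above: (1 - r) * blurred + r * original."""
    return (1 - r) * blurred + r * original
```

Because `mel_r` is drawn from [0, 0.5), the blurred mel always receives more than half of the weight in the blend, while a random share of the sharp mel is mixed back in.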

Source of code and References

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/mindslab-ai/univnet [paper]

https://github.com/NVIDIA/BigVGAN [paper]