
BigVGAN: A Universal Neural Vocoder with Large-Scale Training

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

Paper

Audio demo

Installation

Clone the repository and install dependencies.

# The codebase has been tested on Python 3.8 / 3.10 with PyTorch 1.12.1 / 1.13 conda binaries.
git clone https://github.com/NVIDIA/BigVGAN
cd BigVGAN
pip install -r requirements.txt

Create symbolic links to the root of the dataset. The codebase uses filelists whose paths are relative to the dataset root. Below are example commands for the LibriTTS dataset.

cd LibriTTS && \
ln -s /path/to/your/LibriTTS/train-clean-100 train-clean-100 && \
ln -s /path/to/your/LibriTTS/train-clean-360 train-clean-360 && \
ln -s /path/to/your/LibriTTS/train-other-500 train-other-500 && \
ln -s /path/to/your/LibriTTS/dev-clean dev-clean && \
ln -s /path/to/your/LibriTTS/dev-other dev-other && \
ln -s /path/to/your/LibriTTS/test-clean test-clean && \
ln -s /path/to/your/LibriTTS/test-other test-other && \
cd ..
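
The same links can also be created from Python, which may be convenient when only some subsets have been downloaded. The following is a minimal sketch, not part of the codebase: `link_libritts` and `LIBRITTS_SUBSETS` are hypothetical names, and the subset names are copied from the commands above.

```python
import os
from pathlib import Path

# Subset names taken from the shell commands above.
LIBRITTS_SUBSETS = [
    "train-clean-100", "train-clean-360", "train-other-500",
    "dev-clean", "dev-other", "test-clean", "test-other",
]

def link_libritts(dataset_root, link_dir="LibriTTS"):
    """Create one symlink per available subset inside link_dir.

    Hypothetical helper: returns the list of links that were created.
    """
    dataset_root = Path(dataset_root)
    link_dir = Path(link_dir)
    link_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name in LIBRITTS_SUBSETS:
        target = dataset_root / name
        link = link_dir / name
        if not target.is_dir():
            continue  # subset not downloaded, skip it
        if link.is_symlink() or link.exists():
            continue  # leave existing entries untouched
        os.symlink(target.resolve(), link, target_is_directory=True)
        created.append(link)
    return created
```

Calling `link_libritts("/path/to/your/LibriTTS")` from the repository root should leave the same layout as the shell commands, skipping any subset that is missing.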

Training

Train a BigVGAN model. Below is an example command for training BigVGAN on the LibriTTS dataset at 24 kHz with a full 100-band mel spectrogram as input.

python train.py \
--config configs/bigvgan_24khz_100band.json \
--input_wavs_dir LibriTTS \
--input_training_file LibriTTS/train-full.txt \
--input_validation_file LibriTTS/val-full.txt \
--list_input_unseen_wavs_dir LibriTTS LibriTTS \
--list_input_unseen_validation_file LibriTTS/dev-clean.txt LibriTTS/dev-other.txt \
--checkpoint_path exp/bigvgan
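
For a custom dataset, the filelists passed to `--input_training_file` and `--input_validation_file` have to be produced somehow. The sketch below collects wav paths relative to the dataset root. It assumes one relative path per line; the exact line format expected by `train.py` is not specified here, so compare the output against an existing filelist such as LibriTTS/train-full.txt before using it. `build_filelist` and `write_filelist` are hypothetical helpers.

```python
from pathlib import Path

def build_filelist(dataset_root, subsets, keep_extension=True):
    """Return sorted wav paths relative to dataset_root for the given subsets.

    Assumption: entries are plain relative paths. Set keep_extension=False
    if the expected format omits the ".wav" suffix.
    """
    root = Path(dataset_root)
    entries = []
    for subset in subsets:
        for wav in sorted((root / subset).rglob("*.wav")):
            rel = wav.relative_to(root)
            if not keep_extension:
                rel = rel.with_suffix("")
            entries.append(rel.as_posix())
    return entries

def write_filelist(entries, out_path):
    """Write one entry per line, ending with a newline."""
    Path(out_path).write_text("\n".join(entries) + "\n", encoding="utf-8")
```

For example, `write_filelist(build_filelist("LibriTTS", ["dev-clean"]), "my-dev-clean.txt")` would list every wav file under LibriTTS/dev-clean.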

Synthesis

Synthesize audio from a BigVGAN model. The example command below computes mel spectrograms from the wav files in --input_wavs_dir and saves the generated audio to --output_dir.

python inference.py \
--checkpoint_file exp/bigvgan/g_05000000 \
--input_wavs_dir /path/to/your/input_wav \
--output_dir /path/to/your/output_wav

inference_e2e.py supports synthesis directly from mel spectrograms saved in .npy format, with shape [1, channel, frame] or [channel, frame]. It loads mel spectrograms from --input_mels_dir and saves the generated audio to --output_dir.

Make sure that the STFT hyperparameters used to compute the mel spectrograms match those of the model, which are defined in the config.json of the corresponding model.

python inference_e2e.py \
--checkpoint_file exp/bigvgan/g_05000000 \
--input_mels_dir /path/to/your/input_mel \
--output_dir /path/to/your/output_wav

Pretrained Models

We provide pretrained models. The generator checkpoints (e.g., g_05000000) and discriminator checkpoints (e.g., do_05000000) can be downloaded from the listed folders.

| Folder Name | Sampling Rate | Mel band | fmax | Params. | Dataset | Fine-Tuned |
|---|---|---|---|---|---|---|
| bigvgan_24khz_100band | 24 kHz | 100 | 12000 | 112M | LibriTTS | No |
| bigvgan_base_24khz_100band | 24 kHz | 100 | 12000 | 14M | LibriTTS | No |
| bigvgan_22khz_80band | 22 kHz | 80 | 8000 | 112M | LibriTTS + VCTK + LJSpeech | No |
| bigvgan_base_22khz_80band | 22 kHz | 80 | 8000 | 14M | LibriTTS + VCTK + LJSpeech | No |
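
The fmax column can be read against the Nyquist frequency, which is half the sampling rate: a model whose fmax equals the Nyquist frequency covers the full band, while a lower fmax means the mel spectrogram is band-limited. The short sketch below works through that arithmetic using the nominal values from the table. It takes 22 kHz at face value as 22,000 Hz, which is an assumption; the exact rate is defined in each model's config.json, and the comparison gives the same answer for any rate above 16 kHz.

```python
# (folder name, nominal sampling rate in Hz, fmax in Hz), copied from the table.
MODELS = [
    ("bigvgan_24khz_100band", 24000, 12000),
    ("bigvgan_base_24khz_100band", 24000, 12000),
    ("bigvgan_22khz_80band", 22000, 8000),
    ("bigvgan_base_22khz_80band", 22000, 8000),
]

def is_band_limited(sampling_rate_hz, fmax_hz):
    """True when fmax lies below the Nyquist frequency (half the sampling rate)."""
    return fmax_hz < sampling_rate_hz / 2

for name, sr, fmax in MODELS:
    label = "band-limited" if is_band_limited(sr, fmax) else "full-band"
    print(f"{name}: Nyquist {sr / 2:.0f} Hz, fmax {fmax} Hz -> {label}")
```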

The paper results are based on the 24 kHz BigVGAN models trained on the LibriTTS dataset. We also provide 22 kHz BigVGAN models with a band-limited setup (i.e., fmax=8000) for TTS applications. Note that the latest checkpoints use the snakebeta activation with log-scale parameterization, which has the best overall quality.
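
As a rough illustration of what log-scale parameterization means for snakebeta, the scalar sketch below assumes the activation has the form x + (1/β) sin²(αx), with α = exp(log_alpha) and β = exp(log_beta) so that both stay positive whatever the stored values are. This is a sketch under that assumption, not the module from the codebase: the function name, the scalar arguments, and the `eps` guard are illustrative, and the real implementation operates on per-channel tensors.

```python
import math

def snakebeta(x, log_alpha=0.0, log_beta=0.0, eps=1e-9):
    """Scalar snakebeta sketch with log-scale parameters (assumed form).

    alpha controls the frequency of the periodic term and beta its
    magnitude; exponentiating the stored values keeps both positive.
    """
    alpha = math.exp(log_alpha)
    beta = math.exp(log_beta)
    # eps guards against division by a vanishingly small beta.
    return x + (1.0 / (beta + eps)) * math.sin(alpha * x) ** 2
```

With both log-parameters at zero, α and β equal 1 and the function reduces to x + sin²(x).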

TODO

The current codebase provides only a plain PyTorch implementation of the filtered nonlinearity. We are working on a fast CUDA kernel implementation, which will be released in the future.

References