
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Based on the script train_fastspeech2.py.

Training FastSpeech2 from scratch with the LJSpeech dataset.

This example code shows you how to train FastSpeech2 from scratch with TensorFlow 2 using a custom training loop and tf.function. The data used in this example is LJSpeech; you can download the dataset at link.

Step 1: Create a TensorFlow-based dataloader (tf.data.Dataset)

First, you need to define a data loader based on the AbstractDataset class (see abstract_dataset.py). In this example, the dataloader reads the dataset from a path and uses filename suffixes to classify which file holds characters, durations, or mel-spectrograms (see fastspeech2_dataset.py). If you already have a preprocessed version of your target dataset, you don't need to use this example dataloader; just refer to it and modify the generator function to fit your case. Normally, a generator function should return [charactor_ids, duration, f0, energy, mel]. Please see the tacotron2 example to learn how to extract durations (Extract Duration).
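
For orientation, here is a minimal sketch of such a generator, assuming one preprocessed .npy file per utterance and feature; the utterance IDs, paths, and suffixes below are hypothetical placeholders, and the real fastspeech2_dataset.py does more (padding, caching, batching):

import numpy as np
import tensorflow as tf

# Hypothetical utterance IDs and file layout; adapt to your preprocessing.
utt_ids = ["LJ001-0001", "LJ001-0002"]

def generator():
    for utt_id in utt_ids:
        charactor_ids = np.load(f"./dump/train/{utt_id}-ids.npy")
        duration = np.load(f"./dump/train/{utt_id}-durations.npy")
        f0 = np.load(f"./dump/train/{utt_id}-raw-f0.npy")
        energy = np.load(f"./dump/train/{utt_id}-raw-energy.npy")
        mel = np.load(f"./dump/train/{utt_id}-raw-feats.npy")
        yield charactor_ids, duration, f0, energy, mel

# Wrap the generator in a tf.data pipeline for the training loop.
dataset = tf.data.Dataset.from_generator(
    generator,
    output_types=(tf.int32, tf.int32, tf.float32, tf.float32, tf.float32),
)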

Step 2: Training from scratch

After you redefine your dataloader, please modify the input arguments train_dataset and valid_dataset in train_fastspeech2.py. Here is an example command line for training FastSpeech2 from scratch:

CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/train_fastspeech2.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
  --outdir ./examples/fastspeech2/exp/train.fastspeech2.v1/ \
  --config ./examples/fastspeech2/conf/fastspeech2.v1.yaml \
  --use-norm 1 \
  --f0-stat ./dump/stats_f0.npy \
  --energy-stat ./dump/stats_energy.npy \
  --mixed_precision 1 \
  --resume ""

If you want to use multiple GPUs for training, you can replace CUDA_VISIBLE_DEVICES=0 with, for example, CUDA_VISIBLE_DEVICES=0,1,2,3. You also need to tune the batch_size for each GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.
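
Under the hood, multi-GPU training in TensorFlow 2 is typically driven by tf.distribute.MirroredStrategy, which replicates the model on every GPU left visible by CUDA_VISIBLE_DEVICES and splits each batch across them. The following is only a minimal sketch of that pattern, not necessarily the exact setup in train_fastspeech2.py:

import tensorflow as tf

# MirroredStrategy picks up all GPUs visible to the process.
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Build the model and optimizer inside the scope so their variables
    # are mirrored across all replicas.
    ...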

In case you want to resume training, please pass --resume with the checkpoint path, as in the example below:

--resume ./examples/fastspeech2/exp/train.fastspeech2.v1/checkpoints/ckpt-100000

If you want to fine-tune a model, use --pretrained with your model filename, like this:

--pretrained pretrained.h5

You can also define var_train_expr in the config file to train only some layers, in case you want to fine-tune on your own dataset with the same pretrained language and processor. For example, var_train_expr: "embeddings|encoder|decoder" means only variables whose names contain embeddings, encoder, or decoder are trained.
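
Conceptually, such an expression is just a regular expression matched against variable names. Here is a minimal sketch of that idea (the helper name filter_trainable_variables is hypothetical; the actual filtering lives in the training script):

import re

def filter_trainable_variables(model, var_train_expr):
    # Keep only variables whose name matches the expression, e.g.
    # "embeddings|encoder|decoder" matches any variable whose name
    # contains one of those substrings.
    pattern = re.compile(var_train_expr)
    return [v for v in model.trainable_variables if pattern.search(v.name)]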

Step 3: Decode mel-spectrograms from a folder of character IDs

CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/decode_fastspeech2.py \
  --rootdir ./dump/valid \
  --outdir ./predictions/fastspeech2.v1/ \
  --config ./examples/fastspeech2/conf/fastspeech2.v1.yaml \
  --checkpoint ./examples/fastspeech2/checkpoints/model-150000.h5 \
  --batch-size 8

What's the difference?

  • It's not easy for the model to learn to predict f0/energy at the mel-frame level as the paper did. Instead, I average f0/energy over each character's duration to get f0/energy at the character level, then add them to the encoder hidden states before passing through the Length-Regulator (see the sketch after this list).
  • I apply mean/std normalization to both f0 and energy. Note that before calculating the mean and std values over the whole training set, I remove all outliers from f0 and energy.
  • Instead of using 256 bins for F0 and energy as in the FastSpeech2 paper, I let the model learn to predict the real f0/energy values, then pass them through a single Conv1D layer with kernel_size 9 to project each f0/energy scalar to a vector, as the FastPitch paper suggests.
  • There are other modifications needed to make it work; please read the code carefully to make sure you don't miss anything :D.
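
To make the first point concrete, here is a minimal sketch of duration-based averaging, assuming frame-level f0/energy arrays and per-character durations that sum to the number of frames; the function name is mine, so see the preprocessing code for the exact implementation:

import numpy as np

def average_by_duration(values, durations):
    # values: frame-level f0 or energy, shape [n_frames].
    # durations: frames per character, shape [n_chars], summing to n_frames.
    offsets = np.cumsum(np.concatenate([[0], durations])).astype(int)
    char_values = np.zeros(len(durations), dtype=np.float32)
    for i in range(len(durations)):
        seg = values[offsets[i]:offsets[i + 1]]
        seg = seg[seg > 0.0]  # skip unvoiced/zero frames when averaging
        char_values[i] = seg.mean() if len(seg) > 0 else 0.0
    return char_values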

Pretrained Models and Audio samples

Model Conf Lang Fs [Hz] Mel range [Hz] FFT / Hop / Win [pt] # iters
fastspeech2.v1 link EN 22.05k 80-7600 1024 / 256 / None 150k
fastspeech2.kss.v1 link KO 22.05k 80-7600 1024 / 256 / None 200k
fastspeech2.kss.v2 link KO 22.05k 80-7600 1024 / 256 / None 200k

Reference

  1. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
  2. FastPitch: Parallel Text-to-speech with Pitch Prediction