# FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Based on the script [`train_fastspeech2.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech2/train_fastspeech2.py).

## Training FastSpeech2 from scratch with the LJSpeech dataset
This example shows you how to train FastSpeech2 from scratch with TensorFlow 2, based on a custom training loop and `tf.function`. The dataset used for this example is LJSpeech, which you can download at [this link](https://keithito.com/LJ-Speech-Dataset/).

### Step 1: Create a TensorFlow-based Dataloader (tf.dataset)
First, you need to define a data loader based on the AbstractDataset class (see [`abstract_dataset.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/tensorflow_tts/datasets/abstract_dataset.py)). In this example, the dataloader reads the dataset from a path and uses file suffixes to tell character, duration, and mel-spectrogram files apart (see [`fastspeech2_dataset.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech2/fastspeech2_dataset.py)). If you already have a preprocessed version of your target dataset, you don't need to use this example dataloader; just take it as a reference and modify its **generator function** to fit your case. Normally, a generator function should return `[charactor_ids, duration, f0, energy, mel]`. Please see the Tacotron 2 example to learn how to extract durations: [Extract Duration](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/tacotron2#step-4-extract-duration-from-alignments-for-fastspeech).
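If you write your own generator, a minimal sketch could look like the code below; the file suffixes (`-ids.npy`, `-durations.npy`, ...) are assumptions for illustration, so adapt them to whatever your preprocessing step actually produces:

```python
import glob
import os

import numpy as np
import tensorflow as tf


def utterance_generator(dump_dir):
    """Yield [charactor_ids, duration, f0, energy, mel] per utterance.

    Hypothetical layout: each utterance "<id>" is stored as
    "<id>-ids.npy", "<id>-durations.npy", "<id>-f0.npy",
    "<id>-energy.npy" and "<id>-mel.npy".
    """
    for ids_path in sorted(glob.glob(os.path.join(dump_dir, "*-ids.npy"))):
        prefix = ids_path[: -len("-ids.npy")]
        charactor_ids = np.load(prefix + "-ids.npy")   # [num_chars]
        duration = np.load(prefix + "-durations.npy")  # [num_chars]
        f0 = np.load(prefix + "-f0.npy")               # frame- or char-level
        energy = np.load(prefix + "-energy.npy")       # depends on preprocessing
        mel = np.load(prefix + "-mel.npy")             # [num_frames, num_mels]
        yield charactor_ids, duration, f0, energy, mel


# Wrap the generator in a tf.data pipeline, matching the item order above.
dataset = tf.data.Dataset.from_generator(
    lambda: utterance_generator("./dump/train"),
    output_types=(tf.int32, tf.int32, tf.float32, tf.float32, tf.float32),
)
```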

### Step 2: Training from scratch
After you redefine your dataloader, please modify the input arguments `train_dataset` and `valid_dataset` in [`train_fastspeech2.py`](https://github.com/dathudeptrai/TensorflowTTS/tree/master/examples/fastspeech2/train_fastspeech2.py). Here is an example command line for training FastSpeech2 from scratch:

```bash
CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/train_fastspeech2.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
  --outdir ./examples/fastspeech2/exp/train.fastspeech2.v1/ \
  --config ./examples/fastspeech2/conf/fastspeech2.v1.yaml \
  --use-norm 1 \
  --f0-stat ./dump/stats_f0.npy \
  --energy-stat ./dump/stats_energy.npy \
  --mixed_precision 1 \
  --resume ""
```

If you want to use multiple GPUs for training, you can replace `CUDA_VISIBLE_DEVICES=0` with, for example, `CUDA_VISIBLE_DEVICES=0,1,2,3`. You also need to tune the `batch_size` for each GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.

If you want to resume training, pass the checkpoint path via `--resume`, as in the following example:

```bash
--resume ./examples/fastspeech2/exp/train.fastspeech2.v1/checkpoints/ckpt-100000
```

If you want to fine-tune a model, pass your model filename via `--pretrained`:
```bash
--pretrained pretrained.h5
```

You can also define `var_train_expr` in the config file to train only some layers of the model, for example when you want to fine-tune on your own dataset with the same pretrained language and processor. For example, `var_train_expr: "embeddings|encoder|decoder"` means only the variables whose names contain `embeddings`, `encoder`, or `decoder` are trained.
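Under the hood this amounts to a regular-expression filter over the model's trainable variables; a minimal sketch of the idea (not the repository's exact implementation) looks like this:

```python
import re


def filter_trainable_variables(model, var_train_expr):
    """Return only the variables whose names match the regex,
    e.g. var_train_expr = "embeddings|encoder|decoder"."""
    pattern = re.compile(var_train_expr)
    return [v for v in model.trainable_variables if pattern.search(v.name)]


# In the training step, compute and apply gradients only for these:
# trainable_vars = filter_trainable_variables(model, "embeddings|encoder|decoder")
# gradients = tape.gradient(loss, trainable_vars)
# optimizer.apply_gradients(zip(gradients, trainable_vars))
```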


### Step 3: Decode mel-spectrograms from a folder of ids

```bash
CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/decode_fastspeech2.py \
  --rootdir ./dump/valid \
  --outdir ./predictions/fastspeech2.v1/ \
  --config ./examples/fastspeech2/conf/fastspeech2.v1.yaml \
  --checkpoint ./examples/fastspeech2/checkpoints/model-150000.h5 \
  --batch-size 8
```

## What's different?

* It's not easy for the model to learn to predict f0/energy at the mel-frame level as the paper does. Instead, f0/energy are averaged over each character's duration to obtain character-level values, which are then added to the encoder hidden states before passing through the Length-Regulator (see the sketch after this list).
* Mean/std normalization is applied to both f0 and energy. Note that all outliers are removed from f0 and energy before the mean and std values are computed over the training set.
* Instead of using 256 bins for f0 and energy as in the FastSpeech 2 paper, the model learns to predict the real f0/energy values, which are then passed through a one-layer Conv1D with kernel_size 9 that projects each f0/energy scalar to a vector, as the **[FastPitch](https://arxiv.org/abs/2006.06873)** paper suggests.
* There are other modifications that make it work; read the code carefully to make sure you don't miss anything :D.
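
As a concrete illustration of the first and third points above, here is a minimal NumPy/TensorFlow sketch of averaging frame-level f0 over character durations and projecting the resulting scalars to vectors with a single Conv1D; the shapes, names, and `hidden_size` are illustrative assumptions, not this repository's exact code:

```python
import numpy as np
import tensorflow as tf


def average_by_duration(values, durations):
    """Average frame-level values (f0 or energy) over each character's
    duration, giving one value per character.

    values:    [num_frames] frame-level f0/energy
    durations: [num_chars]  frames per character, summing to num_frames
    returns:   [num_chars]  character-level averages
    """
    offsets = np.concatenate([[0], np.cumsum(durations)])
    out = np.zeros(len(durations), dtype=np.float32)
    for i, (start, end) in enumerate(zip(offsets[:-1], offsets[1:])):
        seg = values[start:end]
        voiced = seg[seg > 0.0]  # average only voiced/nonzero frames
        out[i] = voiced.mean() if len(voiced) > 0 else 0.0
    return out


# Scalar -> vector projection in the FastPitch style: one Conv1D with
# kernel_size 9 maps each per-character scalar to a hidden-size vector
# that can be added to the encoder hidden states.
hidden_size = 384  # illustrative; use the model's actual hidden size
f0_embedding = tf.keras.layers.Conv1D(
    filters=hidden_size, kernel_size=9, padding="same")

frame_f0 = np.random.rand(100).astype(np.float32)      # fake f0 track
durations = np.array([30, 40, 30])                     # 3 characters
char_f0 = average_by_duration(frame_f0, durations)     # [3]
f0_vectors = f0_embedding(char_f0[np.newaxis, :, np.newaxis])  # [1, 3, 384]
```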

## Pretrained Models and Audio samples
| Model                                                                                                          | Conf                                                                                                                        | Lang  | Fs [Hz] | Mel range [Hz] | FFT / Hop / Win [pt] | # iters |
| :------                                                                                                        | :---:                                                                                                                       | :---: | :----:  | :--------:     | :---------------:    | :-----: |
| [fastspeech2.v1](https://drive.google.com/drive/folders/158vFyC2pxw9xKdxp-C5WPEtgtUiWZYE0?usp=sharing)             | [link](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2/conf/fastspeech2.v1.yaml)          | EN    | 22.05k  | 80-7600        | 1024 / 256 / None    | 150k    |
| [fastspeech2.kss.v1](https://drive.google.com/drive/folders/1DU952--jVnJ5SZDSINRs7dVVSpdB7tC_?usp=sharing)             | [link](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2/conf/fastspeech2.kss.v1.yaml)          | KO    | 22.05k  | 80-7600        | 1024 / 256 / None    | 200k    |
| [fastspeech2.kss.v2](https://drive.google.com/drive/folders/1G3-AJnEsu2rYXYgo2iGIVJfCqqfbpwMu?usp=sharing)             | [link](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2/conf/fastspeech2.kss.v2.yaml)          | KO    | 22.05k  | 80-7600        | 1024 / 256 / None    | 200k    |

## Reference

1. [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
2. [FastPitch: Parallel Text-to-speech with Pitch Prediction](https://arxiv.org/abs/2006.06873)