Kaldi-style all-in-one recipes

This repository provides Kaldi-style recipes, in the same manner as ESPnet.
Currently, the following recipes are supported.

LJSpeech: English female speaker
JSUT: Japanese female speaker
JSSS: Japanese female speaker
CSMSC: Mandarin female speaker
CMU Arctic: English speakers
JNAS: Japanese multi-speaker
VCTK: English multi-speaker
LibriTTS: English multi-speaker
YesNo: English speaker (For debugging)

How to run the recipe

# Move to the recipe directory
$ cd egs/ljspeech/voc1

# Run the recipe from scratch
$ ./run.sh

# You can change config via command line
$ ./run.sh --conf <your_customized_yaml_config>

# You can select the stage to start and stop
$ ./run.sh --stage 2 --stop_stage 2

# If you want to specify the gpu
$ CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2

# If you want to resume training from the 10000-step checkpoint
$ ./run.sh --stage 2 --resume <path>/<to>/checkpoint-10000steps.pkl

You can check the command line options in run.sh.
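
When resuming, it can be convenient to pick the newest checkpoint automatically instead of typing the step count. The following is a minimal sketch, not part of the recipe: latest_checkpoint is a hypothetical helper, and it assumes checkpoints are named checkpoint-<N>steps.pkl as in the resume example above.

```shell
# Hypothetical helper (not part of the recipe): print the checkpoint with the
# largest step count found anywhere under the given directory.
# Assumes files are named checkpoint-<N>steps.pkl.
latest_checkpoint() {
    find "$1" -type f -name 'checkpoint-*steps.pkl' \
        | awk -F/ '{
            # Extract the step count <N> from the file name.
            name = $NF
            sub(/^checkpoint-/, "", name)
            sub(/steps\.pkl$/, "", name)
            # Emit "<N><TAB><full path>" so we can sort numerically.
            print name "\t" $0
          }' \
        | sort -n \
        | tail -n 1 \
        | cut -f2-
}
```

Assuming the checkpoints are stored somewhere under exp (the directory passed to tensorboard below), ./run.sh --stage 2 --resume "$(latest_checkpoint exp)" would resume from the newest one.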

Integration with job schedulers such as Slurm is configured via cmd.sh and conf/slurm.conf.
If you want to use it, please check this page.

All of the hyperparameters are written in a single YAML configuration file.
Please check this example in the ljspeech recipe.

You can monitor the training progress via tensorboard.

$ tensorboard --logdir exp

If you want to accelerate training, you can try distributed multi-GPU training based on apex.
Distributed training requires apex, so please make sure it is installed first.
Then you can run distributed multi-GPU training with the following command:

# in the case of the number of gpus = 8
$ CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" ./run.sh --stage 2 --n_gpus 8

In distributed training, the batch size is automatically multiplied by the number of GPUs.
Keep this in mind when choosing the batch size in the config.
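
As a quick sanity check, the effective batch size can be worked out as below. This is only an arithmetic sketch: effective_batch_size is a hypothetical helper, and the value 6 is an illustrative batch size, not one taken from the recipe's config.

```shell
# Hypothetical helper: effective batch size = configured batch size x number of GPUs.
effective_batch_size() {
    echo $(( $1 * $2 ))
}

# Illustrative values: a configured batch size of 6 and 8 GPUs,
# as in the distributed training command above.
effective_batch_size 6 8   # prints 48
```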

How to make the recipe for your own dataset

Here, I will show how to make the recipe for your own dataset.

Set up your dataset with the following structure.

# For single-speaker case
$ tree /path/to/database
/path/to/database
├── utt_1.wav
├── utt_2.wav
│   ...
└── utt_N.wav
# The directory can be nested, but each filename must be unique

# For multi-speaker case
$ tree /path/to/database
/path/to/database
├── spk_1
│   ├── utt1.wav
├── spk_2
│   ├── utt1.wav
│   ...
└── spk_N
    ├── utt1.wav
    ...
# The directory under each speaker can be nested, but each filename in each speaker directory must be unique
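
The uniqueness requirement above can be checked before running the recipe. The following is a minimal sketch under the layout shown above; both function names are hypothetical helpers and are not part of the repository.

```shell
# Hypothetical helper: print every .wav file name that occurs more than once
# anywhere under the given directory (single-speaker layout).
duplicate_wav_names() {
    find "$1" -type f -name '*.wav' \
        | awk -F/ '{ print $NF }' \
        | sort \
        | uniq -d
}

# Hypothetical helper: run the same check separately inside each speaker
# directory (multi-speaker layout). The same file name may appear under
# different speakers; only repeats within one speaker are reported.
duplicate_wav_names_per_speaker() {
    for spk_dir in "$1"/*/; do
        [ -d "$spk_dir" ] || continue
        duplicate_wav_names "$spk_dir" \
            | awk -v spk="$(basename "$spk_dir")" '{ print spk ": " $0 }'
    done
}
```

Run duplicate_wav_names /path/to/database for the single-speaker case, or duplicate_wav_names_per_speaker /path/to/database for the multi-speaker case. Empty output means no conflicting file names were found.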

Copy the template directory.

cd egs

# For single speaker case
cp -r template_single_spk <your_dataset_name>

# For multi speaker case
cp -r template_multi_spk <your_dataset_name>

# Move to your recipe (you are already inside egs)
cd <your_dataset_name>/voc1

Modify the options in run.sh.
At a minimum, you need to change the following options in run.sh:
- db_root: Root path of the database.
- num_dev: The number of utterances for the development set.
- num_eval: The number of utterances for the evaluation set.

Modify the hyperparameters in conf/parallel_wavegan.v1.yaml.
At a minimum, you need to change the following in the config:
- sampling_rate: If you specify a sampling rate lower than that of the original audio, the audio will be downsampled by sox.

(Optional) Change the command backend in cmd.sh.
If you are not familiar with Kaldi and run in your local environment, you do not need to change it.
See https://kaldi-asr.org/doc/queue.html for more information.
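
Before editing run.sh, it may help to confirm that the chosen num_dev and num_eval leave utterances for training. The sketch below rests on an assumption: that the utterances not assigned to the development or evaluation set are used for training. check_split_sizes is a hypothetical helper, not part of the recipe.

```shell
# Hypothetical helper: count the .wav files under db_root and report how many
# utterances remain for training after num_dev and num_eval are set aside.
# Assumes the remaining utterances form the training set.
check_split_sizes() {
    db_root="$1"
    num_dev="$2"
    num_eval="$3"
    # Wrap in arithmetic expansion to strip any padding from wc output.
    num_utts=$(( $(find "$db_root" -type f -name '*.wav' | wc -l) ))
    num_train=$(( num_utts - num_dev - num_eval ))
    if [ "$num_train" -le 0 ]; then
        echo "error: only $num_utts utterances, nothing left for training" >&2
        return 1
    fi
    echo "train=$num_train dev=$num_dev eval=$num_eval"
}
```

For example, check_split_sizes /path/to/database 100 100 prints the resulting split sizes, or fails if the database is too small for those values. The value 100 is illustrative only.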

Run your recipe.

# Run all stages from the first stage
./run.sh

# If you want to specify CUDA device
CUDA_VISIBLE_DEVICES=0 ./run.sh

If you want to try other, more advanced models, please check the config files in egs/ljspeech/voc1/conf.

Run training using ESPnet2-TTS recipe within 5 minutes

Make sure you have already finished the ESPnet2-TTS recipe experiments (or at least started the training).

cd egs

# Please use the single-speaker template for both the single- and multi-speaker cases
cp -r template_single_spk <recipe_name>

# Move to your recipe (you are already inside egs)
cd <recipe_name>/voc1

# Make symlinks to the data directories (absolute paths are recommended)
mkdir dump data
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw dump/
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/tr_no_dev data/train_nodev
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/dev data/dev
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/eval1 data/eval

# Edit config to match TTS model setting
vim conf/parallel_wavegan.v1.yaml

# Run from stage 1
./run.sh --stage 1 --conf conf/parallel_wavegan.v1.yaml
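
A broken symlink is an easy mistake in the steps above, especially with relative paths. The following minimal sketch verifies that each link resolves to an existing target; check_links is a hypothetical helper, not part of the recipe.

```shell
# Hypothetical helper: verify that each argument is a symlink whose target exists.
# Returns non-zero if any link is missing or broken.
check_links() {
    status=0
    for link in "$@"; do
        if [ -L "$link" ] && [ -e "$link" ]; then
            echo "ok: $link -> $(readlink "$link")"
        else
            echo "broken or missing: $link" >&2
            status=1
        fi
    done
    return "$status"
}
```

From the recipe directory, check_links dump/raw data/train_nodev data/dev data/eval covers the four links created above.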

That's it!