# Kaldi-style all-in-one recipes

This repository provides [Kaldi](https://github.com/kaldi-asr/kaldi)-style recipes, in the same way as [ESPnet](https://github.com/espnet/espnet).  
Currently, the following recipes are supported.

- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): English female speaker
- [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut): Japanese female speaker
- [JSSS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss_corpus): Japanese female speaker
- [CSMSC](https://www.data-baker.com/open_source.html): Mandarin female speaker
- [CMU Arctic](http://www.festvox.org/cmu_arctic/): English speakers
- [JNAS](http://research.nii.ac.jp/src/en/JNAS.html): Japanese multi-speaker
- [VCTK](https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html): English multi-speaker
- [LibriTTS](https://arxiv.org/abs/1904.02882): English multi-speaker
- YesNo: English speaker (for debugging)


## How to run the recipe

```bash
# Let's move to the recipe directory
$ cd egs/ljspeech/voc1

# Run the recipe from scratch
$ ./run.sh

# You can change the config via the command line
$ ./run.sh --conf <your_customized_yaml_config>

# You can select the stages to start from and stop at
$ ./run.sh --stage 2 --stop_stage 2

# If you want to specify the GPU device
$ CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2

# If you want to resume training from a 10000-step checkpoint
$ ./run.sh --stage 2 --resume <path>/<to>/checkpoint-10000steps.pkl
```

You can check the command line options in `run.sh`.

Integration with job schedulers such as [slurm](https://slurm.schedmd.com/documentation.html) can be done via `cmd.sh` and `conf/slurm.conf`.  
If you want to use them, please check [this page](https://kaldi-asr.org/doc/queue.html).
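
As a minimal sketch, switching the backend might look like the following, assuming `cmd.sh` follows the usual Kaldi / ESPnet convention of choosing between `run.pl`, `queue.pl`, and `slurm.pl` (the actual variable names in your `cmd.sh` may differ):

```bash
# In cmd.sh (illustrative; check the variable names in your copy)

# Local execution (default)
export train_cmd="run.pl"
export cuda_cmd="run.pl"

# Slurm cluster: dispatch jobs through slurm.pl using conf/slurm.conf
export train_cmd="slurm.pl --config conf/slurm.conf"
export cuda_cmd="slurm.pl --gpu 1 --config conf/slurm.conf"
```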

All of the hyperparameters are written in a single yaml-format configuration file.  
Please check [this example](https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/ljspeech/voc1/conf/parallel_wavegan.v1.yaml) in the ljspeech recipe.
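
A common workflow is to copy the default config and edit only the fields you need; the customized file name below is just an example:

```bash
# Copy the default config and customize it (hypothetical file name)
cp conf/parallel_wavegan.v1.yaml conf/parallel_wavegan.tuned.yaml
vim conf/parallel_wavegan.tuned.yaml  # e.g. adjust batch_size, sampling_rate, etc.

# Then pass it to the recipe
./run.sh --conf conf/parallel_wavegan.tuned.yaml
```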

You can monitor the training progress via tensorboard.

```bash
$ tensorboard --logdir exp
```

![](https://user-images.githubusercontent.com/22779813/68100080-58bbc500-ff09-11e9-9945-c835186fd7c2.png)
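
If training runs on a remote machine, you can expose TensorBoard on all interfaces or forward its port over SSH; the port number below is arbitrary:

```bash
# On the remote machine
$ tensorboard --logdir exp --port 6006 --bind_all

# Or, from your local machine, forward the port instead
$ ssh -L 6006:localhost:6006 <user>@<remote_host>
```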

If you want to accelerate training, you can try distributed multi-GPU training based on apex.  
You need to install apex for distributed training, so please make sure it is already installed.  
Then you can run distributed multi-GPU training via the following command:

```bash
# In the case of 8 GPUs
$ CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" ./run.sh --stage 2 --n_gpus 8
```

In the case of distributed training, the batch size will be automatically multiplied by the number of GPUs, so please take this into account when setting `batch_size` in the config.
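
As a concrete illustration of that multiplication (the numbers below are only an example; check the actual `batch_size` in your config):

```bash
# Effective batch size = batch_size (in the yaml) x n_gpus
# e.g. batch_size: 6 with --n_gpus 8 -> 6 x 8 = 48 samples per training step.
# If that is too large, lower batch_size in the yaml or use fewer GPUs:
CUDA_VISIBLE_DEVICES="0,1,2,3" ./run.sh --stage 2 --n_gpus 4
```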

## How to make the recipe for your own dataset

Here, I will show how to make the recipe for your own dataset.

1. Set up your dataset in the following structure.

    ```bash
    # For single-speaker case
    $ tree /path/to/database
    /path/to/database
    β”œβ”€β”€ utt_1.wav
    β”œβ”€β”€ utt_2.wav
    β”‚   ...
    └── utt_N.wav
    # The directory can be nested, but each filename must be unique (a quick uniqueness check is sketched after this list)

    # For multi-speaker case
    $ tree /path/to/database
    /path/to/database
    β”œβ”€β”€ spk_1
    β”‚   β”œβ”€β”€ utt1.wav
    β”œβ”€β”€ spk_2
    β”‚   β”œβ”€β”€ utt1.wav
    β”‚   ...
    └── spk_N
        β”œβ”€β”€ utt1.wav
        ...
    # The directory under each speaker can be nested, but each filename in each speaker directory must be unique
    ```

2. Copy the template directory.

    ```bash
    cd egs

    # For single speaker case
    cp -r template_single_spk <your_dataset_name>

    # For multi speaker case
    cp -r template_multi_spk <your_dataset_name>

    # Move to your recipe directory
    cd <your_dataset_name>/voc1
    ```

3. Modify the options in `run.sh`.  
   At a minimum, you need to change the following in `run.sh`:
   - `db_root`: Root path of the database.
   - `num_dev`: The number of utterances for the development set.
   - `num_eval`: The number of utterances for the evaluation set.

4. Modify the hyperparameters in `conf/parallel_wavegan.v1.yaml`.  
   At a minimum, you need to change the following in the config:
    - `sampling_rate`: If you specify a sampling rate lower than that of the original audio, it will be downsampled with sox.

5. (Optional) Change the command backend in `cmd.sh`.  
   If you are not familiar with Kaldi and run everything in your local environment, you do not need to change it.  
   See more info at https://kaldi-asr.org/doc/queue.html.

6. Run your recipe.

    ```bash
    # Run all stages from the first stage
    ./run.sh

    # If you want to specify the CUDA device
    CUDA_VISIBLE_DEVICES=0 ./run.sh
    ```
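
As mentioned in step 1, a quick way to confirm that the wav basenames are unique is sketched below (assumes GNU `find`; the database path is the same illustrative one as above):

```bash
# Print any duplicated wav basenames; empty output means all names are unique
# (for the multi-speaker layout, run this per speaker directory)
find /path/to/database -name "*.wav" -printf "%f\n" | sort | uniq -d
```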

If you want to try other advanced models, please check the config files in `egs/ljspeech/voc1/conf`.
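
For example (the config name below is a placeholder; pick one that actually exists in that directory):

```bash
# From inside the recipe directory, list the available configs and train with a different one
ls conf/
./run.sh --conf conf/<chosen_config>.yaml
```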

## Run training using the ESPnet2-TTS recipe within 5 minutes

Make sure you have already finished the ESPnet2-TTS recipe experiments (at least started the training).

```bash
cd egs

# Please use the single-speaker template for both single- and multi-speaker cases
cp -r template_single_spk <recipe_name>

# Move to your recipe directory
cd <recipe_name>/voc1

# Make symlinks of the data directories (better to use absolute paths)
mkdir dump data
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw dump/
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/tr_no_dev data/train_nodev
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/dev data/dev
ln -s /path/to/espnet/egs2/<recipe_name>/tts1/dump/raw/eval1 data/eval

# Edit the config to match the TTS model settings
vim conf/parallel_wavegan.v1.yaml

# Run from stage 1
./run.sh --stage 1 --conf conf/parallel_wavegan.v1.yaml
```
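
Before launching, it may help to confirm that the symlinks resolve to the ESPnet2 data (the paths are the same illustrative ones as above):

```bash
# Follow the symlinks and list their contents; an error here indicates a broken link
ls -L dump/raw data/train_nodev data/dev data/eval
```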

That's it!