Siddhant committed on
Commit
4efbc73
1 Parent(s): 3730bbb

import from zenodo

Files changed (18)
  1. README.md +50 -0
  2. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/config.yaml +290 -0
  3. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/backward_time.png +0 -0
  4. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/duration_loss.png +0 -0
  5. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/energy_loss.png +0 -0
  6. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/forward_time.png +0 -0
  7. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/iter_time.png +0 -0
  8. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/l1_loss.png +0 -0
  9. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/loss.png +0 -0
  10. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/lr_0.png +0 -0
  11. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/optim_step_time.png +0 -0
  12. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/pitch_loss.png +0 -0
  13. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/train_time.png +0 -0
  14. exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/train.loss.ave_5best.pth +3 -0
  15. exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/energy_stats.npz +0 -0
  16. exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/feats_stats.npz +0 -0
  17. exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/pitch_stats.npz +0 -0
  18. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,50 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - text-to-speech
+ language: en
+ datasets:
+ - vctk
+ license: cc-by-4.0
+ ---
+ ## Example ESPnet2 TTS model
+ ### `kan-bayashi/vctk_gst_conformer_fastspeech2`
+ ♻️ Imported from https://zenodo.org/record/4036264/
+
+ This model was trained by kan-bayashi using vctk/tts1 recipe in [espnet](https://github.com/espnet/espnet/).
+ ### Demo: How to use in ESPnet2
+ ```python
+ # coming soon
+ ```
+ ### Citing ESPnet
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ title={{ESPnet}: End-to-End Speech Processing Toolkit},
+ year={2018},
+ booktitle={Proceedings of Interspeech},
+ pages={2207--2211},
+ doi={10.21437/Interspeech.2018-1456},
+ url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+ @inproceedings{hayashi2020espnet,
+ title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
+ author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
+ booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+ pages={7654--7658},
+ year={2020},
+ organization={IEEE}
+ }
+ ```
+ or arXiv:
+ ```bibtex
+ @misc{watanabe2018espnet,
+ title={ESPnet: End-to-End Speech Processing Toolkit},
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Enrique Yalta Soplin and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ year={2018},
+ eprint={1804.00015},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/config.yaml ADDED
@@ -0,0 +1,290 @@
+ config: conf/tuning/train_gst_conformer_fastspeech2.yaml
+ print_config: false
+ log_level: INFO
+ dry_run: false
+ iterator_type: sequence
+ output_dir: exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space
+ ngpu: 1
+ seed: 0
+ num_workers: 1
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: true
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 1000
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - valid
+   - loss
+   - min
+ - - train
+   - loss
+   - min
+ keep_nbest_models: 5
+ grad_clip: 1.0
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 10
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: null
+ pretrain_path: []
+ pretrain_key: []
+ num_iters_per_epoch: 500
+ batch_size: 20
+ valid_batch_size: null
+ batch_bins: 2400000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/text_shape.phn
+ - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/speech_shape
+ valid_shape_file:
+ - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/valid/text_shape.phn
+ - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/valid/speech_shape
+ batch_type: numel
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 240000
+ sort_in_batch: descending
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ train_data_path_and_name_and_type:
+ - - dump/raw/tr_no_dev/text
+   - text
+   - text
+ - - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/tr_no_dev/durations
+   - durations
+   - text_int
+ - - dump/raw/tr_no_dev/wav.scp
+   - speech
+   - sound
+ - - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/collect_feats/pitch.scp
+   - pitch
+   - npy
+ - - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/collect_feats/energy.scp
+   - energy
+   - npy
+ valid_data_path_and_name_and_type:
+ - - dump/raw/dev/text
+   - text
+   - text
+ - - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/dev/durations
+   - durations
+   - text_int
+ - - dump/raw/dev/wav.scp
+   - speech
+   - sound
+ - - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/valid/collect_feats/pitch.scp
+   - pitch
+   - npy
+ - - exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/valid/collect_feats/energy.scp
+   - energy
+   - npy
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ valid_max_cache_size: null
+ optim: adam
+ optim_conf:
+   lr: 1.0
+ scheduler: noamlr
+ scheduler_conf:
+   model_size: 384
+   warmup_steps: 4000
+ token_list:
+ - <blank>
+ - <unk>
+ - OY0
+ - ''''
+ - OY2
+ - ER2
+ - UH0
+ - '!'
+ - EY0
+ - AW0
+ - AA0
+ - UH2
+ - UW2
+ - AY0
+ - AO2
+ - AO0
+ - AE2
+ - AH2
+ - AE0
+ - AA2
+ - IY2
+ - EH0
+ - AW2
+ - ZH
+ - AY2
+ - OY1
+ - IH2
+ - UW0
+ - EY2
+ - EH2
+ - OW2
+ - OW0
+ - '?'
+ - CH
+ - ER1
+ - TH
+ - UH1
+ - AW1
+ - JH
+ - Y
+ - SH
+ - NG
+ - ','
+ - G
+ - OW1
+ - AO1
+ - IY0
+ - UW1
+ - EY1
+ - AY1
+ - HH
+ - F
+ - ER0
+ - V
+ - P
+ - B
+ - AH1
+ - IY1
+ - IH0
+ - AA1
+ - EH1
+ - AE1
+ - M
+ - W
+ - K
+ - DH
+ - Z
+ - .
+ - L
+ - D
+ - IH1
+ - R
+ - S
+ - N
+ - T
+ - AH0
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: tacotron
+ g2p: g2p_en_no_space
+ feats_extract: fbank
+ feats_extract_conf:
+   fs: 24000
+   fmin: 80
+   fmax: 7600
+   n_mels: 80
+   hop_length: 300
+   n_fft: 2048
+   win_length: 1200
+ normalize: global_mvn
+ normalize_conf:
+   stats_file: exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/feats_stats.npz
+ tts: fastspeech2
+ tts_conf:
+   adim: 384
+   aheads: 2
+   elayers: 4
+   eunits: 1536
+   dlayers: 4
+   dunits: 1536
+   positionwise_layer_type: conv1d
+   positionwise_conv_kernel_size: 3
+   duration_predictor_layers: 2
+   duration_predictor_chans: 256
+   duration_predictor_kernel_size: 3
+   postnet_layers: 5
+   postnet_filts: 5
+   postnet_chans: 256
+   use_masking: true
+   encoder_normalize_before: false
+   decoder_normalize_before: false
+   reduction_factor: 1
+   encoder_type: conformer
+   decoder_type: conformer
+   conformer_pos_enc_layer_type: rel_pos
+   conformer_self_attn_layer_type: rel_selfattn
+   conformer_activation_type: swish
+   use_macaron_style_in_conformer: true
+   use_cnn_in_conformer: true
+   conformer_enc_kernel_size: 7
+   conformer_dec_kernel_size: 31
+   init_type: xavier_uniform
+   transformer_enc_dropout_rate: 0.2
+   transformer_enc_positional_dropout_rate: 0.2
+   transformer_enc_attn_dropout_rate: 0.2
+   transformer_dec_dropout_rate: 0.2
+   transformer_dec_positional_dropout_rate: 0.2
+   transformer_dec_attn_dropout_rate: 0.2
+   pitch_predictor_layers: 5
+   pitch_predictor_chans: 256
+   pitch_predictor_kernel_size: 5
+   pitch_predictor_dropout: 0.5
+   pitch_embed_kernel_size: 1
+   pitch_embed_dropout: 0.0
+   stop_gradient_from_pitch_predictor: true
+   energy_predictor_layers: 2
+   energy_predictor_chans: 256
+   energy_predictor_kernel_size: 3
+   energy_predictor_dropout: 0.5
+   energy_embed_kernel_size: 1
+   energy_embed_dropout: 0.0
+   stop_gradient_from_energy_predictor: false
+   use_gst: true
+   gst_heads: 8
+   gst_tokens: 128
+ pitch_extract: dio
+ pitch_extract_conf:
+   fs: 24000
+   n_fft: 2048
+   hop_length: 300
+   f0max: 400
+   f0min: 80
+ pitch_normalize: global_mvn
+ pitch_normalize_conf:
+   stats_file: exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/pitch_stats.npz
+ energy_extract: energy
+ energy_extract_conf:
+   fs: 24000
+   n_fft: 2048
+   hop_length: 300
+   win_length: 1200
+ energy_normalize: global_mvn
+ energy_normalize_conf:
+   stats_file: exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/energy_stats.npz
+ required:
+ - output_dir
+ - token_list
+ distributed: false
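
Note on the optimizer settings in this config: `lr: 1.0` looks large for Adam, but with `scheduler: noamlr` it acts as a scale factor rather than the step size actually applied. The sketch below assumes the scheduler follows the standard Noam formula (linear warmup, then inverse square-root decay). The function name `noam_lr` is hypothetical and not part of ESPnet. With `model_size: 384` and `warmup_steps: 4000`, the rate should peak at roughly 8.1e-4 at step 4000.

```python
def noam_lr(step: int,
            base_lr: float = 1.0,
            model_size: int = 384,
            warmup_steps: int = 4000) -> float:
    """Learning rate at a 1-based optimizer step under the Noam schedule.

    Defaults mirror optim_conf.lr and scheduler_conf in the config above.
    Assumption: noamlr uses the standard formula
        base_lr * model_size**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * model_size ** -0.5 * min(step ** -0.5,
                                              step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # Warmup (rising), peak at warmup_steps, then decay.
    for step in (1, 1000, 4000, 16000):
        print(f"step {step:>6}: lr = {noam_lr(step):.6e}")
```

Under this formula the two arguments of `min` are equal exactly at `step == warmup_steps`, which is where the rate peaks.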
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/backward_time.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/duration_loss.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/energy_loss.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/forward_time.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/iter_time.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/l1_loss.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/loss.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/lr_0.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/optim_step_time.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/pitch_loss.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/train_time.png ADDED
exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/train.loss.ave_5best.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f00de50392c36b7614808d6b265524efc451c9d9bd819e005686d62cfb9bdd5a
+ size 284293523
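
The three lines added for `train.loss.ave_5best.pth` are a Git LFS pointer, not the checkpoint itself: the repository stores only the spec version, the SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer follows. The helper name `parse_lfs_pointer` is hypothetical, and the parser assumes the simple `key value` line layout shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its fields.

    Assumes each non-empty line is "<key> <value>", as in the diff above.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from the diff (without the leading "+" markers).
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f00de50392c36b7614808d6b265524efc451c9d9bd819e005686d62cfb9bdd5a
size 284293523
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], f"{info['size_bytes'] / 2**20:.1f} MiB")
```

The size field indicates a checkpoint of about 271 MiB. After download, the file's SHA-256 digest can be compared against `digest` to confirm integrity.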
exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/energy_stats.npz ADDED
Binary file (770 Bytes)
 
exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/feats_stats.npz ADDED
Binary file (1.4 kB)
 
exp/tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/pitch_stats.npz ADDED
Binary file (770 Bytes)
 
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: 0.8.0
+ files:
+   model_file: exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/train.loss.ave_5best.pth
+ python: "3.7.3 (default, Mar 27 2019, 22:11:17) \n[GCC 7.3.0]"
+ timestamp: 1600432964.337782
+ torch: 1.6.0
+ yaml_files:
+   train_config: exp/tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/config.yaml
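
The `timestamp` field in `meta.yaml` is a bare float. Assuming it holds Unix epoch seconds (the file itself does not state this), it can be turned into a readable UTC date with the standard library, as sketched here:

```python
from datetime import datetime, timezone

# Value copied from meta.yaml; assumed to be seconds since the Unix epoch.
TIMESTAMP = 1600432964.337782

packed_at = datetime.fromtimestamp(TIMESTAMP, tz=timezone.utc)
print(packed_at.isoformat())
```

Under that assumption the model was packed on 18 September 2020, which fits the `espnet: 0.8.0` and `torch: 1.6.0` versions recorded alongside it.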