Siddhant committed on
Commit
ee879b9
1 Parent(s): 9dfcc7b

import from zenodo

Files changed (18)
  1. README.md +50 -0
  2. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/config.yaml +417 -0
  3. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/backward_time.png +0 -0
  4. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/duration_loss.png +0 -0
  5. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/energy_loss.png +0 -0
  6. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/forward_time.png +0 -0
  7. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/iter_time.png +0 -0
  8. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/l1_loss.png +0 -0
  9. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/loss.png +0 -0
  10. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/lr_0.png +0 -0
  11. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/optim_step_time.png +0 -0
  12. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/pitch_loss.png +0 -0
  13. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/train_time.png +0 -0
  14. exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/train.loss.ave_5best.pth +3 -0
  15. exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/energy_stats.npz +0 -0
  16. exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/feats_stats.npz +0 -0
  17. exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/pitch_stats.npz +0 -0
  18. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,50 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - text-to-speech
+ language: zh
+ datasets:
+ - csmsc
+ license: cc-by-4.0
+ ---
+ ## Example ESPnet2 TTS model
+ ### `kan-bayashi/csmsc_conformer_fastspeech2`
+ ♻️ Imported from https://zenodo.org/record/4031955/
+
+ This model was trained by kan-bayashi using csmsc/tts1 recipe in [espnet](https://github.com/espnet/espnet/).
+ ### Demo: How to use in ESPnet2
+ ```python
+ # coming soon
+ ```
+ ### Citing ESPnet
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   title={{ESPnet}: End-to-End Speech Processing Toolkit},
+   year={2018},
+   booktitle={Proceedings of Interspeech},
+   pages={2207--2211},
+   doi={10.21437/Interspeech.2018-1456},
+   url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+ @inproceedings{hayashi2020espnet,
+   title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
+   author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
+   booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+   pages={7654--7658},
+   year={2020},
+   organization={IEEE}
+ }
+ ```
+ or arXiv:
+ ```bibtex
+ @misc{watanabe2018espnet,
+   title={ESPnet: End-to-End Speech Processing Toolkit},
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Enrique Yalta Soplin and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   year={2018},
+   eprint={1804.00015},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/config.yaml ADDED
@@ -0,0 +1,417 @@
+ config: conf/tuning/train_conformer_fastspeech2.yaml
+ print_config: false
+ log_level: INFO
+ dry_run: false
+ iterator_type: sequence
+ output_dir: exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone
+ ngpu: 1
+ seed: 0
+ num_workers: 1
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: true
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 1000
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - valid
+     - loss
+     - min
+ -   - train
+     - loss
+     - min
+ keep_nbest_models: 5
+ grad_clip: 1.0
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 10
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: null
+ pretrain_path: []
+ pretrain_key: []
+ num_iters_per_epoch: 500
+ batch_size: 20
+ valid_batch_size: null
+ batch_bins: 2400000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/text_shape.phn
+ - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/speech_shape
+ valid_shape_file:
+ - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/valid/text_shape.phn
+ - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/valid/speech_shape
+ batch_type: numel
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 240000
+ sort_in_batch: descending
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ train_data_path_and_name_and_type:
+ -   - dump/raw/tr_no_dev/text
+     - text
+     - text
+ -   - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/tr_no_dev/durations
+     - durations
+     - text_int
+ -   - dump/raw/tr_no_dev/wav.scp
+     - speech
+     - sound
+ -   - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/collect_feats/pitch.scp
+     - pitch
+     - npy
+ -   - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/collect_feats/energy.scp
+     - energy
+     - npy
+ valid_data_path_and_name_and_type:
+ -   - dump/raw/dev/text
+     - text
+     - text
+ -   - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/dev/durations
+     - durations
+     - text_int
+ -   - dump/raw/dev/wav.scp
+     - speech
+     - sound
+ -   - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/valid/collect_feats/pitch.scp
+     - pitch
+     - npy
+ -   - exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/valid/collect_feats/energy.scp
+     - energy
+     - npy
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ valid_max_cache_size: null
+ optim: adam
+ optim_conf:
+     lr: 1.0
+ scheduler: noamlr
+ scheduler_conf:
+     model_size: 384
+     warmup_steps: 4000
+ token_list:
+ - <blank>
+ - <unk>
+ - "\uFF30"
+ - "\uFF22"
+ - "\xFC"
+ - an
+ - ueng3
+ - '2'
+ - uen
+ - ei
+ - ua
+ - ao
+ - u
+ - ueng4
+ - uo
+ - ang
+ - ou
+ - v2
+ - ueng1
+ - o
+ - io1
+ - "\xFCn3"
+ - er
+ - ve4
+ - o3
+ - uai2
+ - uen3
+ - uen1
+ - uai3
+ - "\xFCe3"
+ - iou1
+ - iong2
+ - ia2
+ - uai1
+ - iong1
+ - "\xFCan1"
+ - "\xFCe1"
+ - v4
+ - ua3
+ - ia
+ - iong3
+ - uei3
+ - ua2
+ - ia3
+ - uei1
+ - o1
+ - o4
+ - "\xFCn2"
+ - un2
+ - er3
+ - "\xFCn1"
+ - uen4
+ - un3
+ - iu1
+ - "\xFCn4"
+ - uen2
+ - "\xFCan3"
+ - un4
+ - "\xFCan4"
+ - iu3
+ - ua1
+ - uei2
+ - "\uFF01"
+ - iou4
+ - iou2
+ - er4
+ - o2
+ - ei1
+ - iao2
+ - uang4
+ - "\xFC1"
+ - ui2
+ - v3
+ - uang2
+ - iong4
+ - un1
+ - ui1
+ - ua4
+ - ao2
+ - en
+ - a
+ - iu2
+ - uang1
+ - uang3
+ - "\xFCe2"
+ - in3
+ - "\uFF1F"
+ - uai4
+ - "\xFCe4"
+ - uan2
+ - ou2
+ - eng3
+ - ui3
+ - uan4
+ - a2
+ - ie2
+ - ong3
+ - iang2
+ - ie1
+ - in4
+ - iao1
+ - e1
+ - in2
+ - en4
+ - uan3
+ - "\xFC2"
+ - ing3
+ - i
+ - ei2
+ - ei3
+ - iang1
+ - er2
+ - ia4
+ - uo2
+ - "\xFC3"
+ - uan1
+ - ia1
+ - e3
+ - ong4
+ - ie4
+ - ai1
+ - en3
+ - iang3
+ - eng4
+ - iang4
+ - ao1
+ - ou1
+ - ang2
+ - ai3
+ - iu4
+ - "\xFCan2"
+ - ang3
+ - en1
+ - ong2
+ - uei4
+ - ei4
+ - iao3
+ - "\xFC4"
+ - an2
+ - ing4
+ - an3
+ - a3
+ - ie3
+ - an1
+ - ian3
+ - uo1
+ - ing1
+ - ou4
+ - ian1
+ - ou3
+ - eng1
+ - ang1
+ - in1
+ - a4
+ - eng2
+ - uo4
+ - u1
+ - ang4
+ - iou3
+ - iao4
+ - ian2
+ - u2
+ - ui4
+ - e2
+ - en2
+ - u3
+ - ing2
+ - ao4
+ - ong1
+ - an4
+ - ai2
+ - ao3
+ - uo3
+ - ian4
+ - p
+ - c
+ - a1
+ - ai4
+ - e4
+ - s
+ - k
+ - r
+ - i2
+ - f
+ - n
+ - u4
+ - ch
+ - i3
+ - i1
+ - q
+ - z
+ - m
+ - t
+ - g
+ - b
+ - e
+ - h
+ - i4
+ - x
+ - "\uFF0C"
+ - zh
+ - "\u3002"
+ - l
+ - j
+ - sh
+ - d
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: null
+ g2p: pypinyin_g2p_phone
+ feats_extract: fbank
+ feats_extract_conf:
+     fs: 24000
+     fmin: 80
+     fmax: 7600
+     n_mels: 80
+     hop_length: 300
+     n_fft: 2048
+     win_length: 1200
+ normalize: global_mvn
+ normalize_conf:
+     stats_file: exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/feats_stats.npz
+ tts: fastspeech2
+ tts_conf:
+     adim: 384
+     aheads: 2
+     elayers: 4
+     eunits: 1536
+     dlayers: 4
+     dunits: 1536
+     positionwise_layer_type: conv1d
+     positionwise_conv_kernel_size: 3
+     duration_predictor_layers: 2
+     duration_predictor_chans: 256
+     duration_predictor_kernel_size: 3
+     postnet_layers: 5
+     postnet_filts: 5
+     postnet_chans: 256
+     use_masking: true
+     encoder_normalize_before: false
+     decoder_normalize_before: false
+     reduction_factor: 1
+     encoder_type: conformer
+     decoder_type: conformer
+     conformer_pos_enc_layer_type: rel_pos
+     conformer_self_attn_layer_type: rel_selfattn
+     conformer_activation_type: swish
+     use_macaron_style_in_conformer: true
+     use_cnn_in_conformer: true
+     conformer_enc_kernel_size: 7
+     conformer_dec_kernel_size: 31
+     init_type: xavier_uniform
+     transformer_enc_dropout_rate: 0.2
+     transformer_enc_positional_dropout_rate: 0.2
+     transformer_enc_attn_dropout_rate: 0.2
+     transformer_dec_dropout_rate: 0.2
+     transformer_dec_positional_dropout_rate: 0.2
+     transformer_dec_attn_dropout_rate: 0.2
+     pitch_predictor_layers: 5
+     pitch_predictor_chans: 256
+     pitch_predictor_kernel_size: 5
+     pitch_predictor_dropout: 0.5
+     pitch_embed_kernel_size: 1
+     pitch_embed_dropout: 0.0
+     stop_gradient_from_pitch_predictor: true
+     energy_predictor_layers: 2
+     energy_predictor_chans: 256
+     energy_predictor_kernel_size: 3
+     energy_predictor_dropout: 0.5
+     energy_embed_kernel_size: 1
+     energy_embed_dropout: 0.0
+     stop_gradient_from_energy_predictor: false
+ pitch_extract: dio
+ pitch_extract_conf:
+     fs: 24000
+     n_fft: 2048
+     hop_length: 300
+     f0max: 400
+     f0min: 80
+ pitch_normalize: global_mvn
+ pitch_normalize_conf:
+     stats_file: exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/pitch_stats.npz
+ energy_extract: energy
+ energy_extract_conf:
+     fs: 24000
+     n_fft: 2048
+     hop_length: 300
+     win_length: 1200
+ energy_normalize: global_mvn
+ energy_normalize_conf:
+     stats_file: exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/energy_stats.npz
+ required:
+ - output_dir
+ - token_list
+ distributed: false
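Note on the optimizer settings in this config: `lr: 1.0` looks large, but with `scheduler: noamlr` it acts as a scale factor and is not the step size itself. The sketch below shows what the three values `lr`, `model_size`, and `warmup_steps` imply, assuming `noamlr` follows the standard Noam schedule (linear warmup, then inverse-square-root decay). The function name `noam_lr` is hypothetical and this is not ESPnet's own code.

```python
import math


def noam_lr(step: int, base_lr: float = 1.0, model_size: int = 384,
            warmup_steps: int = 4000) -> float:
    """Learning rate at a 1-based optimizer step under the Noam schedule.

    Defaults are copied from optim_conf and scheduler_conf in config.yaml.
    Assumption: the scheduler uses the textbook Noam formula.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup term and inverse-sqrt decay term; they meet at warmup_steps.
    warmup = step * warmup_steps ** -1.5
    decay = 1.0 / math.sqrt(step)
    return base_lr / math.sqrt(model_size) * min(warmup, decay)


# Print the schedule at a few steps, including the peak at warmup_steps.
for step in (1, 1000, 4000, 16000):
    print(f"step {step:>6}: lr = {noam_lr(step):.3e}")
```

Under this assumption the rate peaks at step 4000 with a value of `1 / sqrt(384 * 4000)`, which is on the order of 1e-3, and decays afterwards.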
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/backward_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/duration_loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/energy_loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/forward_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/iter_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/l1_loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/lr_0.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/optim_step_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/pitch_loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/images/train_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/train.loss.ave_5best.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c238de4a41024ab2c909db6defa3d63a8da7f822f116d84ba4412f8498b2641b
+ size 281767733
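The three lines added for `train.loss.ave_5best.pth` are a Git LFS pointer, not the checkpoint itself: the weights live in LFS storage and the repository only records the hash and size. A minimal sketch of reading such a pointer follows, assuming the `key value` line layout shown in the hunk above. The helper name `parse_lfs_pointer` is hypothetical.

```python
# Pointer text copied from the diff hunk above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c238de4a41024ab2c909db6defa3d63a8da7f822f116d84ba4412f8498b2641b
size 281767733
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its fields (hypothetical helper)."""
    # Each non-empty line is "<key> <value>", separated by the first space.
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(f"{pointer['hash_algo']} digest, {pointer['size_bytes'] / 2**20:.1f} MiB")
```

After downloading the real file, its SHA-256 digest and byte size can be compared with these fields to confirm the checkpoint arrived intact.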
exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/energy_stats.npz ADDED
Binary file (770 Bytes).
exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/feats_stats.npz ADDED
Binary file (1.4 kB).
exp/tts_train_tacotron2_raw_phn_pypinyin_g2p_phone/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/pitch_stats.npz ADDED
Binary file (770 Bytes).
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: 0.8.0
+ files:
+   model_file: exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/train.loss.ave_5best.pth
+ python: "3.7.3 (default, Mar 27 2019, 22:11:17) \n[GCC 7.3.0]"
+ timestamp: 1600227583.017457
+ torch: 1.6.0
+ yaml_files:
+   train_config: exp/tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone/config.yaml
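The `timestamp` field in `meta.yaml` is a bare float. Assuming it is Unix epoch seconds (the file itself does not say so), it can be turned into a readable date to see when the model was packed. The constant name `PACK_TIMESTAMP` is hypothetical.

```python
from datetime import datetime, timezone

# Value copied from meta.yaml; assumed to be seconds since the Unix epoch.
PACK_TIMESTAMP = 1600227583.017457

# Convert to a timezone-aware UTC datetime so the result does not
# depend on the local timezone of the machine running this.
packed_at = datetime.fromtimestamp(PACK_TIMESTAMP, tz=timezone.utc)
print(packed_at.isoformat())
```

Under that assumption the date falls on 16 September 2020 (UTC).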