Siddhant committed on
Commit c5262b3
1 Parent(s): f41eaf3

import from zenodo

Files changed (18)
  1. README.md +50 -0
  2. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/config.yaml +288 -0
  3. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/backward_time.png +0 -0
  4. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/duration_loss.png +0 -0
  5. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/energy_loss.png +0 -0
  6. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/forward_time.png +0 -0
  7. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/iter_time.png +0 -0
  8. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/l1_loss.png +0 -0
  9. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/loss.png +0 -0
  10. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/lr_0.png +0 -0
  11. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/optim_step_time.png +0 -0
  12. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/pitch_loss.png +0 -0
  13. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/train_time.png +0 -0
  14. exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/train.loss.ave_5best.pth +3 -0
  15. exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/energy_stats.npz +0 -0
  16. exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/feats_stats.npz +0 -0
  17. exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/pitch_stats.npz +0 -0
  18. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,50 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - text-to-speech
+ language: en
+ datasets:
+ - ljspeech
+ license: cc-by-4.0
+ ---
+ ## Example ESPnet2 TTS model
+ ### `kan-bayashi/ljspeech_conformer_fastspeech2`
+ ♻️ Imported from https://zenodo.org/record/4036268/
+
+ This model was trained by kan-bayashi using ljspeech/tts1 recipe in [espnet](https://github.com/espnet/espnet/).
+ ### Demo: How to use in ESPnet2
+ ```python
+ # coming soon
+ ```
+ ### Citing ESPnet
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   title={{ESPnet}: End-to-End Speech Processing Toolkit},
+   year={2018},
+   booktitle={Proceedings of Interspeech},
+   pages={2207--2211},
+   doi={10.21437/Interspeech.2018-1456},
+   url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+ @inproceedings{hayashi2020espnet,
+   title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
+   author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
+   booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+   pages={7654--7658},
+   year={2020},
+   organization={IEEE}
+ }
+ ```
+ or arXiv:
+ ```bibtex
+ @misc{watanabe2018espnet,
+   title={ESPnet: End-to-End Speech Processing Toolkit},
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Enrique Yalta Soplin and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   year={2018},
+   eprint={1804.00015},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/config.yaml ADDED
@@ -0,0 +1,288 @@
+ config: conf/tuning/train_conformer_fastspeech2.yaml
+ print_config: false
+ log_level: INFO
+ dry_run: false
+ iterator_type: sequence
+ output_dir: exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space
+ ngpu: 1
+ seed: 0
+ num_workers: 1
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: true
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 1000
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - valid
+   - loss
+   - min
+ - - train
+   - loss
+   - min
+ keep_nbest_models: 5
+ grad_clip: 1.0
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 10
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: null
+ pretrain_path: []
+ pretrain_key: []
+ num_iters_per_epoch: 500
+ batch_size: 20
+ valid_batch_size: null
+ batch_bins: 2400000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/text_shape.phn
+ - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/speech_shape
+ valid_shape_file:
+ - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/valid/text_shape.phn
+ - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/valid/speech_shape
+ batch_type: numel
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 204800
+ sort_in_batch: descending
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ train_data_path_and_name_and_type:
+ - - dump/raw/tr_no_dev/text
+   - text
+   - text
+ - - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/tr_no_dev/durations
+   - durations
+   - text_int
+ - - dump/raw/tr_no_dev/wav.scp
+   - speech
+   - sound
+ - - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/collect_feats/pitch.scp
+   - pitch
+   - npy
+ - - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/collect_feats/energy.scp
+   - energy
+   - npy
+ valid_data_path_and_name_and_type:
+ - - dump/raw/dev/text
+   - text
+   - text
+ - - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/dev/durations
+   - durations
+   - text_int
+ - - dump/raw/dev/wav.scp
+   - speech
+   - sound
+ - - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/valid/collect_feats/pitch.scp
+   - pitch
+   - npy
+ - - exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/valid/collect_feats/energy.scp
+   - energy
+   - npy
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ valid_max_cache_size: null
+ optim: adam
+ optim_conf:
+   lr: 1.0
+ scheduler: noamlr
+ scheduler_conf:
+   model_size: 384
+   warmup_steps: 4000
+ token_list:
+ - <blank>
+ - <unk>
+ - ..
+ - OY0
+ - UH0
+ - AW0
+ - '!'
+ - OY2
+ - '?'
+ - UH2
+ - ER2
+ - ''''
+ - AA0
+ - IY2
+ - AW2
+ - AY0
+ - AH2
+ - UW2
+ - AE0
+ - OW2
+ - ZH
+ - AO2
+ - EY0
+ - OY1
+ - EH0
+ - UW0
+ - AA2
+ - AY2
+ - AE2
+ - IH2
+ - AO0
+ - EY2
+ - OW0
+ - EH2
+ - UH1
+ - TH
+ - AW1
+ - Y
+ - JH
+ - CH
+ - ER1
+ - G
+ - NG
+ - SH
+ - OW1
+ - .
+ - AY1
+ - EY1
+ - AO1
+ - IY0
+ - UW1
+ - IY1
+ - HH
+ - B
+ - AA1
+ - ','
+ - F
+ - ER0
+ - V
+ - AH1
+ - AE1
+ - P
+ - W
+ - EH1
+ - M
+ - IH0
+ - IH1
+ - Z
+ - K
+ - DH
+ - L
+ - R
+ - S
+ - D
+ - T
+ - N
+ - AH0
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: tacotron
+ g2p: g2p_en_no_space
+ feats_extract: fbank
+ feats_extract_conf:
+   fs: 22050
+   fmin: 80
+   fmax: 7600
+   n_mels: 80
+   hop_length: 256
+   n_fft: 1024
+   win_length: null
+ normalize: global_mvn
+ normalize_conf:
+   stats_file: exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/feats_stats.npz
+ tts: fastspeech2
+ tts_conf:
+   adim: 384
+   aheads: 2
+   elayers: 4
+   eunits: 1536
+   dlayers: 4
+   dunits: 1536
+   positionwise_layer_type: conv1d
+   positionwise_conv_kernel_size: 3
+   duration_predictor_layers: 2
+   duration_predictor_chans: 256
+   duration_predictor_kernel_size: 3
+   postnet_layers: 5
+   postnet_filts: 5
+   postnet_chans: 256
+   use_masking: true
+   encoder_normalize_before: false
+   decoder_normalize_before: false
+   reduction_factor: 1
+   encoder_type: conformer
+   decoder_type: conformer
+   conformer_pos_enc_layer_type: rel_pos
+   conformer_self_attn_layer_type: rel_selfattn
+   conformer_activation_type: swish
+   use_macaron_style_in_conformer: true
+   use_cnn_in_conformer: true
+   conformer_enc_kernel_size: 7
+   conformer_dec_kernel_size: 31
+   init_type: xavier_uniform
+   transformer_enc_dropout_rate: 0.2
+   transformer_enc_positional_dropout_rate: 0.2
+   transformer_enc_attn_dropout_rate: 0.2
+   transformer_dec_dropout_rate: 0.2
+   transformer_dec_positional_dropout_rate: 0.2
+   transformer_dec_attn_dropout_rate: 0.2
+   pitch_predictor_layers: 5
+   pitch_predictor_chans: 256
+   pitch_predictor_kernel_size: 5
+   pitch_predictor_dropout: 0.5
+   pitch_embed_kernel_size: 1
+   pitch_embed_dropout: 0.0
+   stop_gradient_from_pitch_predictor: true
+   energy_predictor_layers: 2
+   energy_predictor_chans: 256
+   energy_predictor_kernel_size: 3
+   energy_predictor_dropout: 0.5
+   energy_embed_kernel_size: 1
+   energy_embed_dropout: 0.0
+   stop_gradient_from_energy_predictor: false
+ pitch_extract: dio
+ pitch_extract_conf:
+   fs: 22050
+   n_fft: 1024
+   hop_length: 256
+   f0max: 400
+   f0min: 80
+ pitch_normalize: global_mvn
+ pitch_normalize_conf:
+   stats_file: exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/pitch_stats.npz
+ energy_extract: energy
+ energy_extract_conf:
+   fs: 22050
+   n_fft: 1024
+   hop_length: 256
+   win_length: null
+ energy_normalize: global_mvn
+ energy_normalize_conf:
+   stats_file: exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/energy_stats.npz
+ required:
+ - output_dir
+ - token_list
+ distributed: false
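
Note on the optimizer settings in this config: `optim_conf.lr: 1.0` acts only as a scale factor, because `scheduler: noamlr` derives the effective rate from `model_size` and `warmup_steps`. The sketch below assumes the standard Noam formula (linear warmup followed by inverse-square-root decay). The function name `noam_lr` is illustrative and is not an ESPnet API.

```python
def noam_lr(step: int, base_lr: float = 1.0, model_size: int = 384,
            warmup_steps: int = 4000) -> float:
    """Effective learning rate at a 1-based scheduler step.

    Assumes the standard Noam schedule; defaults are the values from config.yaml.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    scale = model_size ** -0.5
    # Linear growth during warmup, inverse-square-root decay afterwards.
    return base_lr * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    for step in (1, 1000, 4000, 16000, 64000):
        print(f"step {step:>6}: lr = {noam_lr(step):.3e}")
```

Under these assumptions the rate peaks at step 4000 at roughly 8e-4 and halves each time the step count quadruples after that.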
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/backward_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/duration_loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/energy_loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/forward_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/iter_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/l1_loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/lr_0.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/optim_step_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/pitch_loss.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/images/train_time.png ADDED
exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/train.loss.ave_5best.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02b3f1af8dc9af8bee03999683092029d93dc21c6690044bc80b7ace59bec16a
+ size 281405749
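
Note on the checkpoint entry: the three added lines are a Git LFS pointer, not the weights themselves. A downloaded `train.loss.ave_5best.pth` can be checked against the pointer's size and SHA-256 digest. The sketch below uses only the standard library; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and do not belong to any Git LFS tooling.

```python
import hashlib
from pathlib import Path

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:02b3f1af8dc9af8bee03999683092029d93dc21c6690044bc80b7ace59bec16a
size 281405749
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict entry."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and SHA-256 digest match the pointer."""
    if pointer["algo"] != "sha256" or path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in chunks so a large checkpoint is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

A typical call would be `matches_pointer(Path("train.loss.ave_5best.pth"), parse_lfs_pointer(POINTER_TEXT))`, with the path adjusted to wherever the file was downloaded.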
exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/energy_stats.npz ADDED
Binary file (770 Bytes)
exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/feats_stats.npz ADDED
Binary file (1.4 kB)
exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/pitch_stats.npz ADDED
Binary file (770 Bytes)
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: 0.8.0
+ files:
+   model_file: exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/train.loss.ave_5best.pth
+ python: "3.7.3 (default, Mar 27 2019, 22:11:17) \n[GCC 7.3.0]"
+ timestamp: 1600432937.488829
+ torch: 1.6.0
+ yaml_files:
+   train_config: exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/config.yaml
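
Note on `meta.yaml`: the `timestamp` field looks like a Unix epoch time in seconds. That is an assumption, since the file does not state its unit. Under it, the value can be turned into a readable UTC date with the standard library:

```python
from datetime import datetime, timezone

META_TIMESTAMP = 1600432937.488829  # value copied from meta.yaml

# Interpret the value as seconds since 1970-01-01 UTC (assumed unit).
packed_at = datetime.fromtimestamp(META_TIMESTAMP, tz=timezone.utc)

if __name__ == "__main__":
    print(packed_at.isoformat())
```

Read this way, the model was packed in mid-September 2020.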