espnet/kan-bayashi_vctk_tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave

Siddhant committed on Oct 23, 2021

Commit 1015d84 • 1 parent: 1076861

import from zenodo

Files changed (27)

README.md +50 -0
dump/raw/org/tr_no_dev/spk2sid +109 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/config.yaml +390 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_backward_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_fake_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_forward_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_optim_step_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_real_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_train_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_adv_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_backward_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_dur_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_feat_match_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_forward_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_kl_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_mel_loss.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_optim_step_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_train_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/gpu_max_cached_mem_GB.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/iter_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/optim0_lr0.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/optim1_lr0.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/train_time.png +0 -0
exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/train.total_count.ave_10best.pth +3 -0
meta.yaml +8 -0

README.md ADDED

	@@ -0,0 +1,50 @@

+---
+tags:
+- espnet
+- audio
+- text-to-speech
+language: en
+datasets:
+- vctk
+license: cc-by-4.0
+---
+## ESPnet2 TTS pretrained model
+### `kan-bayashi/vctk_tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave`
+♻️ Imported from https://zenodo.org/record/5500759/
+This model was trained by kan-bayashi using vctk/tts1 recipe in [espnet](https://github.com/espnet/espnet/).
+### Demo: How to use in ESPnet2
+```python
+# coming soon
+```
+### Citing ESPnet
+```BibTex
+@inproceedings{watanabe2018espnet,
+  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+  title={{ESPnet}: End-to-End Speech Processing Toolkit},
+  year={2018},
+  booktitle={Proceedings of Interspeech},
+  pages={2207--2211},
+  doi={10.21437/Interspeech.2018-1456},
+  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+}
+@inproceedings{hayashi2020espnet,
+  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
+  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
+  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  pages={7654--7658},
+  year={2020},
+  organization={IEEE}
+}
+```
+or arXiv:
+```bibtex
+@misc{watanabe2018espnet,
+      title={ESPnet: End-to-End Speech Processing Toolkit},
+      author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Enrique Yalta Soplin and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+      year={2018},
+      eprint={1804.00015},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
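
The demo block in the README above is still a placeholder. A minimal sketch of how a model like this is typically loaded follows. It assumes that `espnet` and `espnet_model_zoo` are installed and that `Text2Speech.from_pretrained` accepts this repository's Hub name as a model tag. The helper name `synthesize` and the default speaker ID are illustrative and not part of the commit.

```python
# Hub repository name taken from the page header. Treating it as a valid
# model tag for from_pretrained is an assumption.
MODEL_TAG = (
    "espnet/kan-bayashi_vctk_tts_train_multi_spk_vits_raw_phn_"
    "tacotron_g2p_en_no_space_train.total_count.ave"
)


def synthesize(text, sid=1, model_tag=MODEL_TAG):
    """Hypothetical helper: return (waveform, sampling_rate) for one speaker."""
    # Imported lazily so this module can be imported without ESPnet installed.
    import numpy as np
    from espnet2.bin.tts_inference import Text2Speech

    tts = Text2Speech.from_pretrained(model_tag)  # downloads the checkpoint
    # sid is an integer speaker ID from spk2sid (for example, 1 for p225).
    out = tts(text, sids=np.array([sid]))
    return out["wav"], tts.fs
```

Calling `synthesize("Hello world.", sid=1)` would download the checkpoint (about 386 MB according to the LFS pointer in this commit) on first use, so it is best run once and cached.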

dump/raw/org/tr_no_dev/spk2sid ADDED

	@@ -0,0 +1,109 @@

+<unk> 0
+p225 1
+p226 2
+p227 3
+p228 4
+p229 5
+p230 6
+p231 7
+p232 8
+p233 9
+p234 10
+p236 11
+p237 12
+p238 13
+p239 14
+p240 15
+p241 16
+p243 17
+p244 18
+p245 19
+p246 20
+p247 21
+p248 22
+p249 23
+p250 24
+p251 25
+p252 26
+p253 27
+p254 28
+p255 29
+p256 30
+p257 31
+p258 32
+p259 33
+p260 34
+p261 35
+p262 36
+p263 37
+p264 38
+p265 39
+p266 40
+p267 41
+p268 42
+p269 43
+p270 44
+p271 45
+p272 46
+p273 47
+p274 48
+p275 49
+p276 50
+p277 51
+p278 52
+p279 53
+p280 54
+p281 55
+p282 56
+p283 57
+p284 58
+p285 59
+p286 60
+p287 61
+p288 62
+p292 63
+p293 64
+p294 65
+p295 66
+p297 67
+p298 68
+p299 69
+p300 70
+p301 71
+p302 72
+p303 73
+p304 74
+p305 75
+p306 76
+p307 77
+p308 78
+p310 79
+p311 80
+p312 81
+p313 82
+p314 83
+p316 84
+p317 85
+p318 86
+p323 87
+p326 88
+p329 89
+p330 90
+p333 91
+p334 92
+p335 93
+p336 94
+p339 95
+p340 96
+p341 97
+p343 98
+p345 99
+p347 100
+p351 101
+p360 102
+p361 103
+p362 104
+p363 105
+p364 106
+p374 107
+p376 108
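
Each line of `spk2sid` pairs a VCTK speaker label with the integer ID the model is conditioned on. A small sketch of reading that format into lookup tables is shown here. The function name `parse_spk2sid` and the four-line sample are illustrative, with the sample lines copied from the file above.

```python
def parse_spk2sid(text):
    """Parse 'speaker integer-ID' lines into a dict of speaker -> ID."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        speaker, sid = line.split()
        mapping[speaker] = int(sid)
    return mapping


# A few lines copied from the file above (without the diff '+' prefix).
SAMPLE = """\
<unk> 0
p225 1
p226 2
p376 108
"""

spk2sid = parse_spk2sid(SAMPLE)
sid2spk = {sid: spk for spk, sid in spk2sid.items()}  # reverse lookup

print(spk2sid["p225"])  # 1
print(sid2spk[108])     # p376
```

The full file has 109 entries (IDs 0 to 108), which fits within the `spks: 128` setting in `config.yaml`.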

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/config.yaml ADDED

	@@ -0,0 +1,390 @@

+config: ./conf/tuning/train_multi_spk_vits.yaml
+print_config: false
+log_level: INFO
+dry_run: false
+iterator_type: sequence
+output_dir: exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space
+ngpu: 1
+seed: 777
+num_workers: 4
+num_att_plot: 3
+dist_backend: nccl
+dist_init_method: env://
+dist_world_size: 4
+dist_rank: 0
+local_rank: 0
+dist_master_addr: localhost
+dist_master_port: 39150
+dist_launcher: null
+multiprocessing_distributed: true
+unused_parameters: true
+sharded_ddp: false
+cudnn_enabled: true
+cudnn_benchmark: false
+cudnn_deterministic: false
+collect_stats: false
+write_collected_feats: false
+max_epoch: 2000
+patience: null
+val_scheduler_criterion:
+- valid
+- loss
+early_stopping_criterion:
+- valid
+- loss
+- min
+best_model_criterion:
+-   - train
+    - total_count
+    - max
+keep_nbest_models: 10
+grad_clip: -1
+grad_clip_type: 2.0
+grad_noise: false
+accum_grad: 1
+no_forward_run: false
+resume: true
+train_dtype: float32
+use_amp: false
+log_interval: 50
+use_tensorboard: true
+use_wandb: false
+wandb_project: null
+wandb_id: null
+wandb_entity: null
+wandb_name: null
+wandb_model_log_interval: -1
+detect_anomaly: false
+pretrain_path: null
+init_param: []
+ignore_init_mismatch: false
+freeze_param: []
+num_iters_per_epoch: 500
+batch_size: 20
+valid_batch_size: null
+batch_bins: 3000000
+valid_batch_bins: null
+train_shape_file:
+- exp/tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/train/text_shape.phn
+- exp/tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/train/speech_shape
+valid_shape_file:
+- exp/tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/valid/text_shape.phn
+- exp/tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/valid/speech_shape
+batch_type: numel
+valid_batch_type: null
+fold_length:
+- 150
+- 204800
+sort_in_batch: descending
+sort_batch: descending
+multiple_iterator: false
+chunk_length: 500
+chunk_shift_ratio: 0.5
+num_cache_chunks: 1024
+train_data_path_and_name_and_type:
+-   - dump/raw/tr_no_dev/text
+    - text
+    - text
+-   - dump/raw/tr_no_dev/wav.scp
+    - speech
+    - sound
+-   - dump/raw/tr_no_dev/utt2sid
+    - sids
+    - text_int
+valid_data_path_and_name_and_type:
+-   - dump/raw/dev/text
+    - text
+    - text
+-   - dump/raw/dev/wav.scp
+    - speech
+    - sound
+-   - dump/raw/dev/utt2sid
+    - sids
+    - text_int
+allow_variable_data_keys: false
+max_cache_size: 0.0
+max_cache_fd: 32
+valid_max_cache_size: null
+optim: adamw
+optim_conf:
+    lr: 0.0002
+    betas:
+    - 0.8
+    - 0.99
+    eps: 1.0e-09
+    weight_decay: 0.0
+scheduler: exponentiallr
+scheduler_conf:
+    gamma: 0.999875
+optim2: adamw
+optim2_conf:
+    lr: 0.0002
+    betas:
+    - 0.8
+    - 0.99
+    eps: 1.0e-09
+    weight_decay: 0.0
+scheduler2: exponentiallr
+scheduler2_conf:
+    gamma: 0.999875
+generator_first: false
+token_list:
+- <blank>
+- <unk>
+- AH0
+- T
+- N
+- S
+- R
+- IH1
+- D
+- L
+- .
+- Z
+- DH
+- K
+- W
+- M
+- AE1
+- EH1
+- AA1
+- IH0
+- IY1
+- AH1
+- B
+- P
+- V
+- ER0
+- F
+- HH
+- AY1
+- EY1
+- UW1
+- IY0
+- AO1
+- OW1
+- G
+- ','
+- NG
+- SH
+- Y
+- JH
+- AW1
+- UH1
+- TH
+- ER1
+- CH
+- '?'
+- OW0
+- OW2
+- EH2
+- EY2
+- UW0
+- IH2
+- OY1
+- AY2
+- ZH
+- AW2
+- EH0
+- IY2
+- AA2
+- AE0
+- AH2
+- AE2
+- AO0
+- AO2
+- AY0
+- UW2
+- UH2
+- AA0
+- AW0
+- EY0
+- '!'
+- UH0
+- ER2
+- OY2
+- ''''
+- OY0
+- <sos/eos>
+odim: null
+model_conf: {}
+use_preprocessor: true
+token_type: phn
+bpemodel: null
+non_linguistic_symbols: null
+cleaner: tacotron
+g2p: g2p_en_no_space
+feats_extract: linear_spectrogram
+feats_extract_conf:
+    n_fft: 1024
+    hop_length: 256
+    win_length: null
+normalize: null
+normalize_conf: {}
+tts: vits
+tts_conf:
+    generator_type: vits_generator
+    generator_params:
+        hidden_channels: 192
+        spks: 128
+        global_channels: 256
+        segment_size: 32
+        text_encoder_attention_heads: 2
+        text_encoder_ffn_expand: 4
+        text_encoder_blocks: 6
+        text_encoder_positionwise_layer_type: conv1d
+        text_encoder_positionwise_conv_kernel_size: 3
+        text_encoder_positional_encoding_layer_type: rel_pos
+        text_encoder_self_attention_layer_type: rel_selfattn
+        text_encoder_activation_type: swish
+        text_encoder_normalize_before: true
+        text_encoder_dropout_rate: 0.1
+        text_encoder_positional_dropout_rate: 0.0
+        text_encoder_attention_dropout_rate: 0.1
+        use_macaron_style_in_text_encoder: true
+        use_conformer_conv_in_text_encoder: false
+        text_encoder_conformer_kernel_size: -1
+        decoder_kernel_size: 7
+        decoder_channels: 512
+        decoder_upsample_scales:
+        - 8
+        - 8
+        - 2
+        - 2
+        decoder_upsample_kernel_sizes:
+        - 16
+        - 16
+        - 4
+        - 4
+        decoder_resblock_kernel_sizes:
+        - 3
+        - 7
+        - 11
+        decoder_resblock_dilations:
+        -   - 1
+            - 3
+            - 5
+        -   - 1
+            - 3
+            - 5
+        -   - 1
+            - 3
+            - 5
+        use_weight_norm_in_decoder: true
+        posterior_encoder_kernel_size: 5
+        posterior_encoder_layers: 16
+        posterior_encoder_stacks: 1
+        posterior_encoder_base_dilation: 1
+        posterior_encoder_dropout_rate: 0.0
+        use_weight_norm_in_posterior_encoder: true
+        flow_flows: 4
+        flow_kernel_size: 5
+        flow_base_dilation: 1
+        flow_layers: 4
+        flow_dropout_rate: 0.0
+        use_weight_norm_in_flow: true
+        use_only_mean_in_flow: true
+        stochastic_duration_predictor_kernel_size: 3
+        stochastic_duration_predictor_dropout_rate: 0.5
+        stochastic_duration_predictor_flows: 4
+        stochastic_duration_predictor_dds_conv_layers: 3
+        vocabs: 77
+        aux_channels: 513
+    discriminator_type: hifigan_multi_scale_multi_period_discriminator
+    discriminator_params:
+        scales: 1
+        scale_downsample_pooling: AvgPool1d
+        scale_downsample_pooling_params:
+            kernel_size: 4
+            stride: 2
+            padding: 2
+        scale_discriminator_params:
+            in_channels: 1
+            out_channels: 1
+            kernel_sizes:
+            - 15
+            - 41
+            - 5
+            - 3
+            channels: 128
+            max_downsample_channels: 1024
+            max_groups: 16
+            bias: true
+            downsample_scales:
+            - 2
+            - 2
+            - 4
+            - 4
+            - 1
+            nonlinear_activation: LeakyReLU
+            nonlinear_activation_params:
+                negative_slope: 0.1
+            use_weight_norm: true
+            use_spectral_norm: false
+        follow_official_norm: false
+        periods:
+        - 2
+        - 3
+        - 5
+        - 7
+        - 11
+        period_discriminator_params:
+            in_channels: 1
+            out_channels: 1
+            kernel_sizes:
+            - 5
+            - 3
+            channels: 32
+            downsample_scales:
+            - 3
+            - 3
+            - 3
+            - 3
+            - 1
+            max_downsample_channels: 1024
+            bias: true
+            nonlinear_activation: LeakyReLU
+            nonlinear_activation_params:
+                negative_slope: 0.1
+            use_weight_norm: true
+            use_spectral_norm: false
+    generator_adv_loss_params:
+        average_by_discriminators: false
+        loss_type: mse
+    discriminator_adv_loss_params:
+        average_by_discriminators: false
+        loss_type: mse
+    feat_match_loss_params:
+        average_by_discriminators: false
+        average_by_layers: false
+        include_final_outputs: true
+    mel_loss_params:
+        fs: 22050
+        n_fft: 1024
+        hop_length: 256
+        win_length: null
+        window: hann
+        n_mels: 80
+        fmin: 0
+        fmax: null
+        log_base: null
+    lambda_adv: 1.0
+    lambda_mel: 45.0
+    lambda_feat_match: 2.0
+    lambda_dur: 1.0
+    lambda_kl: 1.0
+    sampling_rate: 22050
+    cache_generator_outputs: true
+pitch_extract: null
+pitch_extract_conf: {}
+pitch_normalize: null
+pitch_normalize_conf: {}
+energy_extract: null
+energy_extract_conf: {}
+energy_normalize: null
+energy_normalize_conf: {}
+required:
+- output_dir
+- token_list
+version: 0.10.3a1
+distributed: true
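
Several values in this config are tied together arithmetically. The sketch below recomputes them from constants copied out of the YAML above. The learning-rate helper assumes the exponential scheduler is stepped once per epoch, which the config itself does not state. If it is stepped per iteration instead, pass the iteration count.

```python
from functools import reduce
from operator import mul

# Values copied from config.yaml above.
N_FFT = 1024
HOP_LENGTH = 256
SAMPLING_RATE = 22050
SEGMENT_SIZE = 32               # frames per training segment
UPSAMPLE_SCALES = [8, 8, 2, 2]  # decoder_upsample_scales
AUX_CHANNELS = 513
BASE_LR = 0.0002
GAMMA = 0.999875

# The decoder turns one latent frame into hop_length waveform samples,
# so the product of its upsampling scales should equal hop_length.
upsample_factor = reduce(mul, UPSAMPLE_SCALES, 1)

# A linear spectrogram from an n_fft-point FFT has n_fft // 2 + 1 bins.
freq_bins = N_FFT // 2 + 1

# Length of one training segment in samples and in seconds.
segment_samples = SEGMENT_SIZE * HOP_LENGTH
segment_seconds = segment_samples / SAMPLING_RATE


def lr_after(steps, base_lr=BASE_LR, gamma=GAMMA):
    """Learning rate after `steps` scheduler steps of exponential decay."""
    return base_lr * gamma ** steps


print(upsample_factor)            # 256
print(freq_bins)                  # 513
print(segment_samples)            # 8192
print(round(segment_seconds, 3))
print(lr_after(2000))
```

With these numbers the decoder's upsampling factor matches `hop_length`, and `aux_channels` matches the spectrogram bin count. Likewise, `vocabs: 77` matches the 77 entries in `token_list`.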

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_backward_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_fake_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_forward_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_optim_step_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_real_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_train_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_adv_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_backward_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_dur_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_feat_match_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_forward_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_kl_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_mel_loss.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_optim_step_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_train_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/gpu_max_cached_mem_GB.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/iter_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/optim0_lr0.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/optim1_lr0.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/train_time.png ADDED

exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/train.total_count.ave_10best.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1906649e53770718880562149394be11177af79f8fa59b121168155af763af2c
+size 386076485
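
The three lines above are a Git LFS pointer, not the checkpoint itself: they record the hash algorithm, digest, and byte size of the real file. A sketch of checking a downloaded checkpoint against such a pointer, using only the standard library, is given below. The function names `parse_lfs_pointer` and `verify_file` are illustrative.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse 'key value' pointer lines into algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def verify_file(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` matches the pointer's size and digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False  # cheap size check before hashing
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in chunks so a large checkpoint is never fully held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer content copied from the diff above (without the '+' prefix).
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1906649e53770718880562149394be11177af79f8fa59b121168155af763af2c
size 386076485
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])  # sha256 386076485
```

After downloading, `verify_file("train.total_count.ave_10best.pth", pointer)` should return `True` for an intact copy.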

meta.yaml ADDED

	@@ -0,0 +1,8 @@

+espnet: 0.10.3a2
+files:
+  model_file: exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/train.total_count.ave_10best.pth
+python: "3.7.3 (default, Mar 27 2019, 22:11:17) \n[GCC 7.3.0]"
+timestamp: 1631321259.887765
+torch: 1.7.1
+yaml_files:
+  train_config: exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/config.yaml
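
The `timestamp` field in `meta.yaml` is a Unix epoch value. Converting it shows when the metadata was written, presumably when the model was packed, although the file itself does not say so. A short sketch using only the standard library:

```python
from datetime import datetime, timezone

# Value copied from meta.yaml above.
TIMESTAMP = 1631321259.887765

# Interpret the epoch seconds in UTC to avoid local-timezone ambiguity.
packed_at = datetime.fromtimestamp(TIMESTAMP, tz=timezone.utc)

print(packed_at.date().isoformat())  # 2021-09-11
```

That date is about six weeks before the Oct 23, 2021 import commit shown at the top of this page.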