wr committed
Commit 0233e7e
1 parent: 604eca0

add manifest and pretrained vocoders

README.md CHANGED
@@ -1,3 +1,48 @@
  ---
  license: mit
+ tags:
+ - speech
+ - text
+ - cross-modal
+ - unified model
+ - self-supervised learning
+ - SpeechT5
+ - Voice Conversion
+ datasets:
+ - CMU ARCTIC
+ - bdl
+ - clb
+ - rms
+ - slt
  ---
+
+ ## SpeechT5 VC Manifest
+
+ | [**Github**](https://github.com/microsoft/SpeechT5) | [**Huggingface**](https://huggingface.co/mechanicalsea/speecht5-vc) |
+
+ This manifest recreates the Voice Conversion recipe used for training [SpeechT5](https://aclanthology.org/2022.acl-long.393). It was constructed from four [CMU ARCTIC](http://www.festvox.org/cmu_arctic/) speakers (bdl, clb, rms, and slt), with 932 utterances for training, 100 utterances for validation, and 100 utterances for evaluation.
+
+ ### Requirements
+
+ - [SpeechBrain](https://github.com/speechbrain/speechbrain) for extracting speaker embeddings.
+ - [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) for the vocoder implementation.
+
+ ### Tools
+
+ - [manifest/utils](./manifest/utils/) extracts speaker embeddings, generates the manifest, and applies the vocoder.
+ - [manifest/arctic*](./manifest/) provides the pre-trained vocoder for each speaker.
+
+ ### Reference
+
+ If you find our work useful in your research, please cite the following paper:
+
+ ```bibtex
+ @inproceedings{ao-etal-2022-speecht5,
+     title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
+     author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
+     booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+     month = {May},
+     year = {2022},
+     pages = {5723--5738},
+ }
+ ```
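The split sizes above can be made concrete with a short sketch. Assuming (as the manifest defaults suggest) that every ordered source→target pairing of the four speakers is used, each utterance index yields 12 conversion pairs; CMU ARCTIC's 1132 parallel utterances are partitioned by consecutive index into train/valid/test:

```python
# Sketch: map the 932/100/100 split over CMU ARCTIC's 1132 parallel
# utterances to index ranges, and count ordered speaker pairs.
SPEAKERS = ["bdl", "clb", "rms", "slt"]
SPLIT_SIZES = {"train": 932, "valid": 100, "test": 100}

def split_indices(sizes):
    """Assign consecutive utterance indices to each split."""
    splits, start = {}, 0
    for name, n in sizes.items():
        splits[name] = list(range(start, start + n))
        start += n
    return splits

splits = split_indices(SPLIT_SIZES)
# Ordered (source, target) pairs with distinct speakers: 4 * 3 = 12.
pairs = [(s, t) for s in SPEAKERS for t in SPEAKERS if s != t]
```

So the training split covers indices 0..931, validation 932..1031, and evaluation 1032..1131, matching the ranges hard-coded in `manifest/utils/cmu_arctic_manifest.py`.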
manifest/.DS_Store ADDED
Binary file (8.2 kB).
manifest/arctic_bdl_parallel_wavegan.v1/.DS_Store ADDED
Binary file (6.15 kB).
manifest/arctic_bdl_parallel_wavegan.v1/config.yml ADDED
@@ -0,0 +1,104 @@
+ allow_cache: true
+ batch_max_steps: 15360
+ batch_size: 10
+ config: conf/parallel_wavegan.v1.yaml
+ dev_dumpdir: dump/dev_bdl/norm
+ dev_feats_scp: null
+ dev_segments: null
+ dev_wav_scp: null
+ discriminator_grad_norm: 1
+ discriminator_optimizer_params:
+   eps: 1.0e-06
+   lr: 5.0e-05
+   weight_decay: 0.0
+ discriminator_params:
+   bias: true
+   conv_channels: 64
+   in_channels: 1
+   kernel_size: 3
+   layers: 10
+   nonlinear_activation: LeakyReLU
+   nonlinear_activation_params:
+     negative_slope: 0.2
+   out_channels: 1
+   use_weight_norm: true
+ discriminator_scheduler_params:
+   gamma: 0.5
+   step_size: 200000
+ discriminator_train_start_steps: 100000
+ distributed: false
+ eval_interval_steps: 1000
+ fft_size: 1024
+ fmax: 7600
+ fmin: 80
+ format: npy
+ generator_grad_norm: 10
+ generator_optimizer_params:
+   eps: 1.0e-06
+   lr: 0.0001
+   weight_decay: 0.0
+ generator_params:
+   aux_channels: 80
+   aux_context_window: 2
+   dropout: 0.0
+   gate_channels: 128
+   in_channels: 1
+   kernel_size: 3
+   layers: 30
+   out_channels: 1
+   residual_channels: 64
+   skip_channels: 64
+   stacks: 3
+   upsample_net: ConvInUpsampleNetwork
+   upsample_params:
+     upsample_scales:
+     - 4
+     - 4
+     - 4
+     - 4
+   use_weight_norm: true
+ generator_scheduler_params:
+   gamma: 0.5
+   step_size: 200000
+ global_gain_scale: 1.0
+ hop_size: 256
+ lambda_adv: 4.0
+ log_interval_steps: 100
+ num_mels: 80
+ num_save_intermediate_results: 4
+ num_workers: 2
+ outdir: exp/train_nodev_bdl_arctic_parallel_wavegan.v1
+ pin_memory: true
+ pretrain: ''
+ rank: 0
+ remove_short_samples: true
+ resume: /mnt/default/v-junyiao/vc_vocoder2/train_nodev_bdl_arctic_parallel_wavegan.v1/checkpoint-135000steps.pkl
+ sampling_rate: 16000
+ save_interval_steps: 5000
+ stft_loss_params:
+   fft_sizes:
+   - 1024
+   - 2048
+   - 512
+   hop_sizes:
+   - 120
+   - 240
+   - 50
+   win_lengths:
+   - 600
+   - 1200
+   - 240
+   window: hann_window
+ train_dumpdir: dump/train_nodev_bdl/norm
+ train_feats_scp: null
+ train_max_steps: 400000
+ train_segments: null
+ train_wav_scp: null
+ trim_frame_size: 2048
+ trim_hop_size: 512
+ trim_silence: false
+ trim_threshold_in_db: 60
+ verbose: 1
+ version: 0.4.8
+ win_length: null
+ window: hann
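A quick sanity check on the config above: in Parallel WaveGAN, the generator's `upsample_scales` must multiply out to `hop_size`, so each 80-dim mel frame expands to exactly one hop of waveform samples. The numbers in this config are consistent:

```python
# Check implied by the vocoder config: prod(upsample_scales) == hop_size.
from math import prod

hop_size = 256
upsample_scales = [4, 4, 4, 4]
assert prod(upsample_scales) == hop_size

# Frames-to-samples bookkeeping for one training window:
# batch_max_steps waveform samples correspond to this many mel frames.
batch_max_steps = 15360
frames_per_window = batch_max_steps // hop_size  # 60 frames
```

The same relationship holds for the clb, rms, and slt configs below, which share these values.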
manifest/arctic_bdl_parallel_wavegan.v1/pwg-arctic-bdl-400000steps.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f92557c6c61c2acc3a7f74533b291f03eae891963adee06d2e901922886c803c
+ size 5918653
manifest/arctic_bdl_parallel_wavegan.v1/stats.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7c186bca19c4ed7bc4d93dd7aacd3db9d8ca6186fd5d5e8d64b7b19cde03637c
+ size 768
manifest/arctic_clb_parallel_wavegan.v1/.DS_Store ADDED
Binary file (6.15 kB).
manifest/arctic_clb_parallel_wavegan.v1/config.yml ADDED
@@ -0,0 +1,104 @@
+ allow_cache: true
+ batch_max_steps: 15360
+ batch_size: 10
+ config: conf/parallel_wavegan.v1.yaml
+ dev_dumpdir: dump/dev_clb/norm
+ dev_feats_scp: null
+ dev_segments: null
+ dev_wav_scp: null
+ discriminator_grad_norm: 1
+ discriminator_optimizer_params:
+   eps: 1.0e-06
+   lr: 5.0e-05
+   weight_decay: 0.0
+ discriminator_params:
+   bias: true
+   conv_channels: 64
+   in_channels: 1
+   kernel_size: 3
+   layers: 10
+   nonlinear_activation: LeakyReLU
+   nonlinear_activation_params:
+     negative_slope: 0.2
+   out_channels: 1
+   use_weight_norm: true
+ discriminator_scheduler_params:
+   gamma: 0.5
+   step_size: 200000
+ discriminator_train_start_steps: 100000
+ distributed: false
+ eval_interval_steps: 1000
+ fft_size: 1024
+ fmax: 7600
+ fmin: 80
+ format: npy
+ generator_grad_norm: 10
+ generator_optimizer_params:
+   eps: 1.0e-06
+   lr: 0.0001
+   weight_decay: 0.0
+ generator_params:
+   aux_channels: 80
+   aux_context_window: 2
+   dropout: 0.0
+   gate_channels: 128
+   in_channels: 1
+   kernel_size: 3
+   layers: 30
+   out_channels: 1
+   residual_channels: 64
+   skip_channels: 64
+   stacks: 3
+   upsample_net: ConvInUpsampleNetwork
+   upsample_params:
+     upsample_scales:
+     - 4
+     - 4
+     - 4
+     - 4
+   use_weight_norm: true
+ generator_scheduler_params:
+   gamma: 0.5
+   step_size: 200000
+ global_gain_scale: 1.0
+ hop_size: 256
+ lambda_adv: 4.0
+ log_interval_steps: 100
+ num_mels: 80
+ num_save_intermediate_results: 4
+ num_workers: 2
+ outdir: exp/train_nodev_clb_arctic_parallel_wavegan.v1
+ pin_memory: true
+ pretrain: ''
+ rank: 0
+ remove_short_samples: true
+ resume: /mnt/default/v-junyiao/vc_vocoder2/train_nodev_clb_arctic_parallel_wavegan.v1/checkpoint-135000steps.pkl
+ sampling_rate: 16000
+ save_interval_steps: 5000
+ stft_loss_params:
+   fft_sizes:
+   - 1024
+   - 2048
+   - 512
+   hop_sizes:
+   - 120
+   - 240
+   - 50
+   win_lengths:
+   - 600
+   - 1200
+   - 240
+   window: hann_window
+ train_dumpdir: dump/train_nodev_clb/norm
+ train_feats_scp: null
+ train_max_steps: 400000
+ train_segments: null
+ train_wav_scp: null
+ trim_frame_size: 2048
+ trim_hop_size: 512
+ trim_silence: false
+ trim_threshold_in_db: 60
+ verbose: 1
+ version: 0.4.8
+ win_length: null
+ window: hann
manifest/arctic_clb_parallel_wavegan.v1/pwg-arctic-clb-400000steps.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e80e448926a2b5b38de076fa8cc9e38589712d95ed08705bc7f242910c15ec4e
+ size 5918653
manifest/arctic_clb_parallel_wavegan.v1/stats.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23ef7d65275668849dc7c5bb876d78b8e3657f5e1ca299b76eb3ca6ce9c2370e
+ size 768
manifest/arctic_rms_parallel_wavegan.v1/.DS_Store ADDED
Binary file (6.15 kB).
manifest/arctic_rms_parallel_wavegan.v1/config.yml ADDED
@@ -0,0 +1,104 @@
+ allow_cache: true
+ batch_max_steps: 15360
+ batch_size: 10
+ config: conf/parallel_wavegan.v1.yaml
+ dev_dumpdir: dump/dev_rms/norm
+ dev_feats_scp: null
+ dev_segments: null
+ dev_wav_scp: null
+ discriminator_grad_norm: 1
+ discriminator_optimizer_params:
+   eps: 1.0e-06
+   lr: 5.0e-05
+   weight_decay: 0.0
+ discriminator_params:
+   bias: true
+   conv_channels: 64
+   in_channels: 1
+   kernel_size: 3
+   layers: 10
+   nonlinear_activation: LeakyReLU
+   nonlinear_activation_params:
+     negative_slope: 0.2
+   out_channels: 1
+   use_weight_norm: true
+ discriminator_scheduler_params:
+   gamma: 0.5
+   step_size: 200000
+ discriminator_train_start_steps: 100000
+ distributed: false
+ eval_interval_steps: 1000
+ fft_size: 1024
+ fmax: 7600
+ fmin: 80
+ format: npy
+ generator_grad_norm: 10
+ generator_optimizer_params:
+   eps: 1.0e-06
+   lr: 0.0001
+   weight_decay: 0.0
+ generator_params:
+   aux_channels: 80
+   aux_context_window: 2
+   dropout: 0.0
+   gate_channels: 128
+   in_channels: 1
+   kernel_size: 3
+   layers: 30
+   out_channels: 1
+   residual_channels: 64
+   skip_channels: 64
+   stacks: 3
+   upsample_net: ConvInUpsampleNetwork
+   upsample_params:
+     upsample_scales:
+     - 4
+     - 4
+     - 4
+     - 4
+   use_weight_norm: true
+ generator_scheduler_params:
+   gamma: 0.5
+   step_size: 200000
+ global_gain_scale: 1.0
+ hop_size: 256
+ lambda_adv: 4.0
+ log_interval_steps: 100
+ num_mels: 80
+ num_save_intermediate_results: 4
+ num_workers: 2
+ outdir: exp/train_nodev_rms_arctic_parallel_wavegan.v1
+ pin_memory: true
+ pretrain: ''
+ rank: 0
+ remove_short_samples: true
+ resume: /mnt/default/v-junyiao/vc_vocoder2/train_nodev_rms_arctic_parallel_wavegan.v1/checkpoint-110000steps.pkl
+ sampling_rate: 16000
+ save_interval_steps: 5000
+ stft_loss_params:
+   fft_sizes:
+   - 1024
+   - 2048
+   - 512
+   hop_sizes:
+   - 120
+   - 240
+   - 50
+   win_lengths:
+   - 600
+   - 1200
+   - 240
+   window: hann_window
+ train_dumpdir: dump/train_nodev_rms/norm
+ train_feats_scp: null
+ train_max_steps: 400000
+ train_segments: null
+ train_wav_scp: null
+ trim_frame_size: 2048
+ trim_hop_size: 512
+ trim_silence: false
+ trim_threshold_in_db: 60
+ verbose: 1
+ version: 0.4.8
+ win_length: null
+ window: hann
manifest/arctic_rms_parallel_wavegan.v1/pwg-arctic-rms-400000steps.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d70ed1c03eada2e8616731292a885e9bbb8406f5859afee5003704725f23d876
+ size 5918653
manifest/arctic_rms_parallel_wavegan.v1/stats.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3332906cb47d19988579ddb6c513a7f5fd3bb4ba3b1704c1327e11726a47cac8
+ size 768
manifest/arctic_slt_parallel_wavegan.v1/.DS_Store ADDED
Binary file (6.15 kB).
manifest/arctic_slt_parallel_wavegan.v1/config.yml ADDED
@@ -0,0 +1,94 @@
+ batch_max_steps: 15360
+ batch_size: 10
+ config: conf/parallel_wavegan.v1.yaml
+ dev_dumpdir: dump/dev/norm
+ discriminator_grad_norm: 1
+ discriminator_optimizer_params:
+   eps: 1.0e-06
+   lr: 5.0e-05
+   weight_decay: 0.0
+ discriminator_params:
+   bias: true
+   conv_channels: 64
+   in_channels: 1
+   kernel_size: 3
+   layers: 10
+   nonlinear_activation: LeakyReLU
+   nonlinear_activation_params:
+     negative_slope: 0.2
+   out_channels: 1
+   use_weight_norm: true
+ discriminator_scheduler_params:
+   gamma: 0.5
+   step_size: 200000
+ discriminator_train_start_steps: 100000
+ eval_interval_steps: 1000
+ fft_size: 1024
+ fmax: 7600
+ fmin: 80
+ format: npy
+ # hdf5
+ generator_grad_norm: 10
+ generator_optimizer_params:
+   eps: 1.0e-06
+   lr: 0.0001
+   weight_decay: 0.0
+ generator_params:
+   aux_channels: 80
+   aux_context_window: 2
+   dropout: 0.0
+   gate_channels: 128
+   in_channels: 1
+   kernel_size: 3
+   layers: 30
+   out_channels: 1
+   residual_channels: 64
+   skip_channels: 64
+   stacks: 3
+   upsample_net: ConvInUpsampleNetwork
+   upsample_params:
+     upsample_scales:
+     - 4
+     - 4
+     - 4
+     - 4
+   use_weight_norm: true
+ generator_scheduler_params:
+   gamma: 0.5
+   step_size: 200000
+ global_gain_scale: 1.0
+ hop_size: 256
+ lambda_adv: 4.0
+ log_interval_steps: 100
+ num_mels: 80
+ num_save_intermediate_results: 4
+ num_workers: 8
+ outdir: exp/train_nodev_arctic_slt_parallel_wavegan.v1
+ pin_memory: true
+ remove_short_samples: true
+ resume: exp/train_nodev_arctic_slt_parallel_wavegan.v1/checkpoint-300000steps.pkl
+ sampling_rate: 16000
+ save_interval_steps: 5000
+ stft_loss_params:
+   fft_sizes:
+   - 1024
+   - 2048
+   - 512
+   hop_sizes:
+   - 120
+   - 240
+   - 50
+   win_lengths:
+   - 600
+   - 1200
+   - 240
+   window: hann_window
+ train_dumpdir: dump/train_nodev/norm
+ train_max_steps: 400000
+ trim_frame_size: 2048
+ trim_hop_size: 512
+ trim_silence: false
+ trim_threshold_in_db: 60
+ verbose: 0
+ win_length: null
+ window: hann
manifest/arctic_slt_parallel_wavegan.v1/pwg-arctic-slt-400000steps.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:477686935b56f0eed684de9a31fb0f35600e4ce84b81e488c2b850fd07e630db
+ size 5918525
manifest/arctic_slt_parallel_wavegan.v1/stats.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8af46bfcde0d79c2d3936e25fbc7b59fb5043f064fb9fa53cd2323c8ea64abe1
+ size 768
manifest/dict.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:036438c7cb5fc860b1d1066a3b111542515b1d4ac1f5a79a15a2322e8f79f402
+ size 309
manifest/test.tsv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9126dfb852be724b1d595ea69dc2adf96eaf2dd5ee2fe113a30229de3539491c
+ size 170418
manifest/train.tsv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:067e049d317083e49ae22c7f5582a28253c1b24ba7988cb95b362eb1938e3553
+ size 1588164
manifest/utils/cmu_arctic_manifest.py ADDED
@@ -0,0 +1,90 @@
+ import argparse
+ import os
+
+ from torchaudio.datasets import CMUARCTIC
+ from tqdm import tqdm
+
+
+ SPLITS = {
+     "train": list(range(   0,  932)),
+     "valid": list(range( 932, 1032)),
+     "test":  list(range(1032, 1132)),
+ }
+
+
+ def get_parser():
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "root", metavar="DIR", help="root directory containing wav files to index"
+     )
+     parser.add_argument(
+         "--dest", default=".", type=str, metavar="DIR", help="output directory"
+     )
+     parser.add_argument(
+         "--source", default="bdl,clb,slt,rms", type=str, help="Source voices from slt, clb, bdl, rms."
+     )
+     parser.add_argument(
+         "--target", default="bdl,clb,slt,rms", type=str, help="Target voices from slt, clb, bdl, rms."
+     )
+     parser.add_argument(
+         "--splits", default="932,100,100", type=str, help="Sizes of the train, valid, and test splits, separated by commas."
+     )
+     parser.add_argument(
+         "--wav-root", default=None, type=str, metavar="DIR", help="saved waveform root directory for tsv"
+     )
+     parser.add_argument(
+         "--spkemb-npy-dir", required=True, type=str, help="speaker embedding directory"
+     )
+     return parser
+
+
+ def main(args):
+     dest_dir = args.dest
+     wav_root = args.wav_root
+     if not os.path.exists(dest_dir):
+         os.makedirs(dest_dir)
+
+     source = args.source.split(",")
+     target = args.target.split(",")
+     spks = sorted(set(source + target))
+     datasets = {}
+
+     datasets["slt"] = CMUARCTIC(args.root, url="slt", folder_in_archive="ARCTIC", download=False)
+     for spk in spks:
+         if spk != "slt":
+             datasets[spk] = CMUARCTIC(args.root, url=spk, folder_in_archive="ARCTIC", download=False)
+             datasets[spk]._walker = list(datasets["slt"]._walker)  # some text sentences are missing
+     if "slt" not in spks:
+         del datasets["slt"]
+
+     num_splits = [int(n_split) for n_split in args.splits.split(",")]
+     assert sum(num_splits) == 1132, f"Missing utterances: {sum(num_splits)} != 1132"
+
+     tsv = {}
+     for split in SPLITS.keys():
+         tsv[split] = open(os.path.join(dest_dir, f"{split}.tsv"), "w")
+         print(wav_root, file=tsv[split])
+
+     for split, indices in SPLITS.items():
+         for i in tqdm(indices, desc=f"[{'-'.join(spks)}]tsv/wav/spk"):
+             for src_spk in source:
+                 for tgt_spk in target:
+                     if src_spk == tgt_spk:
+                         continue
+                     # each dataset item is (wav, sample_rate, utterance, utt_no)
+                     src_i = datasets[src_spk][i]
+                     tgt_i = datasets[tgt_spk][i]
+                     assert src_i[1] == tgt_i[1], f"{src_i[1]}-{tgt_i[1]}"
+                     assert src_i[3] == tgt_i[3], f"{src_i[3]}-{tgt_i[3]}"
+                     src_wav = os.path.join(os.path.basename(datasets[src_spk]._path), datasets[src_spk]._folder_audio, f"arctic_{src_i[3]}.wav")
+                     src_nframes = src_i[0].shape[-1]
+                     tgt_wav = os.path.join(os.path.basename(datasets[tgt_spk]._path), datasets[tgt_spk]._folder_audio, f"arctic_{tgt_i[3]}.wav")
+                     tgt_nframes = tgt_i[0].shape[-1]
+                     tgt_spkemb = os.path.join(args.spkemb_npy_dir, f"{os.path.basename(datasets[tgt_spk]._path)}-{datasets[tgt_spk]._folder_audio}-arctic_{tgt_i[3]}.npy")
+                     print(f"{src_wav}\t{src_nframes}\t{tgt_wav}\t{tgt_nframes}\t{tgt_spkemb}", file=tsv[split])
+     for split in tsv.keys():
+         tsv[split].close()
+
+
+ if __name__ == "__main__":
+     parser = get_parser()
+     args = parser.parse_args()
+     main(args)
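The manifest script writes one row per (utterance, source speaker, target speaker) triple, with five tab-separated fields: source wav path, source length in samples, target wav path, target length in samples, and the target speaker-embedding `.npy` path. A minimal sketch of that row format (the paths and lengths below are hypothetical, for illustration only):

```python
# Sketch of the five-column TSV row format emitted by the manifest
# script. The concrete paths and frame counts here are made up.
def make_row(src_wav, src_nframes, tgt_wav, tgt_nframes, tgt_spkemb):
    return f"{src_wav}\t{src_nframes}\t{tgt_wav}\t{tgt_nframes}\t{tgt_spkemb}"

row = make_row(
    "cmu_us_bdl_arctic/wav/arctic_a0001.wav", 45000,
    "cmu_us_slt_arctic/wav/arctic_a0001.wav", 47000,
    "spkrec-xvect/cmu_us_slt_arctic-wav-arctic_a0001.npy",
)
fields = row.split("\t")
```

Note that the first line of each `.tsv` file is the waveform root directory (`--wav-root`), not a data row.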
manifest/utils/make_tsv.sh ADDED
@@ -0,0 +1,10 @@
+ #!/bin/bash
+ # bash utils/make_tsv.sh /root/data/cmu_arctic/ /root/data/cmu_arctic/cmu_arctic_finetuning_meta /opt/tiger/ARCTIC
+ root=$1
+ dest=$2
+ wav_root=$3
+ spkemb_split=$4
+ if [ -z "${spkemb_split}" ]; then
+     spkemb_split=spkrec-xvect
+ fi
+ python utils/cmu_arctic_manifest.py ${root} --dest ${dest} --wav-root ${wav_root} --spkemb-npy-dir ${spkemb_split}
manifest/utils/prep_cmu_arctic_spkemb.py ADDED
@@ -0,0 +1,68 @@
+ import os
+ import glob
+ import numpy
+ import argparse
+ import torchaudio
+ from speechbrain.pretrained import EncoderClassifier
+ import torch
+ from tqdm import tqdm
+ import torch.nn.functional as F
+
+ spk_model = {
+     "speechbrain/spkrec-xvect-voxceleb": 512,
+     "speechbrain/spkrec-ecapa-voxceleb": 192,
+ }
+
+ def f2embed(wav_file, classifier, size_embed):
+     signal, fs = torchaudio.load(wav_file)
+     assert fs == 16000, fs
+     with torch.no_grad():
+         embeddings = classifier.encode_batch(signal)
+         embeddings = F.normalize(embeddings, dim=2)
+         embeddings = embeddings.squeeze().cpu().numpy()
+     assert embeddings.shape[0] == size_embed, embeddings.shape[0]
+     return embeddings
+
+ def process(args):
+     wavlst = []
+     for split in args.splits.split(","):
+         wav_dir = os.path.join(args.arctic_root, split)
+         wavlst_split = glob.glob(os.path.join(wav_dir, "wav", "*.wav"))
+         print(f"{split} {len(wavlst_split)} utterances.")
+         wavlst.extend(wavlst_split)
+
+     spkemb_root = args.output_root
+     if not os.path.exists(spkemb_root):
+         print(f"Create speaker embedding directory: {spkemb_root}")
+         os.mkdir(spkemb_root)
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     classifier = EncoderClassifier.from_hparams(source=args.speaker_embed, run_opts={"device": device}, savedir=os.path.join("/tmp", args.speaker_embed))
+     size_embed = spk_model[args.speaker_embed]
+     for utt_i in tqdm(wavlst, total=len(wavlst), desc="Extract"):
+         # TODO rename speaker embedding
+         utt_id = "-".join(utt_i.split("/")[-3:]).replace(".wav", "")
+         utt_emb = f2embed(utt_i, classifier, size_embed)
+         numpy.save(os.path.join(spkemb_root, f"{utt_id}.npy"), utt_emb)
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--arctic-root", "-i", required=True, type=str, help="CMU ARCTIC root directory.")
+     parser.add_argument("--output-root", "-o", required=True, type=str, help="Output directory.")
+     parser.add_argument("--speaker-embed", "-s", type=str, required=True, choices=["speechbrain/spkrec-xvect-voxceleb", "speechbrain/spkrec-ecapa-voxceleb"],
+                         help="Pretrained model for extracting speaker embeddings.")
+     parser.add_argument("--splits", type=str, help="Directories of the four speakers, separated by commas.",
+                         default="cmu_us_bdl_arctic,cmu_us_clb_arctic,cmu_us_rms_arctic,cmu_us_slt_arctic")
+     args = parser.parse_args()
+     print(f"Loading utterances from {args.arctic_root}/{args.splits}, "
+           + f"saving speaker embedding 'npy' files to {args.output_root}, "
+           + f"using speaker model {args.speaker_embed} with embedding size {spk_model[args.speaker_embed]}.")
+     process(args)
+
+ if __name__ == "__main__":
+     """
+     python utils/prep_cmu_arctic_spkemb.py \
+         -i /root/data/cmu_arctic/CMUARCTIC \
+         -o /root/data/cmu_arctic/CMUARCTIC/spkrec-xvect \
+         -s speechbrain/spkrec-xvect-voxceleb
+     """
+     main()
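The embedding file names this script produces are derived from the wav path: the last three path components joined with `-`, with the `.wav` suffix dropped. This is also the name the manifest script expects in `--spkemb-npy-dir`. A stdlib-only sketch of that derivation (the example path is hypothetical):

```python
# Sketch of the utterance-ID scheme used for the saved .npy files:
# "-".join(last three path components), minus the ".wav" suffix.
def utt_id_from_path(wav_path):
    return "-".join(wav_path.split("/")[-3:]).replace(".wav", "")

uid = utt_id_from_path(
    "/root/data/cmu_arctic/CMUARCTIC/cmu_us_slt_arctic/wav/arctic_a0001.wav"
)
# uid is "cmu_us_slt_arctic-wav-arctic_a0001", so the saved file is
# <output-root>/cmu_us_slt_arctic-wav-arctic_a0001.npy
```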
manifest/utils/spec2wav.sh ADDED
File without changes
manifest/valid.tsv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a0d3fc2569593894864f881f2027c46b9ea39fcb01f0e6cdbacc8213dfa8dd6f
+ size 170418