wr committed on
Commit
31ad50e
1 Parent(s): f9fe32e

set *.tsv and *.txt to large file

.gitattributes CHANGED
@@ -29,3 +29,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.txt filter=lfs diff=lfs merge=lfs -text
+ *.tsv filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,43 @@
  ---
  license: mit
+ tags:
+ - speech
+ - text
+ - cross-modal
+ - unified model
+ - self-supervised learning
+ - SpeechT5
+ datasets:
+ - LibriTTS
  ---
+
+ ## SpeechT5 TTS Manifest
+
+ | [**GitHub**](https://github.com/microsoft/SpeechT5) | [**Hugging Face**](https://huggingface.co/mechanicalsea/speecht5-tts) |
+
+ This manifest is an attempt to recreate the text-to-speech (TTS) recipe used for training [SpeechT5](https://aclanthology.org/2022.acl-long.393). It was built from the [LibriTTS](http://www.openslr.org/60/) clean subsets: train-clean-100 and train-clean-360 for training, dev-clean for validation, and test-clean for evaluation. The test-clean-200 manifest lists the 200 utterance IDs used for the mean opinion score (MOS) and comparison mean opinion score (CMOS) evaluations.
+
+ ### Requirements
+
+ - [SpeechBrain](https://github.com/speechbrain/speechbrain) for extracting speaker embeddings.
+ - [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) for the vocoder implementation.
+
+ ### Tools
+
+ - [manifest/utils](./manifest/utils/) contains scripts to downsample waveforms, extract speaker embeddings, generate the manifest, and apply the vocoder.
+ - [pretrained_vocoder](./pretrained_vocoder/) provides the pre-trained vocoder.
+
+ ### Reference
+
+ If you find our work useful in your research, please cite the following paper:
+
+ ```bibtex
+ @inproceedings{ao-etal-2022-speecht5,
+     title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
+     author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
+     booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+     month = {May},
+     year = {2022},
+     pages = {5723--5738},
+ }
+ ```
manifest/TTS_examples.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c8e2db9c6294f35bd8952435aa506ebe38d5e7b5aebf01dee3e086f4d4f9685f
+ size 8018
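Because `*.txt` and `*.tsv` are now routed through Git LFS, the diff for each manifest file shows only a three-line pointer (`version`, `oid`, `size`) rather than the file contents. As a minimal sketch, assuming the pointer is plain `key value` text as displayed above, it can be read like this (`parse_lfs_pointer` is a hypothetical helper name, not part of this repository or of Git LFS):

```python
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each non-empty 'key value' line of a pointer file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer text copied from the manifest/TTS_examples.txt hunk above.
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:c8e2db9c6294f35bd8952435aa506ebe38d5e7b5aebf01dee3e086f4d4f9685f\n"
    "size 8018\n"
)
info = parse_lfs_pointer(pointer)
algo, _, digest = info["oid"].partition(":")  # hash algorithm and hex digest
size_bytes = int(info["size"])                # size of the real file in bytes
```

The `size` field is the byte size of the actual file, so the pointers alone already show that `train-clean-360.tsv` (about 13.6 MB) is the largest manifest in this commit.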
manifest/dev-clean.tsv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c6cf77f21f3dab7dc8ca5e8470ee45f2ed1907304b05f1245f21febda73ea7d7
+ size 635339
manifest/dev-clean.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec6d57b715e17da05dc462846d9fd1309e2f10c844cf2cc8566807741905ccd7
+ size 548224
manifest/dict.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:036438c7cb5fc860b1d1066a3b111542515b1d4ac1f5a79a15a2322e8f79f402
+ size 309
manifest/spm_char.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7fcc48f3e225f627b1641db410ceb0c8649bd2b0c982e150b03f8be3728ab560
+ size 238473
manifest/test-clean-200.tsv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b22354b2f305ba791d7efb72246a8ddb01cc832fcd1dcd123245faa9aa0a7931
+ size 22150
manifest/test-clean-200.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:39431d3e311a3a47935411d819c94c4f28161022cdc426f0b7f3d9dc0be9c569
+ size 22526
manifest/test-clean.tsv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:979bb2256a8138cf0492e2aa07628b815891bd0d81ac6a98d9d5d6889a176291
+ size 535922
manifest/test-clean.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2c4470877fc16c4135723c4bfe0784d47f0211bf6b12088ec6d293bbf5e4fac1
+ size 508964
manifest/train-clean-100.tsv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c93390d311316c02d6e7da4bf5ab0b93cb922f80b075f6dfc30ff14c33b33bf0
+ size 3864578
manifest/train-clean-100.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e40a9a117e7f588390bcb188ffad54830c37621a38d1e6e1f3f3f4e13885d863
+ size 3180343
manifest/train-clean-360.tsv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d14e7dfea4e60753aa6b882ee64472cf340174ff707c1e0f69e590b4373676ba
+ size 13582849
manifest/train-clean-360.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c03d42d6310f67293b3010ee207da940e1ba03adf1924f1e5b959d9370f73037
+ size 11483749
manifest/utils/libritts_manifest.py ADDED
@@ -0,0 +1,120 @@
+ import argparse
+ import os
+ from typing import Optional, Tuple
+
+ from scipy.io import wavfile
+ from torchaudio.datasets import LIBRITTS
+ from tqdm import tqdm
+
+
+ def load_libritts_item(
+     fileid: str,
+     path: str,
+     ext_audio: str,
+     ext_original_txt: str,
+     ext_normalized_txt: str,
+ ) -> Tuple[int, int, Optional[str], str, int, int, str]:
+     speaker_id, chapter_id, segment_id, utterance_id = fileid.split("_")
+     utterance_id = fileid
+
+     normalized_text = utterance_id + ext_normalized_txt
+     normalized_text = os.path.join(path, speaker_id, chapter_id, normalized_text)
+
+     original_text = utterance_id + ext_original_txt
+     original_text = os.path.join(path, speaker_id, chapter_id, original_text)
+
+     file_audio = utterance_id + ext_audio
+     file_audio = os.path.join(path, speaker_id, chapter_id, file_audio)
+
+     # Load audio (only the length and sample rate are needed for the manifest)
+     sample_rate, wav = wavfile.read(file_audio)
+     n_frames = wav.shape[0]
+
+     # Load original text
+     # with open(original_text) as ft:
+     #     original_text = ft.readline()
+
+     # Load normalized text
+     with open(normalized_text, "r") as ft:
+         normalized_text = ft.readline().rstrip("\n")  # avoid blank lines in the .txt output
+
+     return (
+         n_frames,
+         sample_rate,
+         None,  # original text is not loaded
+         normalized_text,
+         int(speaker_id),
+         int(chapter_id),
+         utterance_id,
+     )
+
+
+ class LIBRITTS_16K(LIBRITTS):
+     def __getitem__(self, n: int) -> Tuple[int, int, Optional[str], str, int, int, str]:
+         """Load the n-th sample from the dataset.
+
+         Args:
+             n (int): The index of the sample to be loaded
+
+         Returns:
+             (int, int, None, str, int, int, str):
+             ``(waveform_length, sample_rate, original_text, normalized_text, speaker_id, chapter_id, utterance_id)``
+         """
+         fileid = self._walker[n]
+         return load_libritts_item(
+             fileid,
+             self._path,
+             self._ext_audio,
+             self._ext_original_txt,
+             self._ext_normalized_txt,
+         )
+
+
+ def get_parser():
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "root", metavar="DIR", help="root directory containing wav files to index"
+     )
+     parser.add_argument(
+         "--dest", default=".", type=str, metavar="DIR", help="output directory"
+     )
+     parser.add_argument(
+         "--split", required=True, type=str, help="dataset splits"
+     )
+     parser.add_argument(
+         "--wav-root", required=True, type=str, metavar="DIR", help="saved waveform root directory for tsv"
+     )
+     parser.add_argument(
+         "--spkemb-npy-dir", required=True, type=str, help="speaker embedding directory"
+     )
+     return parser
+
+ def main(args):
+     dest_dir = args.dest
+     wav_root = args.wav_root
+     if not os.path.exists(dest_dir):
+         os.makedirs(dest_dir)
+
+     dataset = LIBRITTS_16K(os.path.dirname(args.root), url=args.split, folder_in_archive=os.path.basename(args.root))
+     tsv_f = open(os.path.join(dest_dir, f"{args.split}.tsv"), "w")
+     txt_f = open(os.path.join(dest_dir, f"{args.split}.txt"), "w")
+     print(wav_root, file=tsv_f)  # first line of the tsv is the waveform root
+
+     for n_frames, sr, ori_text, norm_text, spk_id, chap_id, utt_id in tqdm(dataset, desc="tsv/txt/wav"):
+         assert sr == 16000, f"sampling rate {sr} != 16000"
+         utt_file = os.path.join(args.split, f"{spk_id}", f"{chap_id}", f"{utt_id}.wav")
+         spk_file = os.path.join(args.spkemb_npy_dir, f"{spk_id}-{chap_id}-{utt_id}.npy")
+         assert os.path.exists(os.path.join(wav_root, utt_file))
+         assert os.path.exists(os.path.join(wav_root, spk_file))
+
+         print(f"{utt_file}\t{n_frames}\t{spk_file}", file=tsv_f)
+         print(norm_text, file=txt_f)
+
+     tsv_f.close()
+     txt_f.close()
+
+
+ if __name__ == "__main__":
+     parser = get_parser()
+     args = parser.parse_args()
+     main(args)
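The script above writes two parallel files per split: a `.tsv` whose first line is the waveform root and whose remaining lines are `wav path <TAB> sample count <TAB> speaker-embedding path`, and a `.txt` with one normalized transcript per line in the same order. A minimal sketch of reading them back follows. `ManifestRow` and `read_manifest` are hypothetical names, and the root path, sample count, and transcript in the example are illustrative values rather than entries from the real manifest.

```python
from typing import List, NamedTuple, Sequence, Tuple


class ManifestRow(NamedTuple):
    wav_path: str      # path relative to the root on the first tsv line
    n_frames: int      # number of audio samples at 16 kHz
    spkemb_path: str   # speaker-embedding .npy path, also relative to the root
    text: str          # normalized transcript from the matching .txt line


def read_manifest(
    tsv_lines: Sequence[str], txt_lines: Sequence[str]
) -> Tuple[str, List[ManifestRow]]:
    """Pair every tsv data line with the transcript on the same row."""
    root = tsv_lines[0].rstrip("\n")
    rows = []
    for tsv_line, txt_line in zip(tsv_lines[1:], txt_lines):
        wav_path, n_frames, spkemb_path = tsv_line.rstrip("\n").split("\t")
        rows.append(
            ManifestRow(wav_path, int(n_frames), spkemb_path, txt_line.rstrip("\n"))
        )
    return root, rows


# Illustrative input only: these values are made up for the example.
tsv = [
    "/data/LibriTTS_16k\n",
    "dev-clean/84/121123/84_121123_000007_000001.wav\t32000\t"
    "spkrec-xvect/84-121123-84_121123_000007_000001.npy\n",
]
txt = ["An example transcript.\n"]

root, rows = read_manifest(tsv, txt)
duration_s = rows[0].n_frames / 16000  # the script asserts a 16 kHz sample rate
```

Since the script asserts a 16 kHz sample rate, dividing the second column by 16000 gives the utterance duration in seconds.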
manifest/utils/make_tsv_txt.sh ADDED
@@ -0,0 +1,13 @@
+ #!/bin/bash
+ # bash utils/make_tsv_txt.sh /mnt/bn/wangrui2022/wangrui2022/libritts/LibriTTS_16k /opt/tiger/libritts_finetuning_meta /opt/tiger/LibriTTS_16k
+ root=$1
+ dest=$2
+ wav_root=$3
+ spkemb_split=$4
+ if [ -z "${spkemb_split}" ]; then
+     spkemb_split=spkrec-xvect
+ fi
+ for split in dev-clean test-clean train-clean-100 train-clean-360; do
+     echo "making ${split}.tsv and ${split}.txt ..."
+     python utils/libritts_manifest.py "${root}" --dest "${dest}" --split "${split}" --wav-root "${wav_root}" --spkemb-npy-dir "${spkemb_split}"
+ done
manifest/utils/prep_libritts_spkemb.py ADDED
@@ -0,0 +1,72 @@
+ import os
+ import glob
+ import numpy
+ import argparse
+ import torchaudio
+ from speechbrain.pretrained import EncoderClassifier
+ import torch
+ from tqdm import tqdm
+ import torch.nn.functional as F
+ import torchaudio.transforms as T
+
+ spk_model = {
+     "speechbrain/spkrec-xvect-voxceleb": 512,
+     "speechbrain/spkrec-ecapa-voxceleb": 192,
+ }
+
+ def f2embed(wav_file, classifier, size_embed, resampler=None):
+     signal, fs = torchaudio.load(wav_file)
+     if fs != 16000:
+         assert fs == 24000, fs  # only the original 24 kHz LibriTTS audio is resampled
+         signal = resampler(signal)
+         fs = 16000
+     assert fs == 16000, fs
+     with torch.no_grad():
+         embeddings = classifier.encode_batch(signal)
+         embeddings = F.normalize(embeddings, dim=2)  # L2-normalize along the embedding axis
+         embeddings = embeddings.squeeze().cpu().numpy()
+     assert embeddings.shape[0] == size_embed, embeddings.shape[0]
+     return embeddings
+
+ def process(args):
+     wavlst = []
+     for split in args.splits.split(","):
+         wav_dir = os.path.join(args.libritts_root, split)
+         wavlst_split = glob.glob(os.path.join(wav_dir, "*", "*", "*.wav"))
+         print(f"{split} {len(wavlst_split)} utterances.")
+         wavlst.extend(wavlst_split)
+     spkemb_root = args.output_root
+     if not os.path.exists(spkemb_root):
+         print(f"Create speaker embedding directory: {spkemb_root}")
+         os.makedirs(spkemb_root)
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     classifier = EncoderClassifier.from_hparams(source=args.speaker_embed, run_opts={"device": device}, savedir='/tmp')
+     size_embed = spk_model[args.speaker_embed]
+     resampler = T.Resample(24000, 16000)
+     for utt_i in tqdm(wavlst, total=len(wavlst), desc="Extract"):
+         utt_id = "-".join(os.path.normpath(utt_i).split(os.sep)[-3:]).replace(".wav", "")  # speaker-chapter-utterance
+         utt_emb = f2embed(utt_i, classifier, size_embed, resampler)
+         numpy.save(os.path.join(spkemb_root, f"{utt_id}.npy"), utt_emb)
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--libritts-root", "-i", required=True, type=str, help="LibriTTS root directory.")
+     parser.add_argument("--output-root", "-o", required=True, type=str, help="Output directory.")
+     parser.add_argument("--speaker-embed", "-s", type=str, required=True, choices=["speechbrain/spkrec-xvect-voxceleb", "speechbrain/spkrec-ecapa-voxceleb"],
+                         help="Pretrained model for extracting speaker embedding.")
+     parser.add_argument("--splits", default="train-clean-100,train-clean-360,dev-clean,test-clean", type=str,
+                         help="Splits of train,dev,test separated by commas.")
+     args = parser.parse_args()
+     print(f"Loading utterances from {args.libritts_root}/{args.splits}, "
+           + f"Save speaker embedding 'npy' to {args.output_root}, "
+           + f"Using speaker model {args.speaker_embed} with {spk_model[args.speaker_embed]} size.")
+     process(args)
+
+ if __name__ == "__main__":
+     """
+     python examples/text_to_speech/prep_libritts_spkemb.py \
+         -i /mnt/default/v-junyiao/dataset/Original/LibriTTS \
+         -o /mnt/default/v-junyiao/dataset/Original/LibriTTS/spkrec-ecapa \
+         -s speechbrain/spkrec-ecapa-voxceleb
+     """
+     main()
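Two details of the script above matter for the manifest: the `.npy` file name is built from the last three path components (speaker, chapter, utterance) joined with `-`, which is the name `libritts_manifest.py` later expects, and each embedding is scaled to unit L2 norm. The sketch below illustrates both steps using only the standard library. `embedding_name` and `l2_normalize` are hypothetical helpers, not functions from the repository, and the `eps` floor is assumed to mirror the default of `torch.nn.functional.normalize`.

```python
import math
from pathlib import PurePosixPath
from typing import List, Sequence


def embedding_name(wav_path: str) -> str:
    """Join speaker, chapter, and utterance components into '<spk>-<chap>-<utt>.npy'."""
    parts = PurePosixPath(wav_path).with_suffix("").parts[-3:]
    return "-".join(parts) + ".npy"


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


# Illustrative path following the LibriTTS <split>/<speaker>/<chapter>/<utt>.wav layout.
name = embedding_name("/data/LibriTTS/dev-clean/84/121123/84_121123_000007_000001.wav")
unit = l2_normalize([3.0, 4.0])
unit_norm = math.sqrt(sum(v * v for v in unit))
```

Keeping the naming rule in one place would avoid the two scripts drifting apart, since the manifest script asserts that every embedding file exists under the expected name.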
manifest/utils/resample_libritts.py ADDED
@@ -0,0 +1,33 @@
+ from pathlib import Path
+ from shutil import copyfile
+ import soundfile as sf
+ import librosa
+ import os
+
+ #LibriTTS
+ # 1.6G /root/data/libritts/LibriTTS/dev-clean
+ # 1.5G /root/data/libritts/LibriTTS/test-clean
+ # 9.1G /root/data/libritts/LibriTTS/train-clean-100
+ # 33G /root/data/libritts/LibriTTS/train-clean-360
+ # 44G /root/data/libritts/LibriTTS
+
+ #LibriTTS_16k
+
+ # The pattern "**" means all subdirectories recursively,
+ # with "*" matching every entry; wav files are resampled and all other files are copied.
+ dest_dir = Path("/root/data/libritts/LibriTTS_16k")
+ dest_dir.mkdir(exist_ok=True)
+ for file in Path("/root/data/libritts/LibriTTS").glob("**/*"):
+     if not file.is_file():  # Skip directories
+         continue
+
+     file = str(file)
+     new_path = Path(file.replace('LibriTTS', 'LibriTTS_16k'))
+     os.makedirs(new_path.parent, exist_ok=True)
+     if file.endswith('wav'):
+         audio, fs = sf.read(file)
+         x = librosa.resample(audio, orig_sr=fs, target_sr=16000)  # keyword arguments are required by recent librosa
+         sf.write(str(new_path), x, 16000)
+         # librosa.output.write_wav(str(new_path), x, 16000)
+     else:
+         copyfile(file, str(new_path))
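The script mirrors the source tree by replacing the substring `LibriTTS` in each path, which works for the hard-coded paths above but would also rewrite any other path component containing that substring. A sketch of an alternative mapping based on relative paths is shown below, together with the sample count to expect after 24 kHz to 16 kHz resampling. `mirrored_path` and `resampled_length` are hypothetical helpers, and the round-up rule is an assumption about the resampler's behaviour, so the result should be treated as approximate to within one sample.

```python
from pathlib import PurePosixPath


def mirrored_path(src_file: str, src_root: str, dst_root: str) -> PurePosixPath:
    """Map a file under src_root to the same relative location under dst_root."""
    return PurePosixPath(dst_root) / PurePosixPath(src_file).relative_to(src_root)


def resampled_length(n_samples: int, orig_sr: int = 24000, target_sr: int = 16000) -> int:
    """Expected sample count after resampling, rounded up (assumed rounding rule)."""
    return -(-n_samples * target_sr // orig_sr)  # integer ceiling division


new_path = mirrored_path(
    "/root/data/libritts/LibriTTS/dev-clean/84/121123/84_121123_000007_000001.wav",
    "/root/data/libritts/LibriTTS",
    "/root/data/libritts/LibriTTS_16k",
)
n_out = resampled_length(48000)  # two seconds of 24 kHz audio
```

The 2:3 ratio between the two sample rates means the 16 kHz copy holds two thirds as many samples as the original, which is the count that ends up in the second column of the `.tsv` manifests.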
manifest/utils/spec2wav.sh ADDED
@@ -0,0 +1,8 @@
+ feats_root=$1
+ wav_root=$(dirname "${feats_root}")/gen_wav
+
+ parallel-wavegan-decode \
+     --checkpoint train_nodev_clean_libritts_hifigan.v1/hifigan-libritts-1930000steps.pkl \
+     --dumpdir "${feats_root}" \
+     --outdir "${wav_root}" \
+     --normalize-before
pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/config.yml ADDED
@@ -0,0 +1,192 @@
+ allow_cache: false
+ batch_max_steps: 8192
+ batch_size: 16
+ config: conf/hifigan.v1.yaml
+ dev_dumpdir: dump/dev_clean/norm
+ dev_feats_scp: null
+ dev_segments: null
+ dev_wav_scp: null
+ discriminator_adv_loss_params:
+   average_by_discriminators: false
+ discriminator_grad_norm: -1
+ discriminator_optimizer_params:
+   betas:
+   - 0.5
+   - 0.9
+   lr: 0.0002
+   weight_decay: 0.0
+ discriminator_optimizer_type: Adam
+ discriminator_params:
+   follow_official_norm: true
+   period_discriminator_params:
+     bias: true
+     channels: 32
+     downsample_scales:
+     - 3
+     - 3
+     - 3
+     - 3
+     - 1
+     in_channels: 1
+     kernel_sizes:
+     - 5
+     - 3
+     max_downsample_channels: 1024
+     nonlinear_activation: LeakyReLU
+     nonlinear_activation_params:
+       negative_slope: 0.1
+     out_channels: 1
+     use_spectral_norm: false
+     use_weight_norm: true
+   periods:
+   - 2
+   - 3
+   - 5
+   - 7
+   - 11
+   scale_discriminator_params:
+     bias: true
+     channels: 128
+     downsample_scales:
+     - 4
+     - 4
+     - 4
+     - 4
+     - 1
+     in_channels: 1
+     kernel_sizes:
+     - 15
+     - 41
+     - 5
+     - 3
+     max_downsample_channels: 1024
+     max_groups: 16
+     nonlinear_activation: LeakyReLU
+     nonlinear_activation_params:
+       negative_slope: 0.1
+     out_channels: 1
+   scale_downsample_pooling: AvgPool1d
+   scale_downsample_pooling_params:
+     kernel_size: 4
+     padding: 2
+     stride: 2
+   scales: 3
+ discriminator_scheduler_params:
+   gamma: 0.5
+   milestones:
+   - 200000
+   - 400000
+   - 600000
+   - 800000
+ discriminator_scheduler_type: MultiStepLR
+ discriminator_train_start_steps: 0
+ discriminator_type: HiFiGANMultiScaleMultiPeriodDiscriminator
+ distributed: true
+ eval_interval_steps: 1000
+ feat_match_loss_params:
+   average_by_discriminators: false
+   average_by_layers: false
+   include_final_outputs: false
+ fft_size: 1024
+ fmax: 7600
+ fmin: 80
+ format: npy
+ generator_adv_loss_params:
+   average_by_discriminators: false
+ generator_grad_norm: -1
+ generator_optimizer_params:
+   betas:
+   - 0.5
+   - 0.9
+   lr: 0.0002
+   weight_decay: 0.0
+ generator_optimizer_type: Adam
+ generator_params:
+   bias: true
+   channels: 512
+   in_channels: 80
+   kernel_size: 7
+   nonlinear_activation: LeakyReLU
+   nonlinear_activation_params:
+     negative_slope: 0.1
+   out_channels: 1
+   resblock_dilations:
+   - - 1
+     - 3
+     - 5
+   - - 1
+     - 3
+     - 5
+   - - 1
+     - 3
+     - 5
+   resblock_kernel_sizes:
+   - 3
+   - 7
+   - 11
+   upsample_kernal_sizes:
+   - 8
+   - 8
+   - 8
+   - 8
+   upsample_scales:
+   - 4
+   - 4
+   - 4
+   - 4
+   use_additional_convs: true
+   use_weight_norm: true
+ generator_scheduler_params:
+   gamma: 0.5
+   milestones:
+   - 200000
+   - 400000
+   - 600000
+   - 800000
+ generator_scheduler_type: MultiStepLR
+ generator_train_start_steps: 1
+ generator_type: HiFiGANGenerator
+ global_gain_scale: 1.0
+ hop_size: 256
+ lambda_adv: 1.0
+ lambda_aux: 45.0
+ lambda_feat_match: 2.0
+ log_interval_steps: 100
+ mel_loss_params:
+   fft_size: 1024
+   fmax: 7600
+   fmin: 80
+   fs: 16000
+   hop_size: 256
+   log_base: null
+   num_mels: 80
+   win_length: 1024
+   window: hann
+ num_mels: 80
+ num_save_intermediate_results: 4
+ num_workers: 4
+ outdir: exp/train_nodev_clean_libritts_hifigan.v1
+ pin_memory: true
+ pretrain: ''
+ rank: 1
+ remove_short_samples: false
+ resume: /mnt/default/v-junyiao/libritts_vocoder2/train_nodev_clean_libritts_hifigan.v1/checkpoint-50000steps.pkl
+ sampling_rate: 16000
+ save_interval_steps: 10000
+ train_dumpdir: dump/train_nodev_clean/norm
+ train_feats_scp: null
+ train_max_steps: 2500000
+ train_segments: null
+ train_wav_scp: null
+ trim_frame_size: 1024
+ trim_hop_size: 256
+ trim_silence: false
+ trim_threshold_in_db: 20
+ use_feat_match_loss: true
+ use_mel_loss: true
+ use_stft_loss: false
+ verbose: 1
+ version: 0.5.1
+ win_length: 1024
+ window: hann
+ world_size: 2
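Several values in this config are tied together: the generator's `upsample_scales` must multiply out to `hop_size` so that one mel frame expands to exactly one hop of audio, and the `MultiStepLR` milestones determine the learning rate at the released checkpoint (1,930,000 steps, per the checkpoint file name). The sketch below recomputes those quantities from the values shown above. The closed-form schedule assumes the scheduler is stepped once per training step, and `multistep_lr` is a hypothetical helper rather than part of Parallel WaveGAN.

```python
from bisect import bisect_right
from math import prod

# Values copied from config.yml above.
sampling_rate = 16000
hop_size = 256
batch_max_steps = 8192
upsample_scales = [4, 4, 4, 4]
base_lr = 0.0002
gamma = 0.5
milestones = [200000, 400000, 600000, 800000]

total_upsampling = prod(upsample_scales)            # samples generated per mel frame
frame_shift_ms = 1000 * hop_size / sampling_rate    # time between mel frames
frames_per_second = sampling_rate / hop_size
frames_per_segment = batch_max_steps // hop_size    # mel frames in one training segment


def multistep_lr(step: int) -> float:
    """Learning rate after `step` updates: halved once per milestone already reached."""
    return base_lr * gamma ** bisect_right(milestones, step)


lr_at_checkpoint = multistep_lr(1930000)
```

With these settings each mel frame covers 16 ms of audio, and every milestone has passed by the time of the released checkpoint, so the learning rate there is one sixteenth of its initial value.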
pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/hifigan-libritts-1930000steps.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0b119deddc85a78061bed39aaa5c2f9a8093e2701c46d9a0f9a25b2ac52457e4
+ size 333645593
pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/stats.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c1a72747d543205699e741ae3092d83b233b30e4974fe1991d553d11e895c535
+ size 768