wr
committed on
Commit
·
31ad50e
1
Parent(s):
f9fe32e
set *.tsv and *.txt to large file
- .gitattributes +2 -0
- README.md +40 -0
- manifest/TTS_examples.txt +3 -0
- manifest/dev-clean.tsv +3 -0
- manifest/dev-clean.txt +3 -0
- manifest/dict.txt +3 -0
- manifest/spm_char.model +3 -0
- manifest/test-clean-200.tsv +3 -0
- manifest/test-clean-200.txt +3 -0
- manifest/test-clean.tsv +3 -0
- manifest/test-clean.txt +3 -0
- manifest/train-clean-100.tsv +3 -0
- manifest/train-clean-100.txt +3 -0
- manifest/train-clean-360.tsv +3 -0
- manifest/train-clean-360.txt +3 -0
- manifest/utils/libritts_manifest.py +120 -0
- manifest/utils/make_tsv_txt.sh +13 -0
- manifest/utils/prep_libritts_spkemb.py +72 -0
- manifest/utils/resample_libritts.py +33 -0
- manifest/utils/spec2wav.sh +8 -0
- pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/config.yml +192 -0
- pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/hifigan-libritts-1930000steps.pkl +3 -0
- pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/stats.npy +3 -0
.gitattributes
CHANGED
@@ -29,3 +29,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.txt filter=lfs diff=lfs merge=lfs -text
+*.tsv filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,3 +1,43 @@
 ---
 license: mit
+tags:
+- speech
+- text
+- cross-modal
+- unified model
+- self-supervised learning
+- SpeechT5
+datasets:
+- LibriTTS
 ---
+
+## SpeechT5 TTS Manifest
+
+| [**GitHub**](https://github.com/microsoft/SpeechT5) | [**Hugging Face**](https://huggingface.co/mechanicalsea/speecht5-tts) |
+
+This manifest is an attempt to recreate the Text-to-Speech recipe used for training [SpeechT5](https://aclanthology.org/2022.acl-long.393). It was constructed from the [LibriTTS](http://www.openslr.org/60/) clean subsets: train-clean-100 and train-clean-360 for training, dev-clean for validation, and test-clean for evaluation. The test-clean-200 manifest lists the 200 utterance IDs used for the mean opinion score (MOS) and comparative mean opinion score (CMOS) evaluations.
+
+### Requirements
+
+- [SpeechBrain](https://github.com/speechbrain/speechbrain) for extracting speaker embeddings.
+- [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) for implementing the vocoder.
+
+### Tools
+
+- [manifest/utils](./manifest/utils/) is used to downsample waveforms, extract speaker embeddings, generate the manifest, and apply the vocoder.
+- [pretrained_vocoder](./pretrained_vocoder/) provides the pre-trained vocoder.
+
+### Reference
+
+If you find our work useful in your research, please cite the following paper:
+
+```bibtex
+@inproceedings{ao-etal-2022-speecht5,
+  title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
+  author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
+  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+  month = {May},
+  year = {2022},
+  pages = {5723--5738},
+}
+```
manifest/TTS_examples.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c8e2db9c6294f35bd8952435aa506ebe38d5e7b5aebf01dee3e086f4d4f9685f
+size 8018
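Every `*.txt`, `*.tsv`, model, and checkpoint file in this commit is stored as a Git LFS pointer: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer, assuming only that three-line layout; `parse_lfs_pointer` is a hypothetical helper and not part of this repository:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the text of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form "<algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algorithm": algorithm,
        "oid": digest,
        "size": int(fields["size"]),
    }


# Pointer content copied from manifest/TTS_examples.txt in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:c8e2db9c6294f35bd8952435aa506ebe38d5e7b5aebf01dee3e086f4d4f9685f
size 8018
"""
info = parse_lfs_pointer(pointer)
print(info["oid_algorithm"], info["size"])
```

The `size` field is the size of the real file, so it can serve as a quick sanity check after `git lfs pull` has replaced the pointer with the actual content.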
manifest/dev-clean.tsv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c6cf77f21f3dab7dc8ca5e8470ee45f2ed1907304b05f1245f21febda73ea7d7
+size 635339
manifest/dev-clean.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ec6d57b715e17da05dc462846d9fd1309e2f10c844cf2cc8566807741905ccd7
+size 548224
manifest/dict.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:036438c7cb5fc860b1d1066a3b111542515b1d4ac1f5a79a15a2322e8f79f402
+size 309
manifest/spm_char.model
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7fcc48f3e225f627b1641db410ceb0c8649bd2b0c982e150b03f8be3728ab560
+size 238473
manifest/test-clean-200.tsv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b22354b2f305ba791d7efb72246a8ddb01cc832fcd1dcd123245faa9aa0a7931
+size 22150
manifest/test-clean-200.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:39431d3e311a3a47935411d819c94c4f28161022cdc426f0b7f3d9dc0be9c569
+size 22526
manifest/test-clean.tsv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:979bb2256a8138cf0492e2aa07628b815891bd0d81ac6a98d9d5d6889a176291
+size 535922
manifest/test-clean.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2c4470877fc16c4135723c4bfe0784d47f0211bf6b12088ec6d293bbf5e4fac1
+size 508964
manifest/train-clean-100.tsv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c93390d311316c02d6e7da4bf5ab0b93cb922f80b075f6dfc30ff14c33b33bf0
+size 3864578
manifest/train-clean-100.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e40a9a117e7f588390bcb188ffad54830c37621a38d1e6e1f3f3f4e13885d863
+size 3180343
manifest/train-clean-360.tsv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d14e7dfea4e60753aa6b882ee64472cf340174ff707c1e0f69e590b4373676ba
+size 13582849
manifest/train-clean-360.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c03d42d6310f67293b3010ee207da940e1ba03adf1924f1e5b959d9370f73037
+size 11483749
manifest/utils/libritts_manifest.py
ADDED
@@ -0,0 +1,120 @@
+import argparse
+import os
+from typing import Optional, Tuple
+
+from scipy.io import wavfile
+from torchaudio.datasets import LIBRITTS
+from tqdm import tqdm
+
+
+def load_libritts_item(
+    fileid: str,
+    path: str,
+    ext_audio: str,
+    ext_original_txt: str,
+    ext_normalized_txt: str,
+) -> Tuple[int, int, Optional[str], str, int, int, str]:
+    # File IDs have the form "<speaker>_<chapter>_<segment>_<utterance>";
+    # the full file ID is used as the utterance ID because it names the files on disk.
+    speaker_id, chapter_id, segment_id, utterance_id = fileid.split("_")
+    utterance_id = fileid
+
+    normalized_text = utterance_id + ext_normalized_txt
+    normalized_text = os.path.join(path, speaker_id, chapter_id, normalized_text)
+
+    original_text = utterance_id + ext_original_txt
+    original_text = os.path.join(path, speaker_id, chapter_id, original_text)
+
+    file_audio = utterance_id + ext_audio
+    file_audio = os.path.join(path, speaker_id, chapter_id, file_audio)
+
+    # Load audio (only the sample count is kept, not the waveform)
+    sample_rate, wav = wavfile.read(file_audio)
+    n_frames = wav.shape[0]
+
+    # Load original text
+    # with open(original_text) as ft:
+    #     original_text = ft.readline()
+
+    # Load normalized text
+    with open(normalized_text, "r") as ft:
+        normalized_text = ft.readline()
+
+    return (
+        n_frames,
+        sample_rate,
+        None,  # the original text is not loaded
+        normalized_text,
+        int(speaker_id),
+        int(chapter_id),
+        utterance_id,
+    )
+
+
+class LIBRITTS_16K(LIBRITTS):
+    def __getitem__(self, n: int) -> Tuple[int, int, Optional[str], str, int, int, str]:
+        """Load the n-th sample from the dataset.
+
+        Args:
+            n (int): The index of the sample to be loaded
+
+        Returns:
+            (int, int, None, str, int, int, str):
+            ``(waveform_length, sample_rate, original_text, normalized_text, speaker_id, chapter_id, utterance_id)``
+        """
+        fileid = self._walker[n]
+        return load_libritts_item(
+            fileid,
+            self._path,
+            self._ext_audio,
+            self._ext_original_txt,
+            self._ext_normalized_txt,
+        )
+
+
+def get_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "root", metavar="DIR", help="root directory containing wav files to index"
+    )
+    parser.add_argument(
+        "--dest", default=".", type=str, metavar="DIR", help="output directory"
+    )
+    parser.add_argument(
+        "--split", required=True, type=str, help="dataset splits"
+    )
+    parser.add_argument(
+        "--wav-root", default=None, type=str, metavar="DIR", help="saved waveform root directory for tsv"
+    )
+    parser.add_argument(
+        "--spkemb-npy-dir", required=True, type=str, help="speaker embedding directory"
+    )
+    return parser
+
+def main(args):
+    dest_dir = args.dest
+    wav_root = args.wav_root
+    if not os.path.exists(dest_dir):
+        os.makedirs(dest_dir)
+
+    dataset = LIBRITTS_16K(os.path.dirname(args.root), url=args.split, folder_in_archive=os.path.basename(args.root))
+    tsv_f = open(os.path.join(dest_dir, f"{args.split}.tsv"), "w")
+    txt_f = open(os.path.join(dest_dir, f"{args.split}.txt"), "w")
+    # The first line of the tsv is the root that all relative paths below are resolved against.
+    print(wav_root, file=tsv_f)
+
+    for n_frames, sr, ori_text, norm_text, spk_id, chap_id, utt_id in tqdm(dataset, desc="tsv/txt/wav"):
+        assert sr == 16000, f"sampling rate {sr} != 16000"
+        utt_file = os.path.join(args.split, f"{spk_id}", f"{chap_id}", f"{utt_id}.wav")
+        spk_file = os.path.join(args.spkemb_npy_dir, f"{spk_id}-{chap_id}-{utt_id}.npy")
+        assert os.path.exists(os.path.join(wav_root, utt_file))
+        assert os.path.exists(os.path.join(wav_root, spk_file))
+
+        print(f"{utt_file}\t{n_frames}\t{spk_file}", file=tsv_f)
+        print(norm_text, file=txt_f)
+
+    tsv_f.close()
+    txt_f.close()
+
+
+if __name__ == "__main__":
+    parser = get_parser()
+    args = parser.parse_args()
+    main(args)
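The script above writes two parallel files per split: a `.tsv` whose first line is the waveform root and whose remaining lines hold `utt_file`, `n_frames`, and `spk_file` separated by tabs, and a `.txt` with one normalized transcript per utterance. A minimal sketch of reading the `.tsv` back, assuming exactly that layout; `read_manifest`, `ManifestEntry`, and the sample paths are illustrative and not part of the repository:

```python
import io
from typing import List, NamedTuple, TextIO, Tuple


class ManifestEntry(NamedTuple):
    wav_path: str     # waveform path, relative to the root on the first line
    n_frames: int     # number of audio samples in the waveform
    spkemb_path: str  # speaker-embedding .npy path, relative to the same root


def read_manifest(f: TextIO) -> Tuple[str, List[ManifestEntry]]:
    """Read a manifest: a root line followed by tab-separated entries."""
    root = f.readline().rstrip("\n")
    entries = []
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue
        wav_path, n_frames, spkemb_path = line.split("\t")
        entries.append(ManifestEntry(wav_path, int(n_frames), spkemb_path))
    return root, entries


# Made-up sample content in the layout the script writes.
sample = io.StringIO(
    "/data/LibriTTS_16k\n"
    "dev-clean/84/121123/84_121123_000007_000001.wav\t25600\t"
    "spkrec-xvect/84-121123-84_121123_000007_000001.npy\n"
)
root, entries = read_manifest(sample)
# At 16 kHz, n_frames / 16000 is the utterance duration in seconds.
print(root, entries[0].n_frames / 16000)
```

Because the script asserts a 16 kHz sampling rate, dividing `n_frames` by 16000 gives the duration of each utterance.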
manifest/utils/make_tsv_txt.sh
ADDED
@@ -0,0 +1,13 @@
+#!/bin/bash
+# bash utils/make_tsv_txt.sh /mnt/bn/wangrui2022/wangrui2022/libritts/LibriTTS_16k /opt/tiger/libritts_finetuning_meta /opt/tiger/LibriTTS_16k
+root=$1
+dest=$2
+wav_root=$3
+spkemb_split=$4
+# Fall back to the x-vector embedding directory when no fourth argument is given.
+if [ -z "${spkemb_split}" ]; then
+    spkemb_split=spkrec-xvect
+fi
+for split in dev-clean test-clean train-clean-100 train-clean-360; do
+    echo "making ${split}.tsv and ${split}.txt ..."
+    python utils/libritts_manifest.py "${root}" --dest "${dest}" --split "${split}" --wav-root "${wav_root}" --spkemb-npy-dir "${spkemb_split}"
+done
manifest/utils/prep_libritts_spkemb.py
ADDED
@@ -0,0 +1,72 @@
+import os
+import glob
+import numpy
+import argparse
+import torchaudio
+from speechbrain.pretrained import EncoderClassifier
+import torch
+from tqdm import tqdm
+import torch.nn.functional as F
+import torchaudio.transforms as T
+
+# Embedding size produced by each supported SpeechBrain speaker model.
+spk_model = {
+    "speechbrain/spkrec-xvect-voxceleb": 512,
+    "speechbrain/spkrec-ecapa-voxceleb": 192,
+}
+
+def f2embed(wav_file, classifier, size_embed, resampler=None):
+    signal, fs = torchaudio.load(wav_file)
+    # Original LibriTTS audio is 24 kHz; resample to 16 kHz when a resampler is given.
+    if fs != 16000 and resampler is not None:
+        assert fs == 24000, fs
+        signal = resampler(signal)
+        fs = 16000
+    assert fs == 16000, fs
+    with torch.no_grad():
+        embeddings = classifier.encode_batch(signal)
+        # Scale each embedding to unit L2 norm.
+        embeddings = F.normalize(embeddings, dim=2)
+        embeddings = embeddings.squeeze().cpu().numpy()
+    assert embeddings.shape[0] == size_embed, embeddings.shape[0]
+    return embeddings
+
+def process(args):
+    wavlst = []
+    for split in args.splits.split(","):
+        wav_dir = os.path.join(args.libritts_root, split)
+        wavlst_split = glob.glob(os.path.join(wav_dir, "*", "*", "*.wav"))
+        print(f"{split} {len(wavlst_split)} utterances.")
+        wavlst.extend(wavlst_split)
+    spkemb_root = args.output_root
+    if not os.path.exists(spkemb_root):
+        print(f"Create speaker embedding directory: {spkemb_root}")
+        os.makedirs(spkemb_root)
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    classifier = EncoderClassifier.from_hparams(source=args.speaker_embed, run_opts={"device": device}, savedir='/tmp')
+    size_embed = spk_model[args.speaker_embed]
+    resampler = T.Resample(24000, 16000)
+    for utt_i in tqdm(wavlst, total=len(wavlst), desc="Extract"):
+        # "<speaker>-<chapter>-<file name without .wav>"
+        utt_id = "-".join(utt_i.split("/")[-3:]).replace(".wav", "")
+        utt_emb = f2embed(utt_i, classifier, size_embed, resampler)
+        numpy.save(os.path.join(spkemb_root, f"{utt_id}.npy"), utt_emb)
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--libritts-root", "-i", required=True, type=str, help="LibriTTS root directory.")
+    parser.add_argument("--output-root", "-o", required=True, type=str, help="Output directory.")
+    parser.add_argument("--speaker-embed", "-s", type=str, required=True, choices=["speechbrain/spkrec-xvect-voxceleb", "speechbrain/spkrec-ecapa-voxceleb"],
+                        help="Pretrained model for extracting speaker embedding.")
+    parser.add_argument("--splits", default="train-clean-100,train-clean-360,dev-clean,test-clean", type=str,
+                        help="Train, dev, and test splits separated by commas.")
+    args = parser.parse_args()
+    print(f"Loading utterances from {args.libritts_root}/{args.splits}, "
+        + f"Save speaker embedding 'npy' to {args.output_root}, "
+        + f"Using speaker model {args.speaker_embed} with {spk_model[args.speaker_embed]} size.")
+    process(args)
+
+if __name__ == "__main__":
+    """
+    python examples/text_to_speech/prep_libritts_spkemb.py \
+        -i /mnt/default/v-junyiao/dataset/Original/LibriTTS \
+        -o /mnt/default/v-junyiao/dataset/Original/LibriTTS/spkrec-ecapa \
+        -s speechbrain/spkrec-ecapa-voxceleb
+    """
+    main()
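The extraction loop above names each embedding after the last three path components of the waveform (speaker, chapter, file name) joined with hyphens, and `F.normalize(..., dim=2)` scales each embedding to unit L2 norm. A dependency-free sketch of both steps, assuming POSIX-style paths; `spkemb_name` and `l2_normalize` are hypothetical helpers, and the real script works on torch tensors rather than Python lists:

```python
import math
from typing import List


def spkemb_name(wav_path: str) -> str:
    """Derive the embedding file stem from a LibriTTS waveform path."""
    name = "-".join(wav_path.split("/")[-3:])
    # Drop the ".wav" suffix (the script uses str.replace, which has the
    # same effect on LibriTTS file names).
    return name[: -len(".wav")] if name.endswith(".wav") else name


def l2_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 norm, guarding against a zero norm."""
    norm = max(math.sqrt(sum(x * x for x in vec)), eps)
    return [x / norm for x in vec]


# The path below is a made-up example in the LibriTTS directory layout.
stem = spkemb_name("/data/LibriTTS/dev-clean/84/121123/84_121123_000007_000001.wav")
unit = l2_normalize([3.0, 4.0])
print(stem, unit)
```

The resulting stem matches the `{spk_id}-{chap_id}-{utt_id}.npy` pattern that `libritts_manifest.py` expects when it checks that each embedding file exists.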
manifest/utils/resample_libritts.py
ADDED
@@ -0,0 +1,33 @@
+from pathlib import Path
+from shutil import copyfile
+import soundfile as sf
+import librosa
+
+#LibriTTS
+# 1.6G    /root/data/libritts/LibriTTS/dev-clean
+# 1.5G    /root/data/libritts/LibriTTS/test-clean
+# 9.1G    /root/data/libritts/LibriTTS/train-clean-100
+# 33G     /root/data/libritts/LibriTTS/train-clean-360
+# 44G     /root/data/libritts/LibriTTS
+
+#LibriTTS_16k
+
+# The pattern "**/*" matches every file and directory recursively;
+# files ending in "wav" are resampled and all other files are copied unchanged.
+dest_dir = Path("/root/data/libritts/LibriTTS_16k")
+dest_dir.mkdir(exist_ok=True)
+for file in Path("/root/data/libritts/LibriTTS").glob("**/*"):
+    if not file.is_file():  # Skip directories
+        continue
+
+    file = str(file)
+    new_path = Path(file.replace('LibriTTS', 'LibriTTS_16k'))
+    new_path.parent.mkdir(parents=True, exist_ok=True)
+    if file.endswith('wav'):
+        audio, fs = sf.read(file)
+        # Keyword arguments are required by recent librosa releases.
+        x = librosa.resample(audio, orig_sr=fs, target_sr=16000)
+        sf.write(str(new_path), x, 16000)
+        # librosa.output.write_wav(str(new_path), x, 16000)
+    else:
+        copyfile(file, new_path)
manifest/utils/spec2wav.sh
ADDED
@@ -0,0 +1,8 @@
+feats_root=$1
+# Write generated waveforms next to the feature directory.
+wav_root=$(dirname "${feats_root}")/gen_wav
+
+parallel-wavegan-decode \
+    --checkpoint train_nodev_clean_libritts_hifigan.v1/hifigan-libritts-1930000steps.pkl \
+    --dumpdir "${feats_root}" \
+    --outdir "${wav_root}" \
+    --normalize-before
pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/config.yml
ADDED
@@ -0,0 +1,192 @@
+allow_cache: false
+batch_max_steps: 8192
+batch_size: 16
+config: conf/hifigan.v1.yaml
+dev_dumpdir: dump/dev_clean/norm
+dev_feats_scp: null
+dev_segments: null
+dev_wav_scp: null
+discriminator_adv_loss_params:
+  average_by_discriminators: false
+discriminator_grad_norm: -1
+discriminator_optimizer_params:
+  betas:
+  - 0.5
+  - 0.9
+  lr: 0.0002
+  weight_decay: 0.0
+discriminator_optimizer_type: Adam
+discriminator_params:
+  follow_official_norm: true
+  period_discriminator_params:
+    bias: true
+    channels: 32
+    downsample_scales:
+    - 3
+    - 3
+    - 3
+    - 3
+    - 1
+    in_channels: 1
+    kernel_sizes:
+    - 5
+    - 3
+    max_downsample_channels: 1024
+    nonlinear_activation: LeakyReLU
+    nonlinear_activation_params:
+      negative_slope: 0.1
+    out_channels: 1
+    use_spectral_norm: false
+    use_weight_norm: true
+  periods:
+  - 2
+  - 3
+  - 5
+  - 7
+  - 11
+  scale_discriminator_params:
+    bias: true
+    channels: 128
+    downsample_scales:
+    - 4
+    - 4
+    - 4
+    - 4
+    - 1
+    in_channels: 1
+    kernel_sizes:
+    - 15
+    - 41
+    - 5
+    - 3
+    max_downsample_channels: 1024
+    max_groups: 16
+    nonlinear_activation: LeakyReLU
+    nonlinear_activation_params:
+      negative_slope: 0.1
+    out_channels: 1
+  scale_downsample_pooling: AvgPool1d
+  scale_downsample_pooling_params:
+    kernel_size: 4
+    padding: 2
+    stride: 2
+  scales: 3
+discriminator_scheduler_params:
+  gamma: 0.5
+  milestones:
+  - 200000
+  - 400000
+  - 600000
+  - 800000
+discriminator_scheduler_type: MultiStepLR
+discriminator_train_start_steps: 0
+discriminator_type: HiFiGANMultiScaleMultiPeriodDiscriminator
+distributed: true
+eval_interval_steps: 1000
+feat_match_loss_params:
+  average_by_discriminators: false
+  average_by_layers: false
+  include_final_outputs: false
+fft_size: 1024
+fmax: 7600
+fmin: 80
+format: npy
+generator_adv_loss_params:
+  average_by_discriminators: false
+generator_grad_norm: -1
+generator_optimizer_params:
+  betas:
+  - 0.5
+  - 0.9
+  lr: 0.0002
+  weight_decay: 0.0
+generator_optimizer_type: Adam
+generator_params:
+  bias: true
+  channels: 512
+  in_channels: 80
+  kernel_size: 7
+  nonlinear_activation: LeakyReLU
+  nonlinear_activation_params:
+    negative_slope: 0.1
+  out_channels: 1
+  resblock_dilations:
+  - - 1
+    - 3
+    - 5
+  - - 1
+    - 3
+    - 5
+  - - 1
+    - 3
+    - 5
+  resblock_kernel_sizes:
+  - 3
+  - 7
+  - 11
+  upsample_kernal_sizes:
+  - 8
+  - 8
+  - 8
+  - 8
+  upsample_scales:
+  - 4
+  - 4
+  - 4
+  - 4
+  use_additional_convs: true
+  use_weight_norm: true
+generator_scheduler_params:
+  gamma: 0.5
+  milestones:
+  - 200000
+  - 400000
+  - 600000
+  - 800000
+generator_scheduler_type: MultiStepLR
+generator_train_start_steps: 1
+generator_type: HiFiGANGenerator
+global_gain_scale: 1.0
+hop_size: 256
+lambda_adv: 1.0
+lambda_aux: 45.0
+lambda_feat_match: 2.0
+log_interval_steps: 100
+mel_loss_params:
+  fft_size: 1024
+  fmax: 7600
+  fmin: 80
+  fs: 16000
+  hop_size: 256
+  log_base: null
+  num_mels: 80
+  win_length: 1024
+  window: hann
+num_mels: 80
+num_save_intermediate_results: 4
+num_workers: 4
+outdir: exp/train_nodev_clean_libritts_hifigan.v1
+pin_memory: true
+pretrain: ''
+rank: 1
+remove_short_samples: false
+resume: /mnt/default/v-junyiao/libritts_vocoder2/train_nodev_clean_libritts_hifigan.v1/checkpoint-50000steps.pkl
+sampling_rate: 16000
+save_interval_steps: 10000
+train_dumpdir: dump/train_nodev_clean/norm
+train_feats_scp: null
+train_max_steps: 2500000
+train_segments: null
+train_wav_scp: null
+trim_frame_size: 1024
+trim_hop_size: 256
+trim_silence: false
+trim_threshold_in_db: 20
+use_feat_match_loss: true
+use_mel_loss: true
+use_stft_loss: false
+verbose: 1
+version: 0.5.1
+win_length: 1024
+window: hann
+world_size: 2
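Several values in this config are tied together. The generator turns each mel frame into `hop_size` audio samples, so the product of `upsample_scales` should equal `hop_size`, and `batch_max_steps` is expected to be a multiple of `hop_size` so that a training segment covers a whole number of frames. A small sketch that checks these relations with values copied from the config; the check itself is illustrative and not code from ParallelWaveGAN:

```python
import math

# Values copied from config.yml in this commit.
sampling_rate = 16000
hop_size = 256
batch_max_steps = 8192
upsample_scales = [4, 4, 4, 4]

# Total upsampling factor of the generator (samples produced per mel frame).
total_upsampling = math.prod(upsample_scales)
# Mel frames per second of audio.
frames_per_second = sampling_rate / hop_size
# Mel frames and seconds covered by one training segment.
frames_per_segment = batch_max_steps // hop_size
segment_seconds = batch_max_steps / sampling_rate

print(total_upsampling, frames_per_second, frames_per_segment, segment_seconds)
```

With these numbers the generator upsamples by a factor of 256, the features run at 62.5 frames per second, and each training segment spans 32 frames, or about half a second of audio.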
pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/hifigan-libritts-1930000steps.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0b119deddc85a78061bed39aaa5c2f9a8093e2701c46d9a0f9a25b2ac52457e4
+size 333645593
pretrained_vocoder/train_nodev_clean_libritts_hifigan.v1/stats.npy
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c1a72747d543205699e741ae3092d83b233b30e4974fe1991d553d11e895c535
+size 768