gooorillax committed on
Commit
12da045
·
1 Parent(s): 0e94272

add example codes in README, refine README, add DSTK to whitelist

README.md CHANGED
@@ -32,7 +32,7 @@ As shown in the figure below, the 3 module could form a pipeline for TTS task.
32
  As shown in the figure below, the tokenizer and detokenizer can also form a pipeline for the speech reconstruction task.
33
  <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
34
 
35
- These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset:
36
  <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
37
  <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
38
 
@@ -46,12 +46,163 @@ We also evaluated the ASR performance of our semantic tokenizer using a LLM as b
46
 
47
  ## Installation
48
 
 
49
  ### Create a separate environment if needed
50
 
51
  ```bash
52
  # Create a conda env with python_version>=3.10 (you could also use virtualenv)
53
  conda create -n dstk python=3.10
54
  conda activate dstk
55
  ```
56
 
57
  ## More tools to be released:
 
32
  As shown in the figure below, the tokenizer and detokenizer can also form a pipeline for the speech reconstruction task.
33
  <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
34
 
35
+ These pipelines achieve top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset, with fewer parameters and less supervised training data:
36
  <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
37
  <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
38
 
 
46
 
47
  ## Installation
48
 
49
+ ### Hardware: Ascend 910B with CANN 8.1 RC1 or GPU
50
  ### Create a separate environment if needed
51
 
52
  ```bash
53
  # Create a conda env with python_version>=3.10 (you could also use virtualenv)
54
  conda create -n dstk python=3.10
55
  conda activate dstk
56
+
57
+ # run install_requirements.sh to set up the environment for DSTK inference on Ascend 910B
58
+ # for GPUs, just remove torch-npu==2.5.1 from requirements_npu.txt
59
+ sh install_requirements.sh
60
+
61
+ # patch for G2P
62
+ # modify the first line in thirdparty/G2P/patch_for_deps.sh:
63
+ # SITE_PATH=/path/to/your/own/site-packages
64
+ # run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
65
+ sh thirdparty/G2P/patch_for_deps.sh
66
+ ```
67
+
68
+ ### Download the vocos vocoder from [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)
69
+
70
+ ## Usage:
71
+ ### Pipelines
72
+
73
+
74
+ ```python
75
+ import sys
76
+ import soundfile as sf
77
+
78
+ dstk_path = "/path/to/DSTK"
79
+ sys.path.append(dstk_path)
80
+
81
+ from reconstuction_example import ReconstructionPipeline
82
+ from tts_example import TTSPipeline
83
+
84
+ ref_wav_path = dstk_path + "/00004557-00000030.wav"
85
+ input_wav_path = dstk_path + "/004892.wav"
86
+ vocoder_path = "/path/to/vocos-mel-24khz"
87
+
88
+ reconstructor = ReconstructionPipeline(
89
+ detok_vocoder=vocoder_path,
90
+ )
91
+
92
+ tts = TTSPipeline(
93
+ detok_vocoder=vocoder_path,
94
+ max_seg_len=30,
95
+ )
96
+
97
+ # for non-parallel speech reconstruction
98
+ generated_wave, target_sample_rate = reconstructor.reconstruct(
99
+ ref_wav_path, input_wav_path
100
+ )
101
+
102
+ with open("./recon.wav", "wb") as f:
103
+ sf.write(f.name, generated_wave, target_sample_rate)
104
+ print(f"write output to: {f.name}")
105
+
106
+ # for tts
107
+ ref_wav_path = input_wav_path
108
+ generated_wave, target_sample_rate = tts.synthesize(
109
+ ref_wav_path,
110
+ "荷花未全谢,又到中秋节。家家户户把月饼切,庆中秋。美酒多欢乐,整杯盘,猜拳行令,同赏月。",
111
+ )
112
+ with open("./tts.wav", "wb") as f:
113
+ sf.write(f.name, generated_wave, target_sample_rate)
114
+ print(f"write output to: {f.name}")
115
+
116
+ print("Finished")
117
+ ```
118
+
119
+ ### Tokenization
120
+ ```python
121
+ import sys
122
+ import librosa
123
+
124
+ dstk_path = "/path/to/DSTK"
125
+ sys.path.append(dstk_path)
126
+
127
+ input_wav_path = dstk_path + "/004892.wav"
128
+
129
+ from semantic_tokenizer.f40ms.simple_tokenizer_infer import SpeechTokenizer
130
+
131
+ tokenizer = SpeechTokenizer()
132
+
133
+ raw_wav, sr = librosa.load(input_wav_path, sr=16000)
134
+ token_list, token_info_list = tokenizer.extract([raw_wav])  # pass in the raw waveform data
135
+ for token_info in token_info_list:
136
+ print(token_info["unit_sequence"] + "\n")
137
+ print(token_info["reduced_unit_sequence"] + "\n")
138
+ ```
139
+
140
+ ### Text2Token
141
+ ```python
142
+ import sys
143
+ import librosa
144
+
145
+ dstk_path = "/path/to/DSTK"
146
+ sys.path.append(dstk_path)
147
+
148
+ from text2token.simple_infer import Text2TokenGenerator
149
+
150
+ input_text = "从离散语音token重建语音波形"
151
+ MAX_SEG_LEN = 30
152
+
153
+ t2u = Text2TokenGenerator()
154
+
155
+ phones = t2u.text2phone(input_text.strip())
156
+ print("phonemes of input text: %s are [%s]" % (input_text, phones))
157
+
158
+ speech_tokens_info = t2u.generate_for_long_input_text(
159
+ [phones], max_segment_len=MAX_SEG_LEN
160
+ )
161
+
162
+ for info in speech_tokens_info[0]:
163
+ print(" ".join(info) + "\n")
164
+ ```
165
+ ### Detokenization
166
+ ```python
167
+ import sys
168
+ import soundfile as sf
169
+
170
+ dstk_path = "/path/to/DSTK"
171
+ sys.path.append(dstk_path)
172
+
173
+ from semantic_detokenizer.chunk_infer import SpeechDetokenizer
174
+
175
+ # speech tokens for the sentence 从离散语音token重建语音波形 ("reconstruct the speech waveform from discrete speech tokens")
176
+ input_tokens = "3953 3890 3489 456 2693 3239 3692 3810 3874 3882 2749 548 3202 4012 3490 3939 3988 411 722 826 2812 3883 3874 3810 3983 4086 3946 3747 3469 2537 3689 3434 1816 1242 2415 3942 3363 3865 2841 1700 1652 3241 3362 3363 3874 3882 2792 933 2253 2799 3692 3746 3882 2809 1001 2449 1016 3762 3882 3874 3810 3809 3983 4086 4018 3747 3461 2537 3624 3882 3382 581 1837 2413 3435 4005 2003 2890 3884 3690 3746 3938 3874 3873 3856"
177
+ vocoder_path = "/path/to/vocos-mel-24khz"
178
+ ref_wav_path = dstk_path + "/004892.wav"
179
+ # output of tokenizer given ref_wav as input
180
+ ref_tokens = "3936 3872 3809 3873 3817 3639 2591 539 1021 3641 3890 4069 2002 3537 2303 3773 3827 3875 3969 4072 2425 97 2537 3633 3690 3865 3920 3069 3582 3883 3818 3997 4031 4029 3946 3874 3733 3727 3214 506 3892 3787 3457 3552 3490 4014 991 1991 3885 3947 4069 1488 1016 3258 3710 52 2362 3961 2680 1569 1851 3897 3825 3752 3808 3800 3873 3808 3792"
181
+
182
+ token_chunk_len = 75
183
+ chunk_cond_proportion = 0.3
184
+ chunk_look_ahead = 10
185
+ max_ref_duration = 4.5
186
+ ref_audio_cut_from_head = False
187
+
188
+ detoker = SpeechDetokenizer(
189
+ vocoder_path=vocoder_path,
190
+ )
191
+
192
+ generated_wave, target_sample_rate = detoker.chunk_generate(
193
+ ref_wav_path,
194
+ ref_tokens.split(),
195
+ input_tokens.split(),
196
+ token_chunk_len,
197
+ chunk_cond_proportion,
198
+ chunk_look_ahead,
199
+ max_ref_duration,
200
+ ref_audio_cut_from_head,
201
+ )
202
+
203
+ with open("./detok.wav", "wb") as f:
204
+ sf.write(f.name, generated_wave, target_sample_rate)
205
+ print(f"write output to: {f.name}")
206
  ```
207
 
208
  ## More tools to be released:
README_CN.md CHANGED
@@ -23,7 +23,7 @@ V1.0
23
  串联使用tokenizer和detokenizer实现语音重建的功能
24
  <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
25
 
26
- 上述pipeline在seed-tts-eval数据集的TTS和语音重建任务上上达到了一流的水平:
27
  <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
28
  <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
29
 
@@ -38,12 +38,163 @@ V1.0
38
 
39
  ## Installation
40
 
 
41
  ### Create a separate environment if needed
42
 
43
  ```bash
44
  # Create a conda env with python_version>=3.10 (you could also use virtualenv)
45
  conda create -n dstk python=3.10
46
  conda activate dstk
47
  ```
48
 
49
 
 
23
  串联使用tokenizer和detokenizer实现语音重建的功能
24
  <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
25
 
26
+ 上述pipeline在seed-tts-eval数据集的TTS和语音重建任务上达到了一流的水平,且模型的参数量和使用的监督数据都远小于对照基线模型:
27
  <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
28
  <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
29
 
 
38
 
39
  ## Installation
40
 
41
+ ### Hardware: Ascend 910B with CANN 8.1 RC1 or GPU
42
  ### Create a separate environment if needed
43
 
44
  ```bash
45
  # Create a conda env with python_version>=3.10 (you could also use virtualenv)
46
  conda create -n dstk python=3.10
47
  conda activate dstk
48
+
49
+ # run install_requirements.sh to set up the environment for DSTK inference on Ascend 910B
50
+ # for GPUs, just remove torch-npu==2.5.1 from requirements_npu.txt
51
+ sh install_requirements.sh
52
+
53
+ # patch for G2P
54
+ # modify the first line in thirdparty/G2P/patch_for_deps.sh:
55
+ # SITE_PATH=/path/to/your/own/site-packages
56
+ # run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
57
+ sh thirdparty/G2P/patch_for_deps.sh
58
+ ```
59
+
60
+ ### Download the vocos vocoder from [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)
61
+
62
+ ## Usage:
63
+ ### Pipelines
64
+
65
+
66
+ ```python
67
+ import sys
68
+ import soundfile as sf
69
+
70
+ dstk_path = "/path/to/DSTK"
71
+ sys.path.append(dstk_path)
72
+
73
+ from reconstuction_example import ReconstructionPipeline
74
+ from tts_example import TTSPipeline
75
+
76
+ ref_wav_path = dstk_path + "/00004557-00000030.wav"
77
+ input_wav_path = dstk_path + "/004892.wav"
78
+ vocoder_path = "/path/to/vocos-mel-24khz"
79
+
80
+ reconstructor = ReconstructionPipeline(
81
+ detok_vocoder=vocoder_path,
82
+ )
83
+
84
+ tts = TTSPipeline(
85
+ detok_vocoder=vocoder_path,
86
+ max_seg_len=30,
87
+ )
88
+
89
+ # for non-parallel speech reconstruction
90
+ generated_wave, target_sample_rate = reconstructor.reconstruct(
91
+ ref_wav_path, input_wav_path
92
+ )
93
+
94
+ with open("./recon.wav", "wb") as f:
95
+ sf.write(f.name, generated_wave, target_sample_rate)
96
+ print(f"write output to: {f.name}")
97
+
98
+ # for tts
99
+ ref_wav_path = input_wav_path
100
+ generated_wave, target_sample_rate = tts.synthesize(
101
+ ref_wav_path,
102
+ "荷花未全谢,又到中秋节。家家户户把月饼切,庆中秋。美酒多欢乐,整杯盘,猜拳行令,同赏月。",
103
+ )
104
+ with open("./tts.wav", "wb") as f:
105
+ sf.write(f.name, generated_wave, target_sample_rate)
106
+ print(f"write output to: {f.name}")
107
+
108
+ print("Finished")
109
+ ```
110
+
111
+ ### Tokenization
112
+ ```python
113
+ import sys
114
+ import librosa
115
+
116
+ dstk_path = "/path/to/DSTK"
117
+ sys.path.append(dstk_path)
118
+
119
+ input_wav_path = dstk_path + "/004892.wav"
120
+
121
+ from semantic_tokenizer.f40ms.simple_tokenizer_infer import SpeechTokenizer
122
+
123
+ tokenizer = SpeechTokenizer()
124
+
125
+ raw_wav, sr = librosa.load(input_wav_path, sr=16000)
126
+ token_list, token_info_list = tokenizer.extract([raw_wav])  # pass in the raw waveform data
127
+ for token_info in token_info_list:
128
+ print(token_info["unit_sequence"] + "\n")
129
+ print(token_info["reduced_unit_sequence"] + "\n")
130
+ ```
131
+
132
+ ### Text2Token
133
+ ```python
134
+ import sys
135
+ import librosa
136
+
137
+ dstk_path = "/path/to/DSTK"
138
+ sys.path.append(dstk_path)
139
+
140
+ from text2token.simple_infer import Text2TokenGenerator
141
+
142
+ input_text = "从离散语音token重建语音波形"
143
+ MAX_SEG_LEN = 30
144
+
145
+ t2u = Text2TokenGenerator()
146
+
147
+ phones = t2u.text2phone(input_text.strip())
148
+ print("phonemes of input text: %s are [%s]" % (input_text, phones))
149
+
150
+ speech_tokens_info = t2u.generate_for_long_input_text(
151
+ [phones], max_segment_len=MAX_SEG_LEN
152
+ )
153
+
154
+ for info in speech_tokens_info[0]:
155
+ print(" ".join(info) + "\n")
156
+ ```
157
+ ### Detokenization
158
+ ```python
159
+ import sys
160
+ import soundfile as sf
161
+
162
+ dstk_path = "/path/to/DSTK"
163
+ sys.path.append(dstk_path)
164
+
165
+ from semantic_detokenizer.chunk_infer import SpeechDetokenizer
166
+
167
+ # speech tokens for the sentence 从离散语音token重建语音波形 ("reconstruct the speech waveform from discrete speech tokens")
168
+ input_tokens = "3953 3890 3489 456 2693 3239 3692 3810 3874 3882 2749 548 3202 4012 3490 3939 3988 411 722 826 2812 3883 3874 3810 3983 4086 3946 3747 3469 2537 3689 3434 1816 1242 2415 3942 3363 3865 2841 1700 1652 3241 3362 3363 3874 3882 2792 933 2253 2799 3692 3746 3882 2809 1001 2449 1016 3762 3882 3874 3810 3809 3983 4086 4018 3747 3461 2537 3624 3882 3382 581 1837 2413 3435 4005 2003 2890 3884 3690 3746 3938 3874 3873 3856"
169
+ vocoder_path = "/path/to/vocos-mel-24khz"
170
+ ref_wav_path = dstk_path + "/004892.wav"
171
+ # output of tokenizer given ref_wav as input
172
+ ref_tokens = "3936 3872 3809 3873 3817 3639 2591 539 1021 3641 3890 4069 2002 3537 2303 3773 3827 3875 3969 4072 2425 97 2537 3633 3690 3865 3920 3069 3582 3883 3818 3997 4031 4029 3946 3874 3733 3727 3214 506 3892 3787 3457 3552 3490 4014 991 1991 3885 3947 4069 1488 1016 3258 3710 52 2362 3961 2680 1569 1851 3897 3825 3752 3808 3800 3873 3808 3792"
173
+
174
+ token_chunk_len = 75
175
+ chunk_cond_proportion = 0.3
176
+ chunk_look_ahead = 10
177
+ max_ref_duration = 4.5
178
+ ref_audio_cut_from_head = False
179
+
180
+ detoker = SpeechDetokenizer(
181
+ vocoder_path=vocoder_path,
182
+ )
183
+
184
+ generated_wave, target_sample_rate = detoker.chunk_generate(
185
+ ref_wav_path,
186
+ ref_tokens.split(),
187
+ input_tokens.split(),
188
+ token_chunk_len,
189
+ chunk_cond_proportion,
190
+ chunk_look_ahead,
191
+ max_ref_duration,
192
+ ref_audio_cut_from_head,
193
+ )
194
+
195
+ with open("./detok.wav", "wb") as f:
196
+ sf.write(f.name, generated_wave, target_sample_rate)
197
+ print(f"write output to: {f.name}")
198
  ```
199
 
200
 
semantic_detokenizer/README.md CHANGED
@@ -2,7 +2,7 @@
2
  #### Our detokenizer is developed based on the [F5-TTS](https://github.com/SWivid/F5-TTS) framework and features two specific improvements.
3
 
4
  1. The DiT module has been replaced by a DiT variant with cross-attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
5
- <p align="center"><img src="../figs/CADiT.jpg" height="600"></p>
6
 
7
  2. A chunk-based streaming inference algorithm is developed; it allows the model to generate speech of arbitrary length.
8
  <p align="center"><img src="../figs/F5-streaming.jpg" width="1200"></p>
 
2
  #### Our detokenizer is developed based on the [F5-TTS](https://github.com/SWivid/F5-TTS) framework and features two specific improvements.
3
 
4
  1. The DiT module has been replaced by a DiT variant with cross-attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
5
+ <p align="center"><img src="../figs/CADiT.jpg" height="500"></p>
6
 
7
  2. A chunk-based streaming inference algorithm is developed; it allows the model to generate speech of arbitrary length.
8
  <p align="center"><img src="../figs/F5-streaming.jpg" width="1200"></p>
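The chunking behind this streaming algorithm can be sketched roughly as follows. This is an illustrative reimplementation, not the repository's actual `chunk_generate`; the parameter names `token_chunk_len` and `chunk_look_ahead` are borrowed from the Detokenization example in this commit, and the exact boundary policy is an assumption.

```python
def plan_chunks(num_tokens, token_chunk_len=75, chunk_look_ahead=10):
    """Plan (emit_start, emit_end, context_end) spans for streaming
    detokenization (illustrative). Each step decodes token_chunk_len new
    tokens while peeking at up to chunk_look_ahead future tokens to keep
    chunk boundaries smooth; only [emit_start, emit_end) is emitted."""
    chunks = []
    start = 0
    while start < num_tokens:
        emit_end = min(start + token_chunk_len, num_tokens)
        context_end = min(emit_end + chunk_look_ahead, num_tokens)
        chunks.append((start, emit_end, context_end))
        start = emit_end
    return chunks

print(plan_chunks(160))  # three steps for a 160-token sequence
```

Because each step needs only a bounded window of tokens, total output length is unbounded, which is what enables speech of arbitrary length.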
semantic_detokenizer/chunk_infer.py CHANGED
@@ -19,6 +19,7 @@ import os
19
  from datetime import datetime
20
  from importlib.resources import files
21
  from pathlib import Path
 
22
  import tqdm
23
 
24
  import soundfile as sf
@@ -33,6 +34,8 @@ from f5_tts.infer.utils_infer import (
33
  load_vocoder,
34
  remove_silence_for_generated_wav,
35
  )
 
 
36
  from utils_infer import (
37
  mel_spec_type,
38
  target_rms,
 
19
  from datetime import datetime
20
  from importlib.resources import files
21
  from pathlib import Path
22
+ import sys
23
  import tqdm
24
 
25
  import soundfile as sf
 
34
  load_vocoder,
35
  remove_silence_for_generated_wav,
36
  )
37
+
38
+ sys.path.append(str(Path(__file__).parent))
39
  from utils_infer import (
40
  mel_spec_type,
41
  target_rms,
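The `sys.path.append` added in this hunk is the usual way to make a sibling module (`utils_infer` here) importable regardless of the caller's working directory; for reference, an idempotent form of the same pattern looks like this:

```python
import sys
from pathlib import Path

# Resolve the directory containing this script and add it to the module
# search path only once, so sibling modules import cleanly from any CWD.
script_dir = str(Path(__file__).resolve().parent)
if script_dir not in sys.path:
    sys.path.append(script_dir)
```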
semantic_tokenizer/f40ms/README.md CHANGED
@@ -1,10 +1,11 @@
1
  ## Speech Semantic Tokenizer
2
- As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1.
3
  <p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
4
 
5
 
6
  To run this semantic tokenizer alone, the required packages should be installed.
7
  ```bash
8
- # install requirements for this semantic tokenizer
 
9
  pip install -r requirements_npu.txt
10
  ```
 
1
  ## Speech Semantic Tokenizer
2
+ As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq).
3
  <p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
4
 
5
 
6
  To run this semantic tokenizer alone, the required packages should be installed.
7
  ```bash
8
+ # install requirements for this semantic tokenizer on Ascend 910B
9
+ # for GPUs, just remove torch-npu==2.5.1
10
  pip install -r requirements_npu.txt
11
  ```
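The Tokenization example earlier in this commit prints both a `unit_sequence` and a `reduced_unit_sequence`. For a 40 ms frame-level tokenizer, the reduction is presumably collapsing consecutive repeated units; a minimal sketch of that assumption (not the repository's code):

```python
def reduce_units(unit_sequence):
    """Collapse runs of identical adjacent units (illustrative).
    Frame-level tokenizers emit one unit per frame, so steady sounds
    repeat; deduplicating adjacent repeats shortens the sequence."""
    reduced = []
    for unit in unit_sequence.split():
        if not reduced or reduced[-1] != unit:
            reduced.append(unit)
    return " ".join(reduced)

print(reduce_units("3936 3936 3872 3809 3809 3809 3873"))  # 3936 3872 3809 3873
```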
thirdparty/G2P/patch_for_deps.sh CHANGED
@@ -1,4 +1,4 @@
1
- SITE_PATH=/home/ma-user/anaconda3/envs/token/lib/python3.10/site-packages
2
  # fix bugs for LangSegment 0.2.0
3
  sed -i -r 's/,setLangfilters,getLangfilters//' $SITE_PATH/LangSegment/__init__.py
4
  # patch for pypinyin
 
1
+ SITE_PATH=$HOME/.conda/envs/token/lib/python3.10/site-packages
2
  # fix bugs for LangSegment 0.2.0
3
  sed -i -r 's/,setLangfilters,getLangfilters//' $SITE_PATH/LangSegment/__init__.py
4
  # patch for pypinyin
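To fill in `SITE_PATH` for a different environment, the active interpreter can report its own site-packages directory; a standard-library helper (the exact path layout varies by platform and distribution):

```python
import sysconfig

def site_packages_path():
    """Return the purelib (site-packages) directory of the running
    interpreter, e.g. .../envs/token/lib/python3.10/site-packages."""
    return sysconfig.get_paths()["purelib"]

print(site_packages_path())
```

Running this inside the activated conda env prints the value to paste into the first line of `patch_for_deps.sh`.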
thirdparty/G2P/whitelist/english/new_tts.tsv CHANGED
@@ -4963,3 +4963,4 @@ z. y. zy
4963
  z.y. zy
4964
  z. z. zz
4965
  z.z. zz
 
 
4963
  z.y. zy
4964
  z. z. zz
4965
  z.z. zz
4966
+ dstk d s t k
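The new entry maps the literal `dstk` to the spelled-out reading `d s t k`, so the G2P frontend pronounces the toolkit name letter by letter. A minimal sketch of loading and querying such a two-column TSV (illustrative; the repo's actual frontend logic may differ):

```python
def load_whitelist(lines):
    """Parse two-column TSV lines: written form TAB reading (illustrative)."""
    table = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            table[parts[0].lower()] = parts[1]
    return table

whitelist = load_whitelist(["z.z.\tzz", "dstk\td s t k"])
print(whitelist["dstk"])  # d s t k
```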