gooorillax committed on
Commit
12da045
·
1 Parent(s): 0e94272

add example codes in README, refine README, add DSTK to whitelist

README.md CHANGED
@@ -32,7 +32,7 @@ As shown in the figure below, the 3 module could form a pipeline for TTS task.
32
  As shown in the figure below, the tokenizer and detokenizer can also form a pipeline for the speech reconstruction task.
33
  <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
34
 
35
- These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset:
36
  <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
37
  <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
38
 
@@ -46,12 +46,163 @@ We also evaluated the ASR performance of our semantic tokenizer using a LLM as b
46
 
47
  ## Installation
48
 
 
49
  ### Create a separate environment if needed
50
 
51
  ```bash
52
  # Create a conda env with python_version>=3.10 (you could also use virtualenv)
53
  conda create -n dstk python=3.10
54
  conda activate dstk
55
  ```
56
 
57
  ## More tools to be released:
 
32
  As shown in the figure below, the tokenizer and detokenizer can also form a pipeline for the speech reconstruction task.
33
  <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
34
 
35
+ These pipelines achieve top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset, with fewer parameters and less supervised training data:
36
  <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
37
  <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
38
 
 
46
 
47
  ## Installation
48
 
49
+ ### Hardware: Ascend 910B with CANN 8.1 RC1 or GPU
50
  ### Create a separate environment if needed
51
 
52
  ```bash
53
  # Create a conda env with python_version>=3.10 (you could also use virtualenv)
54
  conda create -n dstk python=3.10
55
  conda activate dstk
56
+
57
+ # run install_requirements.sh to set up the environment for DSTK inference on Ascend 910B
58
+ # for GPUs, just remove torch-npu==2.5.1 from requirements_npu.txt
59
+ sh install_requirements.sh
60
+
61
+ # patch for G2P
62
+ # modify the first line in thirdparty/G2P/patch_for_deps.sh:
63
+ # SITE_PATH=/path/to/your/own/site-packages
64
+ # run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
65
+ sh thirdparty/G2P/patch_for_deps.sh
66
+ ```
67
+
68
+ ### Download the vocos vocoder from [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)
69
+
70
+ ## Usage:
71
+ ### Pipelines
72
+
73
+
74
+ ```python
75
+ import sys
76
+ import soundfile as sf
77
+
78
+ dstk_path = "/path/to/DSTK"
79
+ sys.path.append(dstk_path)
80
+
81
+ from reconstuction_example import ReconstructionPipeline
82
+ from tts_example import TTSPipeline
83
+
84
+ ref_wav_path = dstk_path + "/00004557-00000030.wav"
85
+ input_wav_path = dstk_path + "/004892.wav"
86
+ vocoder_path = "/path/to/vocos-mel-24khz"
87
+
88
+ reconstructor = ReconstructionPipeline(
89
+ detok_vocoder=vocoder_path,
90
+ )
91
+
92
+ tts = TTSPipeline(
93
+ detok_vocoder=vocoder_path,
94
+ max_seg_len=30,
95
+ )
96
+
97
+ # for non-parallel speech reconstruction
98
+ generated_wave, target_sample_rate = reconstructor.reconstruct(
99
+ ref_wav_path, input_wav_path
100
+ )
101
+
102
+ with open("./recon.wav", "wb") as f:
103
+ sf.write(f.name, generated_wave, target_sample_rate)
104
+ print(f"write output to: {f.name}")
105
+
106
+ # for tts
107
+ ref_wav_path = input_wav_path
108
+ generated_wave, target_sample_rate = tts.synthesize(
109
+ ref_wav_path,
110
+ "荷花未全谢,又到中秋节。家家户户把月饼切,庆中秋。美酒多欢乐,整杯盘,猜拳行令,同赏月。",
111
+ )
112
+ with open("./tts.wav", "wb") as f:
113
+ sf.write(f.name, generated_wave, target_sample_rate)
114
+ print(f"write output to: {f.name}")
115
+
116
+ print("Finished")
117
+ ```
118
+
119
+ ### Tokenization
120
+ ```python
121
+ import sys
122
+ import librosa
123
+
124
+ dstk_path = "/path/to/DSTK"
125
+ sys.path.append(dstk_path)
126
+
127
+ input_wav_path = dstk_path + "/004892.wav"
128
+
129
+ from semantic_tokenizer.f40ms.simple_tokenizer_infer import SpeechTokenizer
130
+
131
+ tokenizer = SpeechTokenizer()
132
+
133
+ raw_wav, sr = librosa.load(input_wav_path, sr=16000)
134
+ token_list, token_info_list = tokenizer.extract([raw_wav])  # pass in the raw waveform data
135
+ for token_info in token_info_list:
136
+ print(token_info["unit_sequence"] + "\n")
137
+ print(token_info["reduced_unit_sequence"] + "\n")
138
+ ```
139
+
140
+ ### Text2Token
141
+ ```python
142
+ import sys
143
+ import librosa
144
+
145
+ dstk_path = "/path/to/DSTK"
146
+ sys.path.append(dstk_path)
147
+
148
+ from text2token.simple_infer import Text2TokenGenerator
149
+
150
+ input_text = "从离散语音token重建语音波形"
151
+ MAX_SEG_LEN = 30
152
+
153
+ t2u = Text2TokenGenerator()
154
+
155
+ phones = t2u.text2phone(input_text.strip())
156
+ print("phonemes of input text: %s are [%s]" % (input_text, phones))
157
+
158
+ speech_tokens_info = t2u.generate_for_long_input_text(
159
+ [phones], max_segment_len=MAX_SEG_LEN
160
+ )
161
+
162
+ for info in speech_tokens_info[0]:
163
+ print(" ".join(info) + "\n")
164
+ ```
165
+ ### Detokenization
166
+ ```python
167
+ import sys
168
+ import soundfile as sf
169
+
170
+ dstk_path = "/path/to/DSTK"
171
+ sys.path.append(dstk_path)
172
+
173
+ from semantic_detokenizer.chunk_infer import SpeechDetokenizer
174
+
175
+ # speech tokens for the sentence 从离散语音token重建语音波形 ("reconstruct the speech waveform from discrete speech tokens")
176
+ input_tokens = "3953 3890 3489 456 2693 3239 3692 3810 3874 3882 2749 548 3202 4012 3490 3939 3988 411 722 826 2812 3883 3874 3810 3983 4086 3946 3747 3469 2537 3689 3434 1816 1242 2415 3942 3363 3865 2841 1700 1652 3241 3362 3363 3874 3882 2792 933 2253 2799 3692 3746 3882 2809 1001 2449 1016 3762 3882 3874 3810 3809 3983 4086 4018 3747 3461 2537 3624 3882 3382 581 1837 2413 3435 4005 2003 2890 3884 3690 3746 3938 3874 3873 3856"
177
+ vocoder_path = "/path/to/vocos-mel-24khz"
178
+ ref_wav_path = dstk_path + "/004892.wav"
179
+ # output of tokenizer given ref_wav as input
180
+ ref_tokens = "3936 3872 3809 3873 3817 3639 2591 539 1021 3641 3890 4069 2002 3537 2303 3773 3827 3875 3969 4072 2425 97 2537 3633 3690 3865 3920 3069 3582 3883 3818 3997 4031 4029 3946 3874 3733 3727 3214 506 3892 3787 3457 3552 3490 4014 991 1991 3885 3947 4069 1488 1016 3258 3710 52 2362 3961 2680 1569 1851 3897 3825 3752 3808 3800 3873 3808 3792"
181
+
182
+ token_chunk_len = 75
183
+ chunk_cond_proportion = 0.3
184
+ chunk_look_ahead = 10
185
+ max_ref_duration = 4.5
186
+ ref_audio_cut_from_head = False
187
+
188
+ detoker = SpeechDetokenizer(
189
+ vocoder_path=vocoder_path,
190
+ )
191
+
192
+ generated_wave, target_sample_rate = detoker.chunk_generate(
193
+ ref_wav_path,
194
+ ref_tokens.split(),
195
+ input_tokens.split(),
196
+ token_chunk_len,
197
+ chunk_cond_proportion,
198
+ chunk_look_ahead,
199
+ max_ref_duration,
200
+ ref_audio_cut_from_head,
201
+ )
202
+
203
+ with open("./detok.wav", "wb") as f:
204
+ sf.write(f.name, generated_wave, target_sample_rate)
205
+ print(f"write output to: {f.name}")
206
  ```
207
 
208
  ## More tools to be released:
README_CN.md CHANGED
@@ -23,7 +23,7 @@ V1.0
23
  串联使用tokenizer和detokenizer实现语音重建的功能
24
  <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
25
 
26
- 上述pipeline在seed-tts-eval数据集的TTS和语音重建任务上上达到了一流的水平:
27
  <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
28
  <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
29
 
@@ -38,12 +38,163 @@ V1.0
38
 
39
  ## Installation
40
 
 
41
  ### Create a separate environment if needed
42
 
43
  ```bash
44
  # Create a conda env with python_version>=3.10 (you could also use virtualenv)
45
  conda create -n dstk python=3.10
46
  conda activate dstk
47
  ```
48
 
49
 
 
23
  串联使用tokenizer和detokenizer实现语音重建的功能
24
  <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
25
 
26
+ 上述pipeline在seed-tts-eval数据集的TTS和语音重建任务上达到了一流的水平,且模型的参数量和使用的监督数据都远小于对照基线模型:
27
  <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
28
  <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
29
 
 
38
 
39
  ## Installation
40
 
41
+ ### Hardware: Ascend 910B with CANN 8.1 RC1 or GPU
42
  ### Create a separate environment if needed
43
 
44
  ```bash
45
  # Create a conda env with python_version>=3.10 (you could also use virtualenv)
46
  conda create -n dstk python=3.10
47
  conda activate dstk
48
+
49
+ # run install_requirements.sh to set up the environment for DSTK inference on Ascend 910B
50
+ # for GPUs, just remove torch-npu==2.5.1 from requirements_npu.txt
51
+ sh install_requirements.sh
52
+
53
+ # patch for G2P
54
+ # modify the first line in thirdparty/G2P/patch_for_deps.sh:
55
+ # SITE_PATH=/path/to/your/own/site-packages
56
+ # run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
57
+ sh thirdparty/G2P/patch_for_deps.sh
58
+ ```
59
+
60
+ ### Download the vocos vocoder from [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)
61
+
62
+ ## Usage:
63
+ ### Pipelines
64
+
65
+
66
+ ```python
67
+ import sys
68
+ import soundfile as sf
69
+
70
+ dstk_path = "/path/to/DSTK"
71
+ sys.path.append(dstk_path)
72
+
73
+ from reconstuction_example import ReconstructionPipeline
74
+ from tts_example import TTSPipeline
75
+
76
+ ref_wav_path = dstk_path + "/00004557-00000030.wav"
77
+ input_wav_path = dstk_path + "/004892.wav"
78
+ vocoder_path = "/path/to/vocos-mel-24khz"
79
+
80
+ reconstructor = ReconstructionPipeline(
81
+ detok_vocoder=vocoder_path,
82
+ )
83
+
84
+ tts = TTSPipeline(
85
+ detok_vocoder=vocoder_path,
86
+ max_seg_len=30,
87
+ )
88
+
89
+ # for non-parallel speech reconstruction
90
+ generated_wave, target_sample_rate = reconstructor.reconstruct(
91
+ ref_wav_path, input_wav_path
92
+ )
93
+
94
+ with open("./recon.wav", "wb") as f:
95
+ sf.write(f.name, generated_wave, target_sample_rate)
96
+ print(f"write output to: {f.name}")
97
+
98
+ # for tts
99
+ ref_wav_path = input_wav_path
100
+ generated_wave, target_sample_rate = tts.synthesize(
101
+ ref_wav_path,
102
+ "荷花未全谢,又到中秋节。家家户户把月饼切,庆中秋。美酒多欢乐,整杯盘,猜拳行令,同赏月。",
103
+ )
104
+ with open("./tts.wav", "wb") as f:
105
+ sf.write(f.name, generated_wave, target_sample_rate)
106
+ print(f"write output to: {f.name}")
107
+
108
+ print("Finished")
109
+ ```
110
+
111
+ ### Tokenization
112
+ ```python
113
+ import sys
114
+ import librosa
115
+
116
+ dstk_path = "/path/to/DSTK"
117
+ sys.path.append(dstk_path)
118
+
119
+ input_wav_path = dstk_path + "/004892.wav"
120
+
121
+ from semantic_tokenizer.f40ms.simple_tokenizer_infer import SpeechTokenizer
122
+
123
+ tokenizer = SpeechTokenizer()
124
+
125
+ raw_wav, sr = librosa.load(input_wav_path, sr=16000)
126
+ token_list, token_info_list = tokenizer.extract([raw_wav])  # pass in the raw waveform data
127
+ for token_info in token_info_list:
128
+ print(token_info["unit_sequence"] + "\n")
129
+ print(token_info["reduced_unit_sequence"] + "\n")
130
+ ```
131
+
132
+ ### Text2Token
133
+ ```python
134
+ import sys
135
+ import librosa
136
+
137
+ dstk_path = "/path/to/DSTK"
138
+ sys.path.append(dstk_path)
139
+
140
+ from text2token.simple_infer import Text2TokenGenerator
141
+
142
+ input_text = "从离散语音token重建语音波形"
143
+ MAX_SEG_LEN = 30
144
+
145
+ t2u = Text2TokenGenerator()
146
+
147
+ phones = t2u.text2phone(input_text.strip())
148
+ print("phonemes of input text: %s are [%s]" % (input_text, phones))
149
+
150
+ speech_tokens_info = t2u.generate_for_long_input_text(
151
+ [phones], max_segment_len=MAX_SEG_LEN
152
+ )
153
+
154
+ for info in speech_tokens_info[0]:
155
+ print(" ".join(info) + "\n")
156
+ ```
157
+ ### Detokenization
158
+ ```python
159
+ import sys
160
+ import soundfile as sf
161
+
162
+ dstk_path = "/path/to/DSTK"
163
+ sys.path.append(dstk_path)
164
+
165
+ from semantic_detokenizer.chunk_infer import SpeechDetokenizer
166
+
167
+ # speech tokens for the sentence 从离散语音token重建语音波形 ("reconstruct the speech waveform from discrete speech tokens")
168
+ input_tokens = "3953 3890 3489 456 2693 3239 3692 3810 3874 3882 2749 548 3202 4012 3490 3939 3988 411 722 826 2812 3883 3874 3810 3983 4086 3946 3747 3469 2537 3689 3434 1816 1242 2415 3942 3363 3865 2841 1700 1652 3241 3362 3363 3874 3882 2792 933 2253 2799 3692 3746 3882 2809 1001 2449 1016 3762 3882 3874 3810 3809 3983 4086 4018 3747 3461 2537 3624 3882 3382 581 1837 2413 3435 4005 2003 2890 3884 3690 3746 3938 3874 3873 3856"
169
+ vocoder_path = "/path/to/vocos-mel-24khz"
170
+ ref_wav_path = dstk_path + "/004892.wav"
171
+ # output of tokenizer given ref_wav as input
172
+ ref_tokens = "3936 3872 3809 3873 3817 3639 2591 539 1021 3641 3890 4069 2002 3537 2303 3773 3827 3875 3969 4072 2425 97 2537 3633 3690 3865 3920 3069 3582 3883 3818 3997 4031 4029 3946 3874 3733 3727 3214 506 3892 3787 3457 3552 3490 4014 991 1991 3885 3947 4069 1488 1016 3258 3710 52 2362 3961 2680 1569 1851 3897 3825 3752 3808 3800 3873 3808 3792"
173
+
174
+ token_chunk_len = 75
175
+ chunk_cond_proportion = 0.3
176
+ chunk_look_ahead = 10
177
+ max_ref_duration = 4.5
178
+ ref_audio_cut_from_head = False
179
+
180
+ detoker = SpeechDetokenizer(
181
+ vocoder_path=vocoder_path,
182
+ )
183
+
184
+ generated_wave, target_sample_rate = detoker.chunk_generate(
185
+ ref_wav_path,
186
+ ref_tokens.split(),
187
+ input_tokens.split(),
188
+ token_chunk_len,
189
+ chunk_cond_proportion,
190
+ chunk_look_ahead,
191
+ max_ref_duration,
192
+ ref_audio_cut_from_head,
193
+ )
194
+
195
+ with open("./detok.wav", "wb") as f:
196
+ sf.write(f.name, generated_wave, target_sample_rate)
197
+ print(f"write output to: {f.name}")
198
  ```
199
 
200
 
semantic_detokenizer/README.md CHANGED
@@ -2,7 +2,7 @@
2
  #### Our detokenizer is developed based on the [F5-TTS](https://github.com/SWivid/F5-TTS) framework and features two specific improvements.
3
 
4
  1. The DiT module has been replaced by a DiT variant with cross-attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
5
- <p align="center"><img src="../figs/CADiT.jpg" height="600"></p>
6
 
7
  2. A chunk-based streaming inference algorithm is developed; it allows the model to generate speech of arbitrary length.
8
  <p align="center"><img src="../figs/F5-streaming.jpg" width="1200"></p>
 
2
  #### Our detokenizer is developed based on the [F5-TTS](https://github.com/SWivid/F5-TTS) framework and features two specific improvements.
3
 
4
  1. The DiT module has been replaced by a DiT variant with cross-attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
5
+ <p align="center"><img src="../figs/CADiT.jpg" height="500"></p>
6
 
7
  2. A chunk-based streaming inference algorithm is developed; it allows the model to generate speech of arbitrary length.
8
  <p align="center"><img src="../figs/F5-streaming.jpg" width="1200"></p>
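The chunking behind this streaming algorithm can be sketched roughly as follows. This is an illustrative reimplementation, not the repository's actual `chunk_generate`; the parameter names `token_chunk_len` and `chunk_look_ahead` are borrowed from the Detokenization example in this commit, and the exact boundary policy is an assumption.

```python
def plan_chunks(num_tokens, token_chunk_len=75, chunk_look_ahead=10):
    """Plan (emit_start, emit_end, context_end) spans for streaming
    detokenization (illustrative). Each step decodes token_chunk_len new
    tokens while peeking at up to chunk_look_ahead future tokens to keep
    chunk boundaries smooth; only [emit_start, emit_end) is emitted."""
    chunks = []
    start = 0
    while start < num_tokens:
        emit_end = min(start + token_chunk_len, num_tokens)
        context_end = min(emit_end + chunk_look_ahead, num_tokens)
        chunks.append((start, emit_end, context_end))
        start = emit_end
    return chunks

print(plan_chunks(160))  # three steps for a 160-token sequence
```

Because each step needs only a bounded window of tokens, total output length is unbounded, which is what enables speech of arbitrary length.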
semantic_detokenizer/chunk_infer.py CHANGED
@@ -19,6 +19,7 @@ import os
19
  from datetime import datetime
20
  from importlib.resources import files
21
  from pathlib import Path
 
22
  import tqdm
23
 
24
  import soundfile as sf
@@ -33,6 +34,8 @@ from f5_tts.infer.utils_infer import (
33
  load_vocoder,
34
  remove_silence_for_generated_wav,
35
  )
 
 
36
  from utils_infer import (
37
  mel_spec_type,
38
  target_rms,
 
19
  from datetime import datetime
20
  from importlib.resources import files
21
  from pathlib import Path
22
+ import sys
23
  import tqdm
24
 
25
  import soundfile as sf
 
34
  load_vocoder,
35
  remove_silence_for_generated_wav,
36
  )
37
+
38
+ sys.path.append(str(Path(__file__).parent))
39
  from utils_infer import (
40
  mel_spec_type,
41
  target_rms,
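The `sys.path.append` added in this hunk is the usual way to make a sibling module (`utils_infer` here) importable regardless of the caller's working directory; for reference, an idempotent form of the same pattern looks like this:

```python
import sys
from pathlib import Path

# Resolve the directory containing this script and add it to the module
# search path only once, so sibling modules import cleanly from any CWD.
script_dir = str(Path(__file__).resolve().parent)
if script_dir not in sys.path:
    sys.path.append(script_dir)
```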
semantic_tokenizer/f40ms/README.md CHANGED
@@ -1,10 +1,11 @@
1
  ## Speech Semantic Tokenizer
2
- As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1.
3
  <p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
4
 
5
 
6
  To run this semantic tokenizer alone, the required packages should be installed.
7
  ```bash
8
- # install requirements for this semantic tokenizer
 
9
  pip install -r requirements_npu.txt
10
  ```
 
1
  ## Speech Semantic Tokenizer
2
+ As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq).
3
  <p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
4
 
5
 
6
  To run this semantic tokenizer alone, the required packages should be installed.
7
  ```bash
8
+ # install requirements for this semantic tokenizer on Ascend 910B
9
+ # for GPUs, just remove torch-npu==2.5.1
10
  pip install -r requirements_npu.txt
11
  ```
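The Tokenization example earlier in this commit prints both a `unit_sequence` and a `reduced_unit_sequence`. For a 40 ms frame-level tokenizer, the reduction is presumably collapsing consecutive repeated units; a minimal sketch of that assumption (not the repository's code):

```python
def reduce_units(unit_sequence):
    """Collapse runs of identical adjacent units (illustrative).
    Frame-level tokenizers emit one unit per frame, so steady sounds
    repeat; deduplicating adjacent repeats shortens the sequence."""
    reduced = []
    for unit in unit_sequence.split():
        if not reduced or reduced[-1] != unit:
            reduced.append(unit)
    return " ".join(reduced)

print(reduce_units("3936 3936 3872 3809 3809 3809 3873"))  # 3936 3872 3809 3873
```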
thirdparty/G2P/patch_for_deps.sh CHANGED
@@ -1,4 +1,4 @@
1
- SITE_PATH=/home/ma-user/anaconda3/envs/token/lib/python3.10/site-packages
2
  # fix bugs for LangSegment 0.2.0
3
  sed -i -r 's/,setLangfilters,getLangfilters//' $SITE_PATH/LangSegment/__init__.py
4
  # patch for pypinyin
 
1
+ SITE_PATH=$HOME/.conda/envs/token/lib/python3.10/site-packages
2
  # fix bugs for LangSegment 0.2.0
3
  sed -i -r 's/,setLangfilters,getLangfilters//' $SITE_PATH/LangSegment/__init__.py
4
  # patch for pypinyin
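To fill in `SITE_PATH` for a different environment, the active interpreter can report its own site-packages directory; a standard-library helper (the exact path layout varies by platform and distribution):

```python
import sysconfig

def site_packages_path():
    """Return the purelib (site-packages) directory of the running
    interpreter, e.g. .../envs/token/lib/python3.10/site-packages."""
    return sysconfig.get_paths()["purelib"]

print(site_packages_path())
```

Running this inside the activated conda env prints the value to paste into the first line of `patch_for_deps.sh`.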
thirdparty/G2P/whitelist/english/new_tts.tsv CHANGED
@@ -4963,3 +4963,4 @@ z. y. zy
4963
  z.y. zy
4964
  z. z. zz
4965
  z.z. zz
 
 
4963
  z.y. zy
4964
  z. z. zz
4965
  z.z. zz
4966
+ dstk d s t k
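The new entry maps the literal `dstk` to the spelled-out reading `d s t k`, so the G2P frontend pronounces the toolkit name letter by letter. A minimal sketch of loading and querying such a two-column TSV (illustrative; the repo's actual frontend logic may differ):

```python
def load_whitelist(lines):
    """Parse two-column TSV lines: written form TAB reading (illustrative)."""
    table = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            table[parts[0].lower()] = parts[1]
    return table

whitelist = load_whitelist(["z.z.\tzz", "dstk\td s t k"])
print(whitelist["dstk"])  # d s t k
```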