fnlp / SpeechTokenizer

Commit 793c3ac (1 parent: 7db6ee9), committed by ZhangXInFD

First model version
README.md CHANGED
@@ -1,3 +1,102 @@
- ---
- license: apache-2.0
- ---
# SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

<a href='https://github.com/ZhangXInFD/SpeechTokenizer'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2308.16692'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>

## Introduction
This is the code for the SpeechTokenizer model presented in [SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models](https://0nutation.github.io/SpeechTokenizer.github.io/). SpeechTokenizer is a unified speech tokenizer for speech large language models that adopts an encoder-decoder architecture with residual vector quantization (RVQ). By unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across the RVQ layers. Specifically, the code indices output by the first RVQ quantizer can be considered semantic tokens, while the outputs of the remaining quantizers can be regarded as acoustic tokens, which supplement the information lost by the first quantizer. We provide the following model:
* A model operating at 16 kHz on monophonic speech, trained on LibriSpeech, with the average representation across all HuBERT layers as the semantic teacher.

<br>
<p align="center">
    <img src="images/overview.png" width="95%"> <br>
    Overview
</p>
<p align="center">
    <img src="images/speechtokenizer_framework.jpg" width="95%"> <br>
    The SpeechTokenizer framework.
</p>
<br>

You are welcome to try our [SLMTokBench](https://github.com/0nutation/SLMTokBench), and we will also open-source our [USLM](https://github.com/0nutation/USLM)!

## Samples

Samples are provided on [our demo page](https://0nutation.github.io/SpeechTokenizer.github.io/).

## Installation

SpeechTokenizer requires Python >= 3.8 and a reasonably recent version of PyTorch.
To install SpeechTokenizer, you can run:
```bash
pip install -U speechtokenizer

# or clone the repo and install locally
git clone https://github.com/ZhangXInFD/SpeechTokenizer.git
cd SpeechTokenizer
pip install .
```
## Usage
### Model storage
See the [model list](https://huggingface.co/fnlp/SpeechTokenizer).
### Load model
```python
from speechtokenizer import SpeechTokenizer

config_path = '/path/config.json'
ckpt_path = '/path/SpeechTokenizer.pt'
model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()
```
### Extracting discrete representations
```python
import torchaudio
import torch

# Load and pre-process the speech waveform
wav, sr = torchaudio.load('<SPEECH_FILE_PATH>')
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav.unsqueeze(0)

# Extract discrete codes from SpeechTokenizer
with torch.no_grad():
    codes = model.encode(wav)  # codes: (n_q, B, T)

semantic_tokens = codes[0, :, :]   # first RVQ layer: (B, T)
acoustic_tokens = codes[1:, :, :]  # remaining layers: (n_q - 1, B, T)
```
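For intuition about the `(n_q, B, T)` code layout, here is a minimal sketch that uses a dummy nested list as a stand-in for the code tensor, so no model or audio is needed; the values are purely illustrative:

```python
# Dummy stand-in for codes with layout (n_q, B, T): each entry is its quantizer index
n_q, B, T = 8, 1, 4
codes = [[[q for _ in range(T)] for _ in range(B)] for q in range(n_q)]

semantic_tokens = codes[0]   # first RVQ layer only: shape (B, T)
acoustic_tokens = codes[1:]  # remaining layers: shape (n_q - 1, B, T)

print(len(acoustic_tokens))  # 7
print(semantic_tokens[0])    # [0, 0, 0, 0]
```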

### Decoding discrete representations
```python
# Decode from the first quantizer up to the i-th quantizer
wav = model.decode(codes[:(i + 1)])  # wav: (B, 1, T)

# Decode from the i-th quantizer up to the j-th quantizer
wav = model.decode(codes[i: (j + 1)], st=i)

# Concatenate semantic tokens and acoustic tokens, then decode
semantic_tokens = ...  # (..., B, T)
acoustic_tokens = ...  # (..., B, T)
wav = model.decode(torch.cat([semantic_tokens, acoustic_tokens], dim=0))
```
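The `st` argument tells the decoder which RVQ layer the first row of the passed codes corresponds to. A minimal index sketch (plain Python, no model required; `i` and `j` are example values) of which layers each call consumes:

```python
n_q = 8
layers = list(range(n_q))  # RVQ layer indices 0..7

i, j = 2, 5
# model.decode(codes[:(i + 1)]) consumes layers 0..i
first_to_i = layers[:(i + 1)]
# model.decode(codes[i:(j + 1)], st=i) consumes layers i..j
i_to_j = layers[i:(j + 1)]

print(first_to_i)  # [0, 1, 2]
print(i_to_j)      # [2, 3, 4, 5]
```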

## Citation
If you use this code or these results in your paper, please cite our work as:
```tex
@misc{zhang2023speechtokenizer,
      title={SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models},
      author={Xin Zhang and Dong Zhang and Shimin Li and Yaqian Zhou and Xipeng Qiu},
      year={2023},
      eprint={2308.16692},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License
The code in this repository is released under the Apache 2.0 license as found in the [LICENSE](LICENSE) file.
images/READ.me ADDED
images/overview.png ADDED
images/speechtokenizer_framework.jpg ADDED
speechtokenizer_hubert_avg/SpeechTokenizer.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:d04593b6c9a4b475f91ca481141a6ef5b23e6ac112f347dd2b2717f193c1c728
size 481906997
speechtokenizer_hubert_avg/config.json ADDED
```json
{
    "resblock": "1",
    "num_gpus": 3,
    "batch_size": 60,
    "learning_rate": 0.0001,
    "adam_b1": 0.5,
    "adam_b2": 0.9,
    "lr_decay": 0.98,
    "seed": 1234,
    "lambda_distill": 0.15,

    "n_filters": 64,
    "strides": [8, 5, 4, 2],
    "dimension": 1024,
    "semantic_dimension": 768,
    "bidirectional": true,
    "dilation_base": 2,
    "residual_kernel_size": 3,
    "n_residual_layers": 1,
    "lstm_layers": 2,
    "activation": "ELU",

    "segment_size": 48000,
    "num_mels": 80,
    "num_freq": 1025,
    "n_fft": 1024,
    "hop_size": 240,
    "win_size": 1024,

    "sampling_rate": 16000,
    "sample_rate": 16000,

    "codebook_size": 1024,
    "n_q": 8,

    "fmin": 0,
    "fmax": 8000,
    "fmax_for_loss": null,

    "num_workers": 12,

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54322",
        "world_size": 1
    }
}
```
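From these config values one can derive the tokenizer's frame rate and code bitrate; a small sketch of the arithmetic, assuming the encoder strides multiply to the total hop length:

```python
import math

strides = [8, 5, 4, 2]   # "strides" from config.json
sample_rate = 16000      # "sample_rate"
codebook_size = 1024     # "codebook_size"
n_q = 8                  # "n_q"

hop = math.prod(strides)                              # total downsampling factor: 320
frame_rate = sample_rate // hop                       # code frames per second: 50
bits_per_frame = n_q * int(math.log2(codebook_size))  # 8 layers x 10 bits
bitrate = frame_rate * bits_per_frame                 # bits per second with all 8 layers

print(hop, frame_rate, bitrate)  # 320 50 4000
```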