Prince-1 committed
Commit a847b65 · verified · 1 Parent(s): ba8fb51

Add files using upload-large-folder tool

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ tokenizer_config.json filter=lfs diff=lfs merge=lfs -text
+ model.onnx.data filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,281 @@
+ ---
+ license: cc-by-nc-4.0
+ language:
+ - zh
+ - en
+ base_model:
+ - meta-llama/Llama-3.2-3B-Instruct
+ - unsloth/Llasa-3B
+ tags:
+ - Text-to-Speech
+ - onnx
+ - onnxruntime-genai
+ - onnxruntime
+ pipeline_tag: text-to-speech
+ library_name: onnxruntime-genai
+ base_model_relation: quantized
+ ---
+ <div>
+ <p style="margin-bottom: 0; margin-top: 0;">
+ <strong>See <a href="https://huggingface.co/collections/unsloth/text-to-speech-tts-models-68007ab12522e96be1e02155">our collection</a> for all our TTS model uploads.</strong>
+ </p>
+ <p style="margin-bottom: 0;">
+ <em>Learn to fine-tune TTS models - <a href="https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning">Read our Guide</a>.</em>
+ </p>
+ <p style="margin-top: 0;margin-bottom: 0;">
+ <em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em>
+ </p>
+ <div style="display: flex; gap: 5px; align-items: center; ">
+ <a href="https://github.com/unslothai/unsloth/">
+ <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
+ </a>
+ <a href="https://discord.gg/unsloth">
+ <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
+ </a>
+ <a href="https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning">
+ <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
+ </a>
+ </div>
+ <h1 style="margin-top: 0rem;">✨ Run & Fine-tune TTS models with Unsloth!</h1>
+ </div>
+
+ - Fine-tune TTS models for free using our Google [Colab notebooks here](https://docs.unsloth.ai/get-started/unsloth-notebooks#text-to-speech-tts-notebooks)!
+ - Read our Blog about TTS support: [unsloth.ai/blog/tts](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning)
+
+ | Unsloth supports | Free Notebooks | Performance | Memory use |
+ |-----------------|----------------|-------------|------------|
+ | **Llasa-3B** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llasa_TTS_(3B).ipynb) | 1.5x faster | 58% less |
+ | **Whisper Large V3** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Whisper.ipynb) | 1.5x faster | 50% less |
+ | **Qwen3 (14B)** | [▶️ Start on Colab](https://docs.unsloth.ai/get-started/unsloth-notebooks) | 2x faster | 70% less |
+ | **Llama 3.2 Vision (11B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 1.8x faster | 50% less |
+
+ [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2502.04128)
+
+ **Update (2025-05-10):** We sometimes find that `top_p=0.95` and `temperature=0.9` produce more stable results.
+
+ **Update (2025-02-13):** Added the [Llasa fine-tuning instructions](https://github.com/zhenye234/LLaSA_training/tree/main/finetune).
+
+ **Update (2025-02-07):** Our paper has been released!
+
+ *LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis*
+
+ - **Train from Scratch**: If you want to train the model from scratch, use the [LLaSA Training Repository](https://github.com/zhenye234/LLaSA_training).
+
+ - **Scale for Test-Time Computation**: If you want to experiment with scaling test-time computation, use the [LLaSA Testing Repository](https://github.com/zhenye234/LLaSA_inference).
+
+ ## Model Information
+ Our model, Llasa, is a text-to-speech (TTS) system that extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens. We trained Llasa on a dataset comprising 250,000 hours of Chinese-English speech data.
+ The model is capable of generating speech **either solely from input text or by utilizing a given speech prompt.**
+
+ The method is seamlessly compatible with the Llama framework, making TTS training much like LLM training: audio is converted into single-codebook tokens and simply treated as a special language. This opens up the possibility of applying existing LLM techniques for compression, acceleration, and fine-tuning to TTS. A small sketch of this token representation follows.
+
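To make the "special language" view concrete, here is a small illustrative sketch (not part of the original card; the code indices below are made up) of how XCodec2 code indices become the `<|s_...|>` text tokens that Llasa reads and writes:

```python
# Illustrative sketch only: real code indices come from XCodec2's encoder.
codebook_size = 65536                        # size of the XCodec2 codebook used by Llasa

example_codes = [23456, 102, 64001]          # one integer code per audio frame (made up)
assert all(c < codebook_size for c in example_codes)

speech_tokens = [f"<|s_{c}|>" for c in example_codes]

# The LLM then sees ordinary text plus these speech tokens, e.g.:
# <|TEXT_UNDERSTANDING_START|>Hello.<|TEXT_UNDERSTANDING_END|>
# <|SPEECH_GENERATION_START|><|s_23456|><|s_102|><|s_64001|>...<|SPEECH_GENERATION_END|>
print("".join(speech_tokens))                # -> <|s_23456|><|s_102|><|s_64001|>
```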
+ ## How to use
+ Install [XCodec2](https://huggingface.co/HKUSTAudio/xcodec2).
+
+ **1. Speech synthesis solely from input text**
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+ import soundfile as sf
+
+ llasa_3b = 'HKUSTAudio/Llasa-3B'
+
+ tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
+ model = AutoModelForCausalLM.from_pretrained(llasa_3b)
+ model.eval()
+ model.to('cuda')
+
+ from xcodec2.modeling_xcodec2 import XCodec2Model
+
+ model_path = "HKUSTAudio/xcodec2"
+
+ Codec_model = XCodec2Model.from_pretrained(model_path)
+ Codec_model.eval().cuda()
+
+ input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.'
+ # input_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'
+
+ def ids_to_speech_tokens(speech_ids):
+     speech_tokens_str = []
+     for speech_id in speech_ids:
+         speech_tokens_str.append(f"<|s_{speech_id}|>")
+     return speech_tokens_str
+
+ def extract_speech_ids(speech_tokens_str):
+     speech_ids = []
+     for token_str in speech_tokens_str:
+         if token_str.startswith('<|s_') and token_str.endswith('|>'):
+             num_str = token_str[4:-2]
+             num = int(num_str)
+             speech_ids.append(num)
+         else:
+             print(f"Unexpected token: {token_str}")
+     return speech_ids
+
+ # TTS start!
+ with torch.no_grad():
+     formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
+
+     # Tokenize the text
+     chat = [
+         {"role": "user", "content": "Convert the text to speech:" + formatted_text},
+         {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
+     ]
+
+     input_ids = tokenizer.apply_chat_template(
+         chat,
+         tokenize=True,
+         return_tensors='pt',
+         continue_final_message=True
+     )
+     input_ids = input_ids.to('cuda')
+     speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
+
+     # Generate the speech autoregressively
+     outputs = model.generate(
+         input_ids,
+         max_length=2048,  # We trained our model with a max length of 2048
+         eos_token_id=speech_end_id,
+         do_sample=True,
+         top_p=1,          # Adjusts the diversity of generated content
+         temperature=0.8,  # Controls randomness in output
+     )
+     # Extract the speech tokens
+     generated_ids = outputs[0][input_ids.shape[1]:-1]
+
+     speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+
+     # Convert token <|s_23456|> to int 23456
+     speech_tokens = extract_speech_ids(speech_tokens)
+
+     speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
+
+     # Decode the speech tokens to speech waveform
+     gen_wav = Codec_model.decode_code(speech_tokens)
+
+ sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
+ ```
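Per the 2025-05-10 note above, slightly different sampling settings can sometimes give more stable output. A minimal variant of the `generate` call from the example above (reusing the same `model`, `input_ids`, and `speech_end_id`):

```python
# Same call as above, only with the alternative sampling settings from the
# 2025-05-10 update; whether they help can vary from input to input.
outputs = model.generate(
    input_ids,
    max_length=2048,
    eos_token_id=speech_end_id,
    do_sample=True,
    top_p=0.95,
    temperature=0.9,
)
```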
+
+ **2. Speech synthesis utilizing a given speech prompt**
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+ import soundfile as sf
+
+ llasa_3b = 'HKUSTAudio/Llasa-3B'
+
+ tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
+ model = AutoModelForCausalLM.from_pretrained(llasa_3b)
+ model.eval()
+ model.to('cuda')
+
+ from xcodec2.modeling_xcodec2 import XCodec2Model
+
+ model_path = "HKUSTAudio/xcodec2"
+
+ Codec_model = XCodec2Model.from_pretrained(model_path)
+ Codec_model.eval().cuda()
+
+ # Only 16 kHz speech is supported!
+ prompt_wav, sr = sf.read("太乙真人.wav")  # you can find this wav in the repo's Files tab
+ # prompt_wav, sr = sf.read("Anna.wav")   # English prompt
+ prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)
+
+ prompt_text = "对,这就是我万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。"
+ # prompt_text = "A chance to leave him alone, but... No. She just wanted to see him again. Anna, you don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything."
+ target_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'
+ # target_text = "Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me."
+ input_text = prompt_text + target_text
+
+ def ids_to_speech_tokens(speech_ids):
+     speech_tokens_str = []
+     for speech_id in speech_ids:
+         speech_tokens_str.append(f"<|s_{speech_id}|>")
+     return speech_tokens_str
+
+ def extract_speech_ids(speech_tokens_str):
+     speech_ids = []
+     for token_str in speech_tokens_str:
+         if token_str.startswith('<|s_') and token_str.endswith('|>'):
+             num_str = token_str[4:-2]
+             num = int(num_str)
+             speech_ids.append(num)
+         else:
+             print(f"Unexpected token: {token_str}")
+     return speech_ids
+
+ # TTS start!
+ with torch.no_grad():
+     # Encode the prompt wav
+     vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
+     print("Prompt Vq Code Shape:", vq_code_prompt.shape)
+
+     vq_code_prompt = vq_code_prompt[0, 0, :]
+     # Convert int 12345 to token <|s_12345|>
+     speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)
+
+     formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
+
+     # Tokenize the text and the speech prefix
+     chat = [
+         {"role": "user", "content": "Convert the text to speech:" + formatted_text},
+         {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
+     ]
+
+     input_ids = tokenizer.apply_chat_template(
+         chat,
+         tokenize=True,
+         return_tensors='pt',
+         continue_final_message=True
+     )
+     input_ids = input_ids.to('cuda')
+     speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
+
+     # Generate the speech autoregressively
+     outputs = model.generate(
+         input_ids,
+         max_length=2048,  # We trained our model with a max length of 2048
+         eos_token_id=speech_end_id,
+         do_sample=True,
+         top_p=1,
+         temperature=0.8,
+     )
+     # Extract the speech tokens
+     generated_ids = outputs[0][input_ids.shape[1]-len(speech_ids_prefix):-1]
+
+     speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+
+     # Convert token <|s_23456|> to int 23456
+     speech_tokens = extract_speech_ids(speech_tokens)
+
+     speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
+
+     # Decode the speech tokens to speech waveform
+     gen_wav = Codec_model.decode_code(speech_tokens)
+
+     # If you only need the generated part:
+     # gen_wav = gen_wav[:, :, prompt_wav.shape[1]:]
+
+ sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
+ ```
+
+ ## Disclaimer
+
+ This model is licensed under the CC BY-NC 4.0 license, which prohibits commercial use because of ethics and privacy concerns; detected violations will result in legal consequences.
+
+ Using this codebase for any illegal purpose, in any country or region, is strictly prohibited. Please refer to your local laws regarding the DMCA and other related regulations.
genai_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+     "model": {
+         "bos_token_id": 128000,
+         "context_length": 131072,
+         "decoder": {
+             "session_options": {
+                 "log_id": "onnxruntime-genai",
+                 "provider_options": []
+             },
+             "filename": "model.onnx",
+             "head_size": 128,
+             "hidden_size": 3072,
+             "inputs": {
+                 "input_ids": "input_ids",
+                 "attention_mask": "attention_mask",
+                 "position_ids": "position_ids",
+                 "past_key_names": "past_key_values.%d.key",
+                 "past_value_names": "past_key_values.%d.value"
+             },
+             "outputs": {
+                 "logits": "logits",
+                 "present_key_names": "present.%d.key",
+                 "present_value_names": "present.%d.value"
+             },
+             "num_attention_heads": 24,
+             "num_hidden_layers": 28,
+             "num_key_value_heads": 8
+         },
+         "eos_token_id": [
+             128001,
+             128008,
+             128009
+         ],
+         "pad_token_id": 128001,
+         "type": "llama",
+         "vocab_size": 193800
+     },
+     "search": {
+         "diversity_penalty": 0.0,
+         "do_sample": true,
+         "early_stopping": true,
+         "length_penalty": 1.0,
+         "max_length": 131072,
+         "min_length": 0,
+         "no_repeat_ngram_size": 0,
+         "num_beams": 1,
+         "num_return_sequences": 1,
+         "past_present_share_buffer": false,
+         "repetition_penalty": 1.0,
+         "temperature": 0.6,
+         "top_k": 1,
+         "top_p": 0.9
+     }
+ }
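`genai_config.json` is the configuration that onnxruntime-genai reads when loading this export: the decoder file name, hidden sizes, KV-cache input/output names, and default search settings. A rough sketch of driving the folder through that runtime is below; the folder path and prompt string are placeholders, and the exact API calls (e.g. `append_tokens`) differ between onnxruntime-genai versions, so treat it as an outline rather than a verified recipe.

```python
# Sketch (not from the original card) of loading this ONNX export with
# onnxruntime-genai, which picks up genai_config.json from the model folder.
import onnxruntime_genai as og

MODEL_DIR = "./Llasa-3B-onnx"  # hypothetical local path holding model.onnx,
                               # model.onnx.data, genai_config.json, tokenizer.json

model = og.Model(MODEL_DIR)        # reads decoder settings from genai_config.json
tokenizer = og.Tokenizer(model)

# Simplified prompt; in practice mirror the chat template used in the README examples.
prompt = ("Convert the text to speech:"
          "<|TEXT_UNDERSTANDING_START|>Hello there.<|TEXT_UNDERSTANDING_END|>"
          "<|SPEECH_GENERATION_START|>")
input_ids = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048, do_sample=True, top_p=0.95, temperature=0.9)

generator = og.Generator(model, params)
generator.append_tokens(input_ids)     # older versions instead set params.input_ids
while not generator.is_done():
    generator.generate_next_token()

# The output is a run of <|s_...|> speech tokens that still has to be decoded
# to a waveform with XCodec2, exactly as in the README examples above.
print(tokenizer.decode(generator.get_sequence(0)))
```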
model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:440b8e8c6dfd9c5b562c4255c797d44e661d1b39d1b483ed1926d768a82bdfa3
+ size 653885
model.onnx.data ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7fa1bdeecda5008855cb6bfde5efb9fb3117dec297534de1d609062e6a0ddee0
+ size 8052463616
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+     "bos_token": {
+         "content": "<|begin_of_text|>",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "eos_token": {
+         "content": "<|eot_id|>",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "pad_token": {
+         "content": "<|eot_id|>",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:71d92f3dbf3c23d734e6356241cef149b42fe79848176a54145b6f9a886fd73b
+ size 29521206
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9cf6f8d6e3395f40bf6881f92e621e10e47aae25f5f090052adafb83bdc75661
+ size 11710454