Malaysian Text-to-Speech
Continued pretraining of mesolitica/Malaysian-TTS-0.6B-v0.1 on a much more consistent dataset.
pip3 install git+https://github.com/mesolitica/DistilCodec
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000
import re

import soundfile as sf
from distilcodec import DistilCodec, demo_for_generate_audio_codes
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

# Paths to the DistilCodec files downloaded with the wget commands above.
codec_model_config_path = 'model_config.json'
codec_ckpt_path = 'g_00204000'
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()
tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-0.6B-v1')
model = AutoModelForCausalLM.from_pretrained('mesolitica/Malaysian-TTS-0.6B-v1', torch_dtype='auto').cuda()
speakers = [
'husein',
'idayu',
'singaporean',
'DisfluencySpeech',
'singlish-speaker2050',
'singlish-speaker2202',
'haqkiem',
]
string = 'IC saya adalah, sembilan enam, kosong tiga, satu empat, one, one, one, one, A, B, C, D, and yes, what can I help you sir?'
for s in tqdm(speakers):
    # Prompt format: speaker name, then the text, then the speech start marker.
    left = s + ': ' + string
    prompt = f'<|im_start|>{left}<|speech_start|>'
    generate_kwargs = dict(
        **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
        max_new_tokens=1024,
        temperature=0.7,
        do_sample=True,
        repetition_penalty=1.1,
    )
    generation_output = model.generate(**generate_kwargs)
    # Keep only the text after the speech start marker and drop the end-of-text marker.
    speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    # Extract the integer codec codes from the speech_<n> tokens.
    numbers = re.findall(r'speech_(\d+)', speech_token)
    d = list(map(int, numbers))
    # Decode the codes to a waveform and save it at 24 kHz.
    y_gen = codec.decode_from_codes(
        d,
        minus_token_offset=False
    )
    sf.write(f'{s}.mp3', y_gen[0, 0].cpu().numpy(), 24000)
Output: one audio file per speaker, named after the speaker (for example husein.mp3).
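The prompt construction and speech-token parsing used in the loop above can be tried without loading the model or the codec. The following is a minimal sketch: the helper names `build_prompt` and `extract_speech_codes` are hypothetical, and the `<|speech_N|>` token spelling in the fake output is an assumption made only so the regular expression from the example has something to match.

```python
import re

SPEECH_START = "<|speech_start|>"
END_OF_TEXT = "<|endoftext|>"


def build_prompt(speaker: str, text: str) -> str:
    """Build the prompt the same way as the example: '<speaker>: <text>' plus markers."""
    return f"<|im_start|>{speaker}: {text}{SPEECH_START}"


def extract_speech_codes(decoded: str) -> list[int]:
    """Return the integer codes found after the last speech start marker."""
    speech_part = decoded.split(SPEECH_START)[-1].replace(END_OF_TEXT, "")
    return [int(n) for n in re.findall(r"speech_(\d+)", speech_part)]


prompt = build_prompt("husein", "apa khabar")
# Made-up decoded output; the real token spelling is an assumption here.
fake_output = prompt + "<|speech_12|><|speech_7|><|speech_300|>" + END_OF_TEXT
codes = extract_speech_codes(fake_output)
print(codes)  # [12, 7, 300]
```

The resulting list of integers is what the example passes to `codec.decode_from_codes`.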
Numbers such as 123 are not handled directly; you have to normalize them first to become "one two three", "one hundred twenty three", "satu dua tiga" or "seratus dua puluh tiga". Feel free to use Malaya for normalization; Malaya supports Malay and English normalization, read more at https://github.com/mesolitica/malaya/issues/247#issuecomment-3030313021
Repeated letters written as "A, A, A, A, B, B" are spoken in our recordings as "A A A A B B". We have no intention of improving this due to cost, but continued finetuning on a proper dataset should be able to solve it.
Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts
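The digit normalization described above can be illustrated with a small stand-in. This is a sketch only, not Malaya's API: the function `spell_digits` and the `DIGIT_WORDS` table are hypothetical, and it covers only the digit-by-digit reading ("satu dua tiga"), not the cardinal reading ("seratus dua puluh tiga"), for which a proper normalizer such as Malaya is the better fit.

```python
import re

# Digit words indexed by digit value, for English ("en") and Malay ("ms").
DIGIT_WORDS = {
    "en": ["zero", "one", "two", "three", "four",
           "five", "six", "seven", "eight", "nine"],
    "ms": ["kosong", "satu", "dua", "tiga", "empat",
           "lima", "enam", "tujuh", "lapan", "sembilan"],
}


def spell_digits(text: str, lang: str = "ms") -> str:
    """Replace every run of ASCII digits with its digits spelled out one by one."""
    words = DIGIT_WORDS[lang]

    def replace(match: re.Match) -> str:
        return " ".join(words[int(ch)] for ch in match.group())

    return re.sub(r"[0-9]+", replace, text)


print(spell_digits("IC saya adalah 960314", "ms"))
print(spell_digits("call 123 now", "en"))
```

The normalized string can then be used as the text part of the prompt in place of the raw digits.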
Special thanks to https://www.sns.com.my and Nvidia for 1x H100!