Qwen3-TTS CustomVoice: HF Inference Endpoint
Custom handler.py that serves Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
on a dedicated HF Inference Endpoint via the official qwen-tts package.
The handler loads the 1.9B model and the 12Hz speech tokenizer (vocoder) at cold start and exposes a single POST route that returns a base64-encoded WAV (24 kHz mono).
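The request/response contract described above can be sketched as a custom handler. The `EndpointHandler` class with `__init__(path)` and `__call__(data)` is the convention Hugging Face custom handlers follow; everything else is an assumption. In particular, `fake_synthesize` is a hypothetical stand-in for the qwen-tts model call, the default speaker and language are guesses, and the actual handler.py may be organized differently.

```python
import base64
import io
import math
import struct
import wave
from typing import Any, Callable, Dict, List

SAMPLE_RATE = 24_000  # 24 kHz mono, as stated above


def pcm16_to_wav_b64(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> str:
    """Wrap raw 16-bit mono PCM in a WAV container and base64-encode it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("ascii")


def fake_synthesize(text: str, speaker: str, language: str, instruct: str) -> bytes:
    """Hypothetical stand-in for the model: 0.5 s of a 440 Hz tone as PCM16."""
    n_samples = SAMPLE_RATE // 2
    return b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)))
        for i in range(n_samples)
    )


class EndpointHandler:
    def __init__(self, path: str = "", synthesize: Callable[..., bytes] = fake_synthesize):
        # A real handler would load the model and speech tokenizer here (cold start).
        self.synthesize = synthesize

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        text = data["inputs"]
        params = data.get("parameters") or {}
        speaker = params.get("speaker", "Ryan")    # default is an assumption
        language = params.get("language", "Auto")  # default is an assumption
        pcm = self.synthesize(text, speaker, language, params.get("instruct", ""))
        return [{
            "audio": pcm16_to_wav_b64(pcm),
            "format": "wav",
            "sample_rate": SAMPLE_RATE,
            "duration_s": round(len(pcm) / 2 / SAMPLE_RATE, 2),  # 2 bytes per sample
            "speaker": speaker,
            "language": language,
        }]
```

Injecting the synthesis function keeps the contract testable without a GPU; a real handler would replace it with the model call.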
Request
{
  "inputs": "She watched the rain trace lines down the window.",
  "parameters": {
    "speaker": "Ryan",
    "language": "English",
    "instruct": "calm, observational"
  }
}
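A request like the one above could be sent from Python roughly as follows. This is a minimal sketch using only the standard library; the endpoint URL and token are placeholders you supply, and Bearer-token authentication is assumed (it applies to protected endpoints).

```python
import json
import urllib.request


def build_request(url: str, token: str, text: str, **parameters) -> urllib.request.Request:
    """Build the POST request; keyword arguments become the "parameters" object."""
    body = json.dumps({"inputs": text, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def synthesize(url: str, token: str, text: str, timeout: float = 120.0, **parameters):
    """Send the request and return the parsed JSON response."""
    req = build_request(url, token, text, **parameters)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `synthesize(url, token, "She watched the rain trace lines down the window.", speaker="Ryan", language="English", instruct="calm, observational")` sends the payload shown above. The generous timeout is a guess to allow for longer texts.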
Zero-shot clone (best-effort on CustomVoice; the -Base model is better for this):
{
  "inputs": "text in the cloned voice",
  "parameters": {
    "language": "English",
    "ref_audio_b64": "<base64 wav/mp3>",
    "ref_text": "exact transcript of the reference clip"
  }
}
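The `ref_audio_b64` field carries the reference clip as base64. A small helper along these lines could build the clone payload from a local file; the function name is hypothetical, and it simply encodes the file bytes as they are (no resampling or format conversion).

```python
import base64
from pathlib import Path


def build_clone_payload(text: str, ref_audio_path: str, ref_text: str,
                        language: str = "English") -> dict:
    """Build a zero-shot clone request from a reference WAV/MP3 file on disk."""
    ref_b64 = base64.b64encode(Path(ref_audio_path).read_bytes()).decode("ascii")
    return {
        "inputs": text,
        "parameters": {
            "language": language,
            "ref_audio_b64": ref_b64,
            "ref_text": ref_text,  # should match the clip's spoken words exactly
        },
    }
```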
Response: [{"audio": "<base64 wav>", "format": "wav", "sample_rate": 24000, "duration_s": 3.1, "speaker": "Ryan", "language": "English"}]
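To turn a response of this shape back into a playable file, something like the following should work. It assumes the WAV is 16-bit PCM, which is what Python's `wave` module can read; if the handler emits a float WAV, `wav_info` would raise and only the plain byte write applies.

```python
import base64
import io
import wave
from pathlib import Path
from typing import Tuple


def decode_audio(item: dict) -> bytes:
    """Decode the base64 "audio" field of one response item into WAV bytes."""
    return base64.b64decode(item["audio"])


def wav_info(wav_bytes: bytes) -> Tuple[int, int, float]:
    """Return (channels, sample rate in Hz, duration in seconds) of a PCM WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        return wav.getnchannels(), rate, wav.getnframes() / rate


def save_first(response: list, out_path: str) -> Path:
    """Write the first item's audio to disk and print a short summary."""
    wav_bytes = decode_audio(response[0])
    channels, rate, seconds = wav_info(wav_bytes)
    print(f"{channels} channel(s), {rate} Hz, {seconds:.2f} s")
    path = Path(out_path)
    path.write_bytes(wav_bytes)
    return path
```

Comparing the header values against the `sample_rate` and `duration_s` fields is a cheap sanity check on the payload.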
Built-in speakers
| Speaker | Voice | Native language |
|---|---|---|
| Vivian | bright, slightly edgy young female | Chinese |
| Serena | warm, gentle young female | Chinese |
| Uncle_Fu | seasoned male, low mellow timbre | Chinese |
| Dylan | youthful Beijing male | Chinese (Beijing) |
| Eric | lively Chengdu male | Chinese (Sichuan) |
| Ryan | dynamic male, strong rhythm | English |
| Aiden | sunny American male, clear midrange | English |
| Ono_Anna | playful Japanese female | Japanese |
| Sohee | warm Korean female, rich emotion | Korean |
Languages: Chinese, English, Japanese, Korean, German, French, Russian,
Portuguese, Spanish, Italian (or Auto).
Recommended instance
nvidia-l4 x1 (24 GB) is comfortable: the model is ~4 GB in bf16. Cold start
takes ~3-6 min (pip build + model + vocoder download + load). flash-attn is
deliberately omitted; the handler uses sdpa (PyTorch's scaled dot-product attention).
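The ~4 GB figure is consistent with a weights-only estimate of parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic under that assumption; it ignores the vocoder, activations, and any caches, which come out of the remaining headroom.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint in GB (10^9 bytes); bf16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(1.9e9)  # 1.9B parameters, as stated above
headroom_gb = 24 - weights_gb         # what is left on a 24 GB L4
print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```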
Lifecycle (from the video_lab repo)
python scripts/_run_qwen_tts_endpoint.py --start # create + wait for ready
python scripts/_run_qwen_tts_endpoint.py --status # show state + URL
python scripts/_run_qwen_tts_endpoint.py --stop # delete (full teardown)
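The script itself is not reproduced here. As a rough idea of what such a lifecycle wrapper might look like, the sketch below maps the three flags onto `huggingface_hub` calls. The endpoint name, vendor, and region are hypothetical, the repository id is assumed to be the one hosting handler.py, and the real script may be structured quite differently.

```python
import argparse

ENDPOINT_NAME = "qwen3-tts-customvoice"  # hypothetical endpoint name
REPO_ID = "macso250/qwen3-tts-endpoint"  # assumed to host handler.py


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Manage the TTS endpoint")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--start", action="store_true", help="create + wait for ready")
    group.add_argument("--status", action="store_true", help="show state + URL")
    group.add_argument("--stop", action="store_true", help="delete (full teardown)")
    return parser.parse_args(argv)


def main(argv=None) -> None:
    args = parse_args(argv)
    # Imported lazily so argument parsing works without the package installed.
    from huggingface_hub import create_inference_endpoint, get_inference_endpoint

    if args.start:
        endpoint = create_inference_endpoint(
            ENDPOINT_NAME,
            repository=REPO_ID,
            framework="pytorch",
            task="custom",
            accelerator="gpu",
            vendor="aws",            # assumption
            region="us-east-1",      # assumption
            instance_type="nvidia-l4",
            instance_size="x1",
        )
        endpoint.wait(timeout=900)   # block until running, or raise on timeout
        print(endpoint.status, endpoint.url)
    elif args.status:
        endpoint = get_inference_endpoint(ENDPOINT_NAME)
        print(endpoint.status, endpoint.url)
    else:
        get_inference_endpoint(ENDPOINT_NAME).delete()
```

Calling `main()` under the usual `if __name__ == "__main__":` guard would give the same three-flag interface. Deleting rather than pausing matches the "full teardown" described for `--stop`.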