Qwen3-TTS CustomVoice — HF Inference Endpoint

Custom handler.py that serves Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice on a dedicated HF Inference Endpoint via the official qwen-tts package.

The handler loads the 1.7B model and the 12Hz speech tokenizer (vocoder) at cold start and exposes a single POST route that returns a base64-encoded WAV (24 kHz mono).
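The response side of that route can be sketched with the standard library alone. This is a minimal illustration, not the actual handler.py: it assumes the model yields float samples in [-1.0, 1.0] and that the WAV is 16-bit PCM, neither of which is stated here. The name encode_wav_response is hypothetical, and a sine tone stands in for model output.

```python
import base64
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # 24 kHz mono, as described above


def encode_wav_response(samples, sample_rate=SAMPLE_RATE):
    """Pack float samples in [-1.0, 1.0] into a base64 mono WAV payload.

    Assumption: 16-bit PCM. The real handler may use another sample width.
    """
    # Clip each sample, then scale to signed 16-bit little-endian PCM.
    pcm = struct.pack(
        f"<{len(samples)}h",
        *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return {
        "audio": base64.b64encode(buf.getvalue()).decode("ascii"),
        "format": "wav",
        "sample_rate": sample_rate,
        "duration_s": round(len(samples) / sample_rate, 2),
    }


# Hypothetical stand-in for model output: half a second of a 440 Hz tone.
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 2)
]
payload = encode_wav_response(tone)
```

The speaker and language fields of the real response would be copied from the request parameters; they are left out here to keep the sketch focused on the audio encoding.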

Request

{
  "inputs": "She watched the rain trace lines down the window.",
  "parameters": {
    "speaker": "Ryan",
    "language": "English",
    "instruct": "calm, observational"
  }
}
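A client for this request might look like the sketch below. ENDPOINT_URL and HF_TOKEN are placeholders, the helper names build_request and synthesize are hypothetical, and treating instruct as optional is an assumption. Bearer-token authorization is the usual scheme for protected Inference Endpoints.

```python
import json
import urllib.request

# Placeholders, not real values: substitute your endpoint URL and HF access token.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."


def build_request(text, speaker="Ryan", language="English", instruct=None):
    """Build the JSON body for a built-in speaker request."""
    parameters = {"speaker": speaker, "language": language}
    if instruct:
        # Assumption: instruct may be omitted when no style hint is wanted.
        parameters["instruct"] = instruct
    return {"inputs": text, "parameters": parameters}


def synthesize(body, url=ENDPOINT_URL, token=HF_TOKEN, timeout=120):
    """POST the body to the endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


body = build_request(
    "She watched the rain trace lines down the window.",
    instruct="calm, observational",
)
# result = synthesize(body)  # uncomment once ENDPOINT_URL and HF_TOKEN are set
```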

Zero-shot clone (best-effort on CustomVoice; the -Base model is better for this):

{
  "inputs": "text in the cloned voice",
  "parameters": {
    "language": "English",
    "ref_audio_b64": "<base64 wav/mp3>",
    "ref_text": "exact transcript of the reference clip"
  }
}
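Building the clone request mostly comes down to base64-encoding the reference clip. In this sketch the helper name build_clone_request is hypothetical and the reference bytes are a dummy stand-in so the code runs without a file; in practice you would read a real WAV or MP3 from disk.

```python
import base64


def build_clone_request(text, ref_audio, ref_text, language="English"):
    """Build the JSON body for a zero-shot clone request.

    ref_audio is the raw bytes of the reference clip (WAV or MP3);
    ref_text must be the exact transcript of that clip.
    """
    parameters = {
        "language": language,
        "ref_audio_b64": base64.b64encode(ref_audio).decode("ascii"),
        "ref_text": ref_text,
    }
    # No "speaker" key: the voice comes from the reference clip instead.
    return {"inputs": text, "parameters": parameters}


# Dummy bytes for illustration only. With a real clip:
#   ref_audio = pathlib.Path("reference.wav").read_bytes()
ref_audio = b"RIFF\x00\x00\x00\x00WAVE"
clone_body = build_clone_request(
    "text in the cloned voice",
    ref_audio,
    "exact transcript of the reference clip",
)
```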

Response

[
  {
    "audio": "<base64 wav>",
    "format": "wav",
    "sample_rate": 24000,
    "duration_s": 3.1,
    "speaker": "Ryan",
    "language": "English"
  }
]
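On the client, the response is a one-element list whose audio field decodes straight to WAV bytes. The sketch below shows one way to unpack it. The helper names are hypothetical, and the response is a synthetic one (0.25 s of silence) built locally so the example runs without a live endpoint.

```python
import base64
import io
import wave


def decode_audio(response):
    """Return (wav_bytes, metadata) for the first item in the response list."""
    item = response[0]
    wav_bytes = base64.b64decode(item["audio"])
    metadata = {key: value for key, value in item.items() if key != "audio"}
    return wav_bytes, metadata


def wav_duration(wav_bytes):
    """Read the duration in seconds from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Synthetic response for demonstration: 6000 frames of 16-bit silence at 24 kHz.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24000)
    wav.writeframes(b"\x00\x00" * 6000)
fake_response = [{
    "audio": base64.b64encode(buf.getvalue()).decode("ascii"),
    "format": "wav",
    "sample_rate": 24000,
    "duration_s": 0.25,
    "speaker": "Ryan",
    "language": "English",
}]

wav_bytes, metadata = decode_audio(fake_response)
# To keep the audio: open("out.wav", "wb").write(wav_bytes)
```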

Built-in speakers

Speaker	Voice	Native language
Vivian	bright, slightly edgy young female	Chinese
Serena	warm, gentle young female	Chinese
Uncle_Fu	seasoned male, low mellow timbre	Chinese
Dylan	youthful Beijing male	Chinese (Beijing)
Eric	lively Chengdu male	Chinese (Sichuan)
Ryan	dynamic male, strong rhythm	English
Aiden	sunny American male, clear midrange	English
Ono_Anna	playful Japanese female	Japanese
Sohee	warm Korean female, rich emotion	Korean

Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian (or Auto).
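Checking parameters on the client before sending avoids a round trip for a mistyped name. The sketch below copies the speaker table and language list above into constants. The function check_parameters is hypothetical, and exact, case-sensitive matching is an assumption; the handler itself may be more lenient.

```python
# Built-in speakers and their native languages, copied from the table above.
SPEAKERS = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing)",
    "Eric": "Chinese (Sichuan)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}

LANGUAGES = {
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian", "Auto",
}


def check_parameters(parameters):
    """Return a list of problems; an empty list means nothing looks wrong."""
    problems = []
    speaker = parameters.get("speaker")
    # Clone requests carry no speaker, so only check it when present.
    if speaker is not None and speaker not in SPEAKERS:
        problems.append(f"unknown speaker: {speaker!r}")
    language = parameters.get("language")
    if language is not None and language not in LANGUAGES:
        problems.append(f"unsupported language: {language!r}")
    return problems


ok = check_parameters({"speaker": "Ryan", "language": "English"})
bad = check_parameters({"speaker": "ryan", "language": "Klingon"})
```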

Recommended instance

nvidia-l4 x1 (24 GB) is comfortable: the model is ~4 GB in bf16. Cold start takes ~3-6 min (pip build, model and vocoder download, load). flash-attn is deliberately omitted; the handler uses sdpa attention instead.
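The memory figure can be sanity-checked with a back-of-the-envelope calculation, assuming 1.7B parameters (from the model name) at 2 bytes each in bf16. This counts weights only; the speech tokenizer, activations, and any caches come on top, which presumably accounts for the gap to ~4 GB.

```python
def weights_gb(params_billions, bytes_per_param=2):
    """Approximate weight memory in decimal GB (bf16 = 2 bytes per parameter)."""
    return params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB


GPU_MEMORY_GB = 24  # nvidia-l4 x1, as stated above

model_gb = weights_gb(1.7)               # weights only
headroom_gb = GPU_MEMORY_GB - model_gb   # left for vocoder, activations, caches

print(f"weights: {model_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```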

Lifecycle (from the video_lab repo)

python scripts/_run_qwen_tts_endpoint.py --start    # create + wait for ready
python scripts/_run_qwen_tts_endpoint.py --status   # show state + URL
python scripts/_run_qwen_tts_endpoint.py --stop     # delete (full teardown)
