Maya1 — NVFP4 (compressed-tensors, for vLLM)
A 4-bit NVFP4 quantization of maya-research/maya1
(Apache-2.0), packaged as compressed-tensors so it loads directly in vLLM with the native
SM120/SM121 CUTLASS NVFP4 GEMM. Built and measured on an NVIDIA DGX Spark (GB10, sm_121a); it
also runs on any Blackwell GPU with vLLM (the RTX 50-series shares the same warp-level FP4 path).
- ~2.5 GB weights (down from 6.2 GB in bf16); ~12 GB GPU-resident when served.
- Keeps Maya's emotion tags + natural-language voice design.
Why this quant
Single-stream TTS decode is memory-bandwidth bound. Quantizing the weights raises throughput on bandwidth-limited GPUs (e.g. GB10's ~273 GB/s unified memory). Measured on GB10 (decode tok/s): bf16 28 → fp8 54 → NVFP4 ~72–75.
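As a rough sanity check on the bandwidth-bound claim, the sketch below computes a roofline-style ceiling: if every decoded token has to stream all weights from memory once, tok/s cannot exceed bandwidth divided by weight size. It uses only figures quoted in this card (273 GB/s, 6.2 GB, 2.5 GB, and the measured rates). Ignoring KV-cache reads, activations, and kernel overhead is a simplifying assumption of the sketch, not something measured here.

```python
# Roofline-style ceiling for single-stream decode: each new token streams
# all weights from memory once, so tok/s <= bandwidth / weight_bytes.
# Simplification (assumption): KV-cache and activation traffic are ignored.

BANDWIDTH_GBPS = 273.0  # GB10 unified memory, as quoted in this card


def decode_ceiling_tok_s(weight_gb: float, bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Upper bound on decode tokens/s when weight reads dominate."""
    return bandwidth_gbps / weight_gb


# (weight size in GB, measured tok/s) from this card;
# NVFP4 uses the low end of the measured 72-75 range.
cases = {"bf16": (6.2, 28.0), "NVFP4": (2.5, 72.0)}

for name, (weight_gb, measured) in cases.items():
    ceiling = decode_ceiling_tok_s(weight_gb)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured {measured:.0f} "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under this simplified model both formats reach roughly two-thirds of their ceiling, which is what one would expect if throughput tracks the number of weight bytes read per token.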
Calibration matters for a TTS model. Generic-text calibration mis-scales the audio-token path. This checkpoint was calibrated on 96 in-distribution Maya sequences (real
`<description=…>` text prompts + their generated SNAC token streams), so the emotion/audio tokens are properly represented. Quant method: llm-compressor `QuantizationModifier(scheme="NVFP4")`, `lm_head` left unquantized.
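A recipe of the kind described above might look roughly like the following. This is a minimal sketch, not the exact script used for this checkpoint: `MAX_SEQ_LENGTH`, `SAVE_DIR`, the `targets="Linear"` choice, and the shape of `calibration_ds` are assumptions, and llm-compressor import paths can differ between versions. How the 96 calibration sequences were assembled is not shown.

```python
# Sketch of a one-shot NVFP4 quantization with llm-compressor.
# Heavy imports live inside the function so the module loads without a GPU.

BASE_MODEL = "maya-research/maya1"
NUM_CALIBRATION_SAMPLES = 96   # matches the 96 sequences mentioned in this card
MAX_SEQ_LENGTH = 4096          # assumption: not stated in this card
SAVE_DIR = "maya1-nvfp4"       # hypothetical output directory


def quantize(calibration_ds):
    """Quantize the base model to NVFP4, leaving lm_head in full precision.

    calibration_ds is assumed to be a datasets.Dataset of already tokenized
    sequences (input_ids, attention_mask) built from description prompts
    plus their SNAC token streams.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

    # NVFP4 on Linear layers; lm_head is excluded as stated in the card.
    recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

    oneshot(
        model=model,
        dataset=calibration_ds,
        recipe=recipe,
        max_seq_length=MAX_SEQ_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # save_compressed=True writes the compressed-tensors format that vLLM loads.
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)
```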
Measured (GB10, NVFP4)
| Metric | Value |
|---|---|
| First-audio (streamed) | ~0.46 s |
| Throughput | ~72–75 tok/s |
| RTF | ~1.25 |
| GPU resident | ~12 GB |
| Sample rate | 24 kHz mono |
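To relate the throughput and RTF rows, the sketch below estimates a decode-only real-time factor from tok/s. It rests on assumptions that are not stated in this card: RTF is taken to mean generation time divided by audio duration, and each SNAC frame is taken to be 7 tokens covering 2048 samples at 24 kHz.

```python
SAMPLE_RATE = 24_000       # Hz, from the table in this card
SAMPLES_PER_FRAME = 2048   # assumption: hop of the coarsest SNAC level at 24 kHz
TOKENS_PER_FRAME = 7       # assumption: 1 + 2 + 4 codes per frame


def tokens_per_audio_second() -> float:
    """Tokens the LM must emit to produce one second of audio."""
    return TOKENS_PER_FRAME * SAMPLE_RATE / SAMPLES_PER_FRAME


def decode_only_rtf(tok_per_s: float) -> float:
    """Seconds of generation per second of audio, counting token decode only."""
    return tokens_per_audio_second() / tok_per_s


for rate in (72.0, 75.0):
    print(f"{rate:.0f} tok/s -> decode-only RTF ~{decode_only_rtf(rate):.2f}")
```

Under these assumptions token decode alone gives about 1.09 to 1.14. The measured ~1.25 is an end-to-end figure, so it presumably also covers work outside token decode.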
How to run
Maya emits SNAC audio tokens, not audio — you need (1) vLLM to serve the LM and (2) a thin decoder that formats the prompt, parses the SNAC tokens, and decodes them to a waveform. Both below.
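To make the "parses the SNAC tokens" step concrete, here is a minimal sketch of unpacking a token stream into the three SNAC codebook levels. The id range (128266 to 156937), the 4096-entry codebooks, and the 7-token frame layout are assumptions based on the Orpheus-style convention; check them against `maya_tts_server.py` before relying on them.

```python
# Assumed token layout (Orpheus-style, 7 tokens per frame).
SNAC_MIN_ID = 128266    # assumption: first SNAC token id in the vocabulary
SNAC_MAX_ID = 156937    # assumption: last SNAC token id (7 * 4096 ids in total)
CODEBOOK_SIZE = 4096
TOKENS_PER_FRAME = 7


def unpack_snac(token_ids):
    """Split LM token ids into SNAC code lists for levels 1, 2 and 3.

    Non-SNAC ids (text, control tokens) are dropped, and a trailing
    incomplete frame is discarded.
    """
    snac = [t for t in token_ids if SNAC_MIN_ID <= t <= SNAC_MAX_ID]
    n_frames = len(snac) // TOKENS_PER_FRAME
    l1, l2, l3 = [], [], []
    for i in range(n_frames):
        frame = snac[i * TOKENS_PER_FRAME:(i + 1) * TOKENS_PER_FRAME]
        codes = [(t - SNAC_MIN_ID) % CODEBOOK_SIZE for t in frame]
        l1.append(codes[0])                                   # 1 coarse code
        l2.extend([codes[1], codes[4]])                       # 2 mid codes
        l3.extend([codes[2], codes[3], codes[5], codes[6]])   # 4 fine codes
    return l1, l2, l3
```

The three lists would then be converted to tensors of shape (1, T) and passed to the SNAC decoder (in the `snac` package, `SNAC.from_pretrained("hubertsiuzdak/snac_24khz")` followed by `decode`) to obtain the 24 kHz waveform.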
1) Serve the LM with vLLM (compressed-tensors NVFP4 is auto-detected):
vllm serve <this-repo> --served-model-name maya1 \
--return-tokens-as-token-ids --max-model-len 4096 --trust-remote-code
# DGX Spark / aarch64: use the vllm/vllm-openai:*-aarch64-cu130 image.
2) Decode SNAC → audio with the included wrapper (maya_tts_server.py), which exposes an
OpenAI-compatible /v1/audio/speech (streaming raw PCM or wav):
pip install snac fastapi uvicorn httpx soundfile numpy torch
VLLM_URL=http://localhost:8002 python maya_tts_server.py # serves :8003; VLLM_URL must match your vLLM port (vllm serve defaults to 8000)
curl -X POST http://localhost:8003/v1/audio/speech -H "Content-Type: application/json" \
-d '{"input":"Our update <laugh_harder> finally ships!",
"description":"American female, 20s, warm, fast pacing.",
"stream":true,"stream_format":"audio","response_format":"pcm"}' | ffplay -f s16le -ar 24000 -ac 1 -
Usage notes
- `description` = natural-language voice design (gender/age/accent/pitch/timbre/pace).
- Emotion tags: `<laugh_harder>` `<sigh>` `<whisper>` `<angry>` `<giggle>` `<chuckle>` `<gasp>` `<cry>` … Plain `<laugh>` can read subtle on some voices; prefer `<laugh_harder>` for an audible laugh.
- `fast pacing` in the description tightens delivery (lower total latency).
- Stream the PCM and play as it arrives to get the ~0.46 s first-audio.
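The streaming note above can be tried from Python with a small standard-library client. This is a sketch under assumptions: it reuses the endpoint and JSON fields from the curl example, treats the stream as 16-bit mono PCM at 24 kHz, and the helper names `pcm_to_wav` and `stream_speech` are made up for illustration. It records time-to-first-audio and collects the PCM rather than playing it live.

```python
import io
import json
import time
import urllib.request
import wave

SPEECH_URL = "http://localhost:8003/v1/audio/speech"  # wrapper address from the run steps
SAMPLE_RATE = 24_000


def pcm_to_wav(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # s16le
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


def stream_speech(text: str, description: str, url: str = SPEECH_URL):
    """POST a request and return (pcm_bytes, seconds_until_first_chunk)."""
    payload = {"input": text, "description": description, "stream": True,
               "stream_format": "audio", "response_format": "pcm"}
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    start = time.monotonic()
    first_audio = None
    chunks = []
    with urllib.request.urlopen(req) as resp:
        while True:
            chunk = resp.read1(4096)  # returns as soon as some bytes are available
            if not chunk:
                break
            if first_audio is None:
                first_audio = time.monotonic() - start
            chunks.append(chunk)
    return b"".join(chunks), first_audio


# Example use (requires the wrapper to be running):
# pcm, ttfa = stream_speech("Our update <laugh_harder> finally ships!",
#                           "American female, 20s, warm, fast pacing.")
# open("out.wav", "wb").write(pcm_to_wav(pcm))
```

To actually play audio as it arrives, each chunk would be handed to an audio output instead of being appended to a list; the curl example does the same by piping into ffplay.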
Samples
Voice design — gritty man
Emotion — angry (<angry>)
Emotion — laugh (<laugh_harder>)
Style — whisper
License & ethics
Derivative of maya-research/maya1, Apache-2.0. Quantization does not change the license. Do not use
for unauthorized voice cloning, impersonation, fraud, or any unlawful/unethical purpose.
Attribution
Base model: Maya1 by Maya Research. Quantization + serving wrapper: Sggin1.
Codec: hubertsiuzdak/snac_24khz.