Maya1 — NVFP4 (compressed-tensors, for vLLM)

A 4-bit NVFP4 quantization of maya-research/maya1 (Apache-2.0), packaged as compressed-tensors so it loads directly in vLLM and uses the native SM120/SM121 CUTLASS NVFP4 GEMM. Built and measured on an NVIDIA DGX Spark (GB10, sm_121a); it should also run on any Blackwell GPU that vLLM supports (RTX 50-series cards share the same warp-level FP4 path).

  • ~2.5 GB of weights (down from 6.2 GB in bf16); ~12 GB GPU-resident when served.
  • Keeps Maya's emotion tags + natural-language voice design.

Why this quant

Single-stream TTS decode is memory-bandwidth bound. Quantizing the weights raises throughput on bandwidth-limited GPUs (e.g. GB10's ~273 GB/s unified memory). Measured on GB10 (decode tok/s): bf16 28 → fp8 54 → NVFP4 ~72–75.
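
A rough way to see why weight size dominates: in single-stream decode, each generated token has to read roughly every weight once, so memory bandwidth divided by weight bytes gives an upper bound on tok/s. The sketch below applies that to the figures quoted in this card (273 GB/s, 6.2 GB bf16, 2.5 GB NVFP4). It ignores KV-cache traffic, activations, and kernel overheads, so treat the results as ceilings under a simplifying assumption, not predictions.

```python
# Back-of-envelope ceiling for bandwidth-bound decode.
# ASSUMPTION: every generated token streams all weights from memory exactly once.
BANDWIDTH_GB_S = 273.0  # GB10 unified memory, approximate figure quoted in this card


def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Upper bound on tokens/s if reading the weights were the only cost."""
    return bandwidth_gb_s / weight_gb


bf16_ceiling = decode_ceiling_tok_s(6.2)   # bf16 checkpoint size from this card
nvfp4_ceiling = decode_ceiling_tok_s(2.5)  # NVFP4 checkpoint size from this card

print(f"bf16  ceiling: {bf16_ceiling:.0f} tok/s (measured 28)")
print(f"NVFP4 ceiling: {nvfp4_ceiling:.0f} tok/s (measured ~72-75)")
print(f"ceiling ratio: {nvfp4_ceiling / bf16_ceiling:.2f}x")
```

Under these assumptions the ceilings come out near 44 and 109 tok/s, a ratio of about 2.5x, which is roughly in line with the measured bf16 → NVFP4 speedup.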

Calibration matters for a TTS model. Generic-text calibration mis-scales the audio-token path. This checkpoint was calibrated on 96 in-distribution Maya sequences (real <description=…> text prompts + their generated SNAC token streams), so the emotion/audio tokens are properly represented. Quant method: llm-compressor QuantizationModifier(scheme="NVFP4"), lm_head left unquantized.
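
The calibration described above could be reproduced along the following lines. This is a minimal sketch, not the exact script used for this checkpoint: `build_calibration_rows`, `quantize_nvfp4`, and the `pairs` argument are hypothetical names, `targets="Linear"` is an assumption borrowed from typical llm-compressor recipes, and the prompt tokenization is left to the caller because the exact prompt format is not spelled out here.

```python
# Sketch of in-distribution NVFP4 calibration with llm-compressor.
# Heavy imports live inside quantize_nvfp4() so the pure helper works without a GPU stack.

def build_calibration_rows(pairs, max_len=4096):
    """Join each (prompt_ids, snac_ids) pair into one calibration sequence.

    `pairs` is a hypothetical list of already-tokenized text prompts and the
    SNAC token ids the model generated for them.
    """
    rows = []
    for prompt_ids, snac_ids in pairs:
        ids = (list(prompt_ids) + list(snac_ids))[:max_len]
        rows.append({"input_ids": ids, "attention_mask": [1] * len(ids)})
    return rows


def quantize_nvfp4(model_id, pairs, save_dir, max_len=4096):
    """Calibrate on `pairs` and save a compressed-tensors NVFP4 checkpoint."""
    from datasets import Dataset
    from transformers import AutoModelForCausalLM
    # Older llm-compressor releases expose this as llmcompressor.transformers.oneshot.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    rows = build_calibration_rows(pairs, max_len)
    # lm_head stays unquantized, as stated in this card.
    recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
    oneshot(
        model=model,
        dataset=Dataset.from_list(rows),
        recipe=recipe,
        max_seq_length=max_len,
        num_calibration_samples=len(rows),
    )
    model.save_pretrained(save_dir, save_compressed=True)
```

With 96 prompt/audio pairs in `pairs`, a call such as `quantize_nvfp4("maya-research/maya1", pairs, "maya1-nvfp4")` would correspond to the setup described above; the output directory name is illustrative.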

Measured (GB10, NVFP4)

Metric                    Value
First-audio (streamed)    ~0.46 s
Throughput                ~72–75 tok/s
RTF (real-time factor)    ~1.25
GPU resident              ~12 GB
Sample rate               24 kHz mono

How to run

Maya emits SNAC audio tokens, not audio, so you need (1) vLLM to serve the LM and (2) a thin decoder that formats the prompt, parses the SNAC tokens, and decodes them to a waveform. Both steps are shown below.
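
For orientation, the parse-and-decode part of that thin decoder can be sketched as follows. The included maya_tts_server.py (step 2) already does this for you. The token-id constants and the 7-tokens-per-frame layout are assumptions that are not stated in this card; verify them against the base model card and the wrapper source before relying on them. `unpack_snac` and `decode_to_waveform` are hypothetical helper names.

```python
# Sketch: regroup Maya's flat SNAC token stream into the codec's three codebook levels.
# ASSUMPTION: the id constants and the 7-slot frame layout below are not given in this
# card; check them against the base model card and maya_tts_server.py.
SNAC_MIN_ID = 128266   # assumed id of the first SNAC code token
SNAC_MAX_ID = 156937   # assumed id of the last one (7 slots x 4096 codes)
CODEBOOK_SIZE = 4096
TOKENS_PER_FRAME = 7


def unpack_snac(token_ids):
    """Return [level1, level2, level3] code lists from generated token ids."""
    codes = [t for t in token_ids if SNAC_MIN_ID <= t <= SNAC_MAX_ID]
    n_frames = len(codes) // TOKENS_PER_FRAME  # drops a trailing partial frame
    l1, l2, l3 = [], [], []
    for f in range(n_frames):
        frame = codes[f * TOKENS_PER_FRAME:(f + 1) * TOKENS_PER_FRAME]
        s = [(t - SNAC_MIN_ID) % CODEBOOK_SIZE for t in frame]
        l1.append(s[0])
        l2.extend([s[1], s[4]])
        l3.extend([s[2], s[3], s[5], s[6]])
    return [l1, l2, l3]


def decode_to_waveform(token_ids, device="cpu"):
    """Decode token ids to a float waveform at 24 kHz with the SNAC codec."""
    import torch
    from snac import SNAC

    model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to(device)
    levels = [
        torch.tensor(level, dtype=torch.long, device=device).unsqueeze(0)
        for level in unpack_snac(token_ids)
    ]
    with torch.inference_mode():
        audio = model.decode(levels)  # shape (1, 1, samples)
    return audio.squeeze().cpu().numpy()
```

Filtering by id range first means text tokens and start/end markers in the stream are skipped, and an incomplete final frame is dropped instead of being decoded.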

1) Serve the LM with vLLM (compressed-tensors NVFP4 is auto-detected):

vllm serve <this-repo> --served-model-name maya1 --port 8002 \
  --return-tokens-as-token-ids --max-model-len 4096 --trust-remote-code
# DGX Spark / aarch64: use the vllm/vllm-openai:*-aarch64-cu130 image.

2) Decode SNAC → audio with the included wrapper (maya_tts_server.py), which exposes an OpenAI-compatible /v1/audio/speech (streaming raw PCM or wav):

pip install snac fastapi uvicorn httpx soundfile numpy torch
VLLM_URL=http://localhost:8002 python maya_tts_server.py   # serves :8003
curl -X POST http://localhost:8003/v1/audio/speech -H "Content-Type: application/json" \
  -d '{"input":"Our update <laugh_harder> finally ships!",
       "description":"American female, 20s, warm, fast pacing.",
       "stream":true,"stream_format":"audio","response_format":"pcm"}' | ffplay -f s16le -ar 24000 -ac 1 -

Usage notes

  • description = natural-language voice design (gender/age/accent/pitch/timbre/pace).
  • Emotion tags: <laugh_harder> <sigh> <whisper> <angry> <giggle> <chuckle> <gasp> <cry> … Plain <laugh> can sound subtle on some voices; prefer <laugh_harder> for an audible laugh.
  • Adding "fast pacing" to the description tightens delivery (lower total latency).
  • Stream the PCM and play it as it arrives to get the ~0.46 s first-audio latency.
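
A Python client for the same endpoint might look like the sketch below. It assumes the wrapper is listening on localhost:8003 and accepts the same JSON fields as the curl example above; `stream_speech` and `pcm_to_wav_bytes` are hypothetical helper names. Chunks are yielded as they arrive, so a player can start before generation finishes.

```python
# Sketch: stream raw PCM from the wrapper and optionally wrap it in a WAV container.
import io
import wave

SAMPLE_RATE = 24_000  # 24 kHz mono, 16-bit little-endian, per this card


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap mono s16le PCM in a WAV header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


def stream_speech(text, description, url="http://localhost:8003/v1/audio/speech"):
    """Yield PCM chunks as the server produces them."""
    import httpx

    payload = {
        "input": text,
        "description": description,
        "stream": True,
        "stream_format": "audio",
        "response_format": "pcm",
    }
    with httpx.stream("POST", url, json=payload, timeout=None) as response:
        response.raise_for_status()
        for chunk in response.iter_bytes():
            yield chunk
```

To save a clip instead of playing it live, join the chunks and write `pcm_to_wav_bytes(b"".join(stream_speech(text, description)))` to a file.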

Samples

Voice design — gritty man

Emotion — angry (<angry>)

Emotion — laugh (<laugh_harder>)

Style — whisper

License & ethics

Derivative of maya-research/maya1, released under Apache-2.0. Quantization does not change the license. Do not use for unauthorized voice cloning, impersonation, fraud, or any unlawful/unethical purpose.

Attribution

Base model: Maya1 by Maya Research. Quantization + serving wrapper: Sggin1. Codec: hubertsiuzdak/snac_24khz.
