kyaky/Qwen3.6-35B-A3B-NVFP4 ⚡

The smallest public NVFP4 build of Qwen3.6-35B-A3B, tuned for a rare mix of compact size, strong benchmark quality, and fast vLLM serving.

  • Tiny: 22.5 GB, about 3x smaller than the 67 GB BF16 base, and the smallest of all public NVFP4 builds in this comparison.
  • Fast: 200.6 tok/s single-stream, ahead of RedHatAI at 170.0 tok/s and unsloth at 175.3 tok/s.
  • Quality: MMLU-Pro 0.825, tied for #1, and GSM8K 0.920, #1 of the field.

NVIDIA's own build is faster at 223.6 tok/s because it runs on its native sm120 hardware optimum. This build is positioned as the Pareto-best quality+size option while still beating the rest of the open field on single-stream speed.

Per-metric leaderboard vs the public NVFP4 builds

Comparison

Model Size GB MMLU-Pro GSM8K Single-stream tok/s N16 throughput
kyaky/Qwen3.6-35B-A3B-NVFP4 22.5 0.825 0.920 200.6 1581
nvidia 23.4 0.817 0.910 223.6 1646
redhat 24.0 0.819 0.910 170.0 1422
unsloth 24.7 0.825 0.890 175.3 1493

Benchmark bar chart

Measured on NVIDIA RTX PRO 6000 (Blackwell, sm120) with vLLM 0.23, TP1. Quality via lm-eval (thinking-on, flexible-extract); speed read from the engine /metrics. MMLU-Pro --limit is per-subject (≈1.4k items).

Quantization Recipe

This is a self-quantized NVFP4 build of Qwen/Qwen3.6-35B-A3B produced with llm-compressor.

  • FP8: attention, GDN, and shared-expert paths
  • NVFP4 weight-only W4A16: routed MoE experts
  • BF16: lm_head, router, and vision components
  • Calibration: all-expert calibration (every one of the 256 experts receives statistics)
  • Serving target: vLLM with Marlin on Blackwell sm120

The recipe keeps sensitive routing and output layers in BF16, uses FP8 where activation-aware compression is a good fit, and applies NVFP4 weight-only compression to the routed experts where the footprint win matters most. On sm120 the native FP4 MoE GEMM is unavailable, so NVFP4 experts serve through the Marlin W4A16 path either way — weight-only therefore delivers the same speed with less quantization error.

Serve with vLLM

vllm serve kyaky/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name qwen3.6-35b-a3b-nvfp4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml

This model thinks by default (<think>…</think>, parsed out by --reasoning-parser qwen3) and supports tool calling via the Qwen3 XML parser. Best served on Blackwell sm120 hardware, where vLLM routes the NVFP4 experts through the Marlin path automatically.

Why This Build

kyaky/Qwen3.6-35B-A3B-NVFP4 is for users who want the Qwen3.6-35B-A3B capability profile in a much smaller package without giving up the headline quality scores. It is the most compact model in the field, matches or leads on quality, and remains fast enough to beat the non-NVIDIA public NVFP4 builds on single-stream generation.

Downloads last month
59
Safetensors
Model size
21B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for kyaky/Qwen3.6-35B-A3B-NVFP4

Quantized
(544)
this model