kyaky/Qwen3.6-35B-A3B-NVFP4 ⚡

The smallest public NVFP4 build of Qwen3.6-35B-A3B, tuned for a rare mix of compact size, strong benchmark quality, and fast vLLM serving.

Tiny: 22.5 GB, about 3x smaller than the 67 GB BF16 base, and the smallest of all public NVFP4 builds in this comparison.
Fast: 200.6 tok/s single-stream, ahead of RedHatAI at 170.0 tok/s and unsloth at 175.3 tok/s.
Quality: MMLU-Pro 0.825, tied for #1, and GSM8K 0.920, #1 of the field.

NVIDIA's own build is faster at 223.6 tok/s because it runs on its native sm120 hardware optimum. This build is positioned as the Pareto-best quality+size option while still beating the rest of the open field on single-stream speed.

Comparison

Model	Size GB	MMLU-Pro	GSM8K	Single-stream tok/s	N16 throughput
kyaky/Qwen3.6-35B-A3B-NVFP4	22.5	0.825	0.920	200.6	1581
nvidia	23.4	0.817	0.910	223.6	1646
redhat	24.0	0.819	0.910	170.0	1422
unsloth	24.7	0.825	0.890	175.3	1493

Measured on NVIDIA RTX PRO 6000 (Blackwell, sm120) with vLLM 0.23, TP1. Quality via lm-eval (thinking-on, flexible-extract); speed read from the engine /metrics. MMLU-Pro --limit is per-subject (≈1.4k items).

Quantization Recipe

This is a self-quantized NVFP4 build of Qwen/Qwen3.6-35B-A3B produced with llm-compressor.

FP8: attention, GDN, and shared-expert paths
NVFP4 weight-only W4A16: routed MoE experts
BF16: lm_head, router, and vision components
Calibration: all-expert calibration (every one of the 256 experts receives statistics)
Serving target: vLLM with Marlin on Blackwell sm120

The recipe keeps sensitive routing and output layers in BF16, uses FP8 where activation-aware compression is a good fit, and applies NVFP4 weight-only compression to the routed experts where the footprint win matters most. On sm120 the native FP4 MoE GEMM is unavailable, so NVFP4 experts serve through the Marlin W4A16 path either way — weight-only therefore delivers the same speed with less quantization error.

Serve with vLLM

vllm serve kyaky/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name qwen3.6-35b-a3b-nvfp4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml

This model thinks by default (<think>…</think>, parsed out by --reasoning-parser qwen3) and supports tool calling via the Qwen3 XML parser. Best served on Blackwell sm120 hardware, where vLLM routes the NVFP4 experts through the Marlin path automatically.

Why This Build

kyaky/Qwen3.6-35B-A3B-NVFP4 is for users who want the Qwen3.6-35B-A3B capability profile in a much smaller package without giving up the headline quality scores. It is the most compact model in the field, matches or leads on quality, and remains fast enough to beat the non-NVIDIA public NVFP4 builds on single-stream generation.

Downloads last month: 59

Safetensors

Model size

21B params

Tensor type

F32

BF16

F8_E4M3

Model tree for kyaky/Qwen3.6-35B-A3B-NVFP4

Base model

Qwen/Qwen3.6-35B-A3B

Quantized

(544)

this model