Can you run NVFP4 + Marlin?

#2
by einsteiner1983 - opened

Wondering how different this is from NVFP4 + Marlin kernels?

Canada Quant Labs org

On “NVFP4 + Marlin”: Hopper (H200) has no FP4 tensor cores, so vLLM already serves NVFP4 quants through the Marlin WNA16 path. In other words, the nvidia/GLM-5.2-NVFP4 column in our throughput tables (README / RESULTS.md / model card) is effectively NVFP4-on-Marlin. On Blackwell (sm_100/sm_120), NVFP4 gets native FP4 tensor cores, although on sm_120 vLLM still falls back to Marlin if the FP4 backend isn’t selected. We haven’t measured the native-Blackwell case yet (8× RTX PRO 6000 benchmarks are running now).
How this model differs: it’s INT4 W4A16 (GPTQ, asymmetric, group size 128), not FP4, and it ships a working MTP (multi-token-prediction) head for speculative decoding, which neither the NVFP4 nor the AWQ-INT4 community quants do. That’s the whole story behind the numbers: MTP puts us ahead at low and mid concurrency (c1 +79% vs NVFP4, c8 +14%), while the no-MTP FP4/INT4 quants edge ~15% ahead at full saturation (c32). All of these are ~4-bit quants of the same base model, so this is a throughput comparison, not a quality one.
Happy to go deeper if you meant something more specific by “NVFP4 + Marlin.”
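
The hardware split described above can be written down as a small lookup. This is only a sketch of the claims in this reply, not vLLM's actual backend-selection logic: the function name `nvfp4_path` and the returned labels are made up for illustration, and it assumes the usual compute-capability numbers (9.0 for Hopper, 10.0 for sm_100, 12.0 for sm_120).

```python
def nvfp4_path(major: int, minor: int = 0) -> str:
    """Return the NVFP4 execution path this thread describes for a GPU.

    Hypothetical helper: it encodes the reply's claims, not vLLM internals.
    """
    sm = major * 10 + minor
    if sm == 90:
        # Hopper has no FP4 tensor cores, so NVFP4 weights run through Marlin.
        return "marlin-weight-only"
    if sm == 100:
        # Blackwell sm_100: native FP4 tensor cores.
        return "native-fp4"
    if sm == 120:
        # Blackwell sm_120: native FP4, but per the reply vLLM can still
        # fall back to Marlin when the FP4 backend is not selected.
        return "native-fp4-or-marlin"
    # Anything else is outside what this thread covers.
    return "unknown"
```

With PyTorch installed, the arguments could come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple for the current device.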

Thanks for the reply. I am not 100% sure and may test it myself today if I have time. I was just thinking that https://huggingface.co/nvidia/GLM-5.2-NVFP4 has MTP heads, so what if I ran it with vLLM as W4A16 or W4A8 (newer)? Would that be similar, without making this new INT4 quant? Would W4A8 be even better?

Is this very different?
vllm serve nvidia/GLM-5.2-NVFP4 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --linear-backend marlin \
  --moe-backend marlin \
  --trust-remote-code \
  --kv-cache-dtype fp8_e4m3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":5}'
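
For anyone scripting this comparison, the JSON passed to --speculative-config is easy to get wrong with shell quoting, so here is a minimal sketch that assembles the same command as an argument list. The helper name `build_serve_command` is hypothetical, and the flags are copied verbatim from the command above without being checked against any particular vLLM release.

```python
import json
import shlex


def build_serve_command(model: str, num_speculative_tokens: int = 5) -> list[str]:
    """Assemble the `vllm serve` argv from the post (flags copied, unverified)."""
    # Serialize the speculative-decoding settings as compact JSON.
    spec = json.dumps(
        {"method": "mtp", "num_speculative_tokens": num_speculative_tokens},
        separators=(",", ":"),
    )
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "8",
        "--enable-expert-parallel",
        "--linear-backend", "marlin",
        "--moe-backend", "marlin",
        "--trust-remote-code",
        "--kv-cache-dtype", "fp8_e4m3",
        "--speculative-config", spec,
    ]


cmd = build_serve_command("nvidia/GLM-5.2-NVFP4")
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list straight to `subprocess.run(cmd)` avoids the quoting step entirely; varying `num_speculative_tokens` makes it simple to sweep the MTP draft length between runs.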

Canada Quant Labs org

Depends on what hardware you are using...
