canada-quant/GLM-5.2-W4A16-MTP · Can you run NVFP4 + Marlin?

by einsteiner1983 - opened 2 days ago

einsteiner1983

2 days ago

Wondering how different this is from NVFP4 + Marlin kernels?

pastapaul

Canada Quant Labs org 2 days ago

On “NVFP4 + Marlin”: Hopper/H200 has no FP4 tensor cores, so vLLM already serves NVFP4 quants through the Marlin WNA16 path. In other words, the nvidia/GLM-5.2-NVFP4 column in our throughput tables (README / RESULTS.md / model card) is effectively NVFP4-on-Marlin. On Blackwell (sm_100/sm_120), NVFP4 gets native FP4 tensor cores, although on SM120 vLLM also falls back to Marlin if the FP4 backend isn’t selected. We haven’t measured the native-Blackwell case yet (8× RTX PRO 6000 benchmarks are running now).
How this model differs: it’s INT4 W4A16 (GPTQ, asymmetric, group-128), not FP4, and it ships a working MTP (multi-token-prediction) head for speculative decoding, which neither the NVFP4 nor the AWQ-INT4 community quants do. That’s the whole story behind the numbers: we lead at low/mid concurrency (c1 +79% vs NVFP4, c8 +14%) thanks to MTP, while the no-MTP FP4/INT4 quants edge ~15% ahead at full saturation (c32). These are all ~4-bit quants of the same base model, so it’s a throughput comparison, not a quality one.
Happy to go deeper if you meant something more specific by “NVFP4 + Marlin.”
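
For anyone unsure what "asymmetric, group-128" means in practice, here is a minimal sketch of the storage format only. It uses plain round-to-nearest, not GPTQ's error-compensated rounding, and it is not the code used to produce this model; the function names are made up for illustration. Each group of 128 weights gets its own scale and zero point, and each weight is stored as a 4-bit integer in 0..15.

```python
import math

def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group.

    Illustrative only: GPTQ picks the integers differently (it compensates
    rounding error using calibration data), but the stored format is the same
    idea: integers in [0, 2**bits - 1] plus one scale and zero point per group.
    """
    qmax = (1 << bits) - 1                 # 15 for 4-bit
    lo = min(min(weights), 0.0)            # keep 0.0 inside the range
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0        # avoid a zero scale for all-zero groups
    zero = round(-lo / scale)              # integer that represents 0.0
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Recover approximate float weights (what a W4A16 kernel does before the matmul)."""
    return [(v - zero) * scale for v in q]

def quantize_row(row, group_size=128):
    """Split one weight row into groups and quantize each independently."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]

# Hypothetical weights: 256 values, so two groups of 128.
row = [0.1 * math.sin(0.37 * i) for i in range(256)]
groups = quantize_row(row)
restored = [w for q, s, z in groups for w in dequantize_group(q, s, z)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

The reconstruction error per weight is bounded by about half a quantization step (`scale / 2`), which is why smaller groups track the weights more closely at the cost of storing more scales and zero points.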

einsteiner1983

1 day ago

Thanks for the reply. I am not 100% sure and may test it myself today if I have time. I was just thinking: https://huggingface.co/nvidia/GLM-5.2-NVFP4 has MTP heads, so what if I ran that with vLLM's W4A16 or W4A8 (newer) path? Would it be similar, without making this new INT4 quant? Would W4A8 be even better?

einsteiner1983

1 day ago

Is this very different?
vllm serve nvidia/GLM-5.2-NVFP4 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --linear-backend marlin \
  --moe-backend marlin \
  --trust-remote-code \
  --kv-cache-dtype fp8_e4m3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":5}'
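
For context on the `num_speculative_tokens` setting in the command above, here is a rough back-of-the-envelope sketch, not anything vLLM computes. It uses the standard speculative-decoding estimate and assumes each draft token is accepted independently with the same probability; the acceptance rates in the loop are hypothetical placeholders, not measurements of either model.

```python
def expected_tokens_per_step(accept_rate, num_speculative_tokens):
    """Expected tokens emitted per target-model verification pass.

    Assumes i.i.d. acceptance with probability `accept_rate` per draft token.
    The count includes the one token the target model always contributes,
    so the result is 1.0 when nothing is accepted.
    """
    k = num_speculative_tokens
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)

# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_step(rate, 5), 2))
```

Under this simple model the benefit depends heavily on how often the MTP head's drafts are accepted, so two checkpoints run with the same flags can still behave differently.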

pastapaul

Canada Quant Labs org 1 day ago

Depends on what hardware you are using...
