## Real-World Performance on 4x RTX PRO 6000 (SM120) -- Honest Numbers

#7
by brandonmusic - opened


I want to share verified benchmark results for this model on SM120 (RTX PRO 6000 Blackwell Workstation Edition, 96GB GDDR7 each), since the performance claims floating around are inflated and misleading for anyone considering this hardware. If anyone else can get higher numbers I'd love to hear them!

Best Verified Result: 50.5 tok/s Sustained Decode

Configuration: vLLM 0.17.0rc1 nightly, TP=4, Marlin W4A16 MoE backend, no MTP, FLASHINFER attention, FP8 KV cache, 262K context, CUDA graphs + torch.compile enabled.

This is using the Marlin W4A16 fallback, not native NVFP4 CUTLASS. The native path is broken on SM120.

Why Not Faster?

The CUTLASS TMA Warp Specialized grouped GEMM kernels -- the fast path that makes NVFP4 worth using -- fail at runtime on SM120. All 80 fast tactics produce initialization errors (CUTLASS issue #3096). This forces a fallback to Marlin, which dequantizes FP4 weights to FP16 and runs standard GEMM. You lose roughly half the theoretical throughput.

MTP (Multi-Token Prediction) makes things worse, not better, on the Marlin path: -22% throughput (50.5 down to 39.6 tok/s). The MTP heads were trained on native FP4 activations, so they mispredict when running on Marlin's W4A16 activations (61-85% acceptance vs 89% baseline).
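The -22% figure can be checked from the two measured rates quoted above. A minimal sketch of that arithmetic (the helper name `percent_change` is just for illustration):

```python
def percent_change(baseline: float, measured: float) -> float:
    """Relative change of `measured` versus `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


baseline_tok_s = 50.5  # Marlin TP=4, no MTP (from the post)
mtp_tok_s = 39.6       # Marlin TP=4 with MTP (from the post)

change = percent_change(baseline_tok_s, mtp_tok_s)
print(f"{change:.1f}%")  # roughly -21.6%, quoted as -22% in the post
```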

About the 130+ tok/s Claims

A community member has claimed 130-150 tok/s on the same hardware via custom forks. I reviewed both forks (SGLang and vLLM variants) and found zero kernel-level changes -- they use the same broken CUTLASS fallback. The Python-level changes (quantization config, MTP state management) cannot explain a 2.5x speedup. These numbers likely include speculative token counting (counting proposed-then-rejected MTP tokens as throughput) or are burst measurements, not sustained decode over 1000+ tokens.
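To make the speculative-token-counting concern concrete, here is a small sketch of how the two ways of counting diverge. The `DecodeRun` type and every number in it are hypothetical illustrations, not measurements from either fork:

```python
from dataclasses import dataclass


@dataclass
class DecodeRun:
    """Counters from one decode run (hypothetical values, not measurements)."""
    seconds: float        # wall-clock decode time, prefill excluded
    emitted_tokens: int   # tokens actually returned to the user
    rejected_drafts: int  # draft tokens proposed by MTP and then rejected


def sustained_tok_s(run: DecodeRun) -> float:
    """Throughput counting only tokens that reach the output."""
    return run.emitted_tokens / run.seconds


def inflated_tok_s(run: DecodeRun) -> float:
    """Throughput if rejected draft tokens are (wrongly) counted too."""
    return (run.emitted_tokens + run.rejected_drafts) / run.seconds


# Made-up example: 1000 emitted tokens over 20 s, with 600 rejected drafts.
run = DecodeRun(seconds=20.0, emitted_tokens=1000, rejected_drafts=600)
print(sustained_tok_s(run))  # 50.0
print(inflated_tok_s(run))   # 80.0
```

When comparing numbers, it helps to ask which of these two quantities is being reported and over how many output tokens.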

If you are purchasing hardware based on these claims, be aware that 50.5 tok/s is the realistic sustained decode you should expect today on 4x RTX PRO 6000.

What I Tested (16 Configurations)

| Configuration | MoE Backend | TP | tok/s | Notes |
|---|---|---|---|---|
| Marlin TP=4, no MTP | Marlin W4A16 | 4 | 50.5 | Best |
| Marlin TP=2+PP=2 | Marlin W4A16 | 2+PP2 | 49 | Slightly slower |
| Marlin + MTP=2 | Marlin W4A16 | 4 | 39-40 | MTP hurts |
| FlashInfer CUTLASS (Docker, 120f) | CUTLASS | 4 | 41 | 80 tactics skipped |
| FlashInfer CUTLASS (Docker, 120a) | CUTLASS | 4 | 26-40 | Varies by build |
| vLLM native CUTLASS | CUTLASS | 4 | ~5 | Garbage output |
| TP=4 default (CUTLASS auto) | CUTLASS | 4 | 6-7 | Garbage output |
| SGLang 0.5.8 | FlashInfer | 4 | NaN | Broken on SM120 |
| Expert Parallel | Marlin W4A16 | 2+EP2 | 1.4-2.6 | Catastrophic on PCIe |
| TensorRT-LLM v1.1.0 | -- | -- | N/A | Arch not supported |

Practical Recommendations for RTX PRO 6000 Users

  1. Force Marlin backend: Set VLLM_MOE_FORCE_MARLIN=1. Without this, vLLM will select the broken CUTLASS path and produce garbage.
  2. Disable MTP: Do not use --num-speculative-tokens or MTP flags. It reduces throughput on Marlin.
  3. Use TP=4: All 4 GPUs in tensor parallel. PP and EP are significantly worse on PCIe.
  4. Use FP8 KV cache: --kv-cache-dtype fp8_e4m3 --calculate-kv-scales saves VRAM without quality loss.
  5. Enable CUDA graphs: Do not use --enforce-eager.
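The recommendations above can be collected into a single launch command. The sketch below only builds and prints the command, it does not start a server. Assumptions: `MODEL_ID` is a placeholder, `262144` is my reading of "262K context", `--tensor-parallel-size` and `--max-model-len` are the usual vLLM spellings for TP and context length, and `VLLM_MOE_FORCE_MARLIN` is taken as named in this post, so check it against your vLLM build.

```python
import shlex

MODEL_ID = "your-org/your-model"  # placeholder: substitute the model you are serving


def build_launch(model_id: str, tensor_parallel: int = 4, max_model_len: int = 262144):
    """Return (command, env_overrides) reflecting the five recommendations."""
    # 1. Force the Marlin MoE backend (variable name as given in the post).
    env_overrides = {"VLLM_MOE_FORCE_MARLIN": "1"}
    cmd = [
        "vllm", "serve", model_id,
        # 3. All GPUs in tensor parallel, no PP or EP.
        "--tensor-parallel-size", str(tensor_parallel),
        # 4. FP8 KV cache with computed scales.
        "--kv-cache-dtype", "fp8_e4m3",
        "--calculate-kv-scales",
        "--max-model-len", str(max_model_len),
        # 2. No speculative/MTP flags.  5. No --enforce-eager, so CUDA graphs stay on.
    ]
    return cmd, env_overrides


cmd, env_overrides = build_launch(MODEL_ID)
prefix = " ".join(f"{k}={v}" for k, v in env_overrides.items())
print(prefix, shlex.join(cmd))
```

Keeping the environment override separate from the argument list makes it easy to pass both to `subprocess.run(cmd, env={**os.environ, **env_overrides})` if you want to launch from Python.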

Patches and PRs

Getting even this far required 12 patches across FlashInfer and vLLM. I have submitted these upstream:

The underlying CUTLASS bug is tracked at CUTLASS #3096 with no NVIDIA response as of 2026-03-11.

Bottom Line

50.5 tok/s sustained decode is usable for single-user inference. It is not what this hardware should be capable of -- if the native NVFP4 CUTLASS path worked, 100-150 tok/s should be achievable. But until NVIDIA fixes the SM120 TMA WS grouped GEMM kernels, the Marlin fallback is the only reliable path, and 50.5 tok/s is the honest number.

If anyone has achieved faster sustained decode on SM120 with correct output, I would genuinely like to know how.

Great analysis!

I'm sure performance will be fixed and exceed what's currently possible.

In the meantime thank you for your work and the suggestions.

Thank you. One thing that leaves performance on the table is running on Windows (through WSL) instead of bare-metal Linux. Switching to bare metal cuts out the Windows overhead and gives a significant increase in speed.

Hello, have you ever tested the P2P busbw between the cards with NCCL? I only get 28Gb/s of bandwidth, and in nvtop the link shows a little more than 40 while always staying at 5@16X. This looks no faster than the 4090 (4@16X).

I don't know if this will cause a decrease in performance

w7-3535x
256G DDR5 4800
x13swa
rtx pro 6000*2
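For anyone comparing NCCL numbers in this thread, it matters which figure is being quoted. In nccl-tests, `algbw` is bytes divided by time, and for all-reduce `busbw` is `algbw * 2 * (n - 1) / n`, so with 2 ranks the two are equal and with 4 ranks `busbw` is 1.5x `algbw`. A small sketch with made-up inputs (not measurements from any system in this thread):

```python
def algbw_gb_s(bytes_moved: float, seconds: float) -> float:
    """Algorithm bandwidth in GB/s: payload size divided by elapsed time."""
    return bytes_moved / seconds / 1e9


def allreduce_busbw_gb_s(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce, using the nccl-tests correction factor."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# Hypothetical example: 10 GB reduced in 0.5 s.
algbw = algbw_gb_s(10e9, 0.5)
print(algbw)                           # 20.0
print(allreduce_busbw_gb_s(algbw, 2))  # 20.0 (factor is 1 for 2 ranks)
print(allreduce_busbw_gb_s(algbw, 4))  # 30.0 (factor is 1.5 for 4 ranks)
```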
