canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP

NVFP4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained — same quantization math as RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 but with the MTP block preserved in the saved weights so vLLM can load it with --speculative-config method=mtp.

TL;DR

| Item | Summary |
|---|---|
| Recommended hardware | 4× B300 TP=4 · or RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica), both validated |
| Quality | GSM8K 91.81% strict (8-shot); MMLU-Pro 81.13%; HumanEval pass@1 0.915 (EvalPlus) |
| Throughput | 278.68 output tok/s @ bs=1 chat-code on B300 TP=4 (2.13× vs RedHat NVFP4); RTX PRO 6000 94.6 @ TP=2 / 101.0 @ TP=4 at bs=1 |
| MTP acceptance | 87.96% on chat-code at bs=1 / k=2, flat across bs=1 to bs=16 |
| Spec-decode speedup | 1.8–2.1× decode vs RedHat NVFP4 (workload-dependent) |
| Differentiator | Only V4-Flash NVFP4 quant where --speculative-config method=mtp actually fires; RedHat's artifact dropped MTP during calibration load |

Family / related artifacts

| Repo | Role | Relation to this artifact |
|---|---|---|
| canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP | sibling | W4A16 routed experts (Hopper-compatible), MTP retained, same MTP-preservation pattern. Note: on RTX PRO 6000 (SM 12.0) the W4A16 sibling's Marlin MoE decode path corrupts ~50% of generations under concurrent thinking-mode load. For batched thinking-mode workloads on SM 12.0, this NVFP4 artifact is the recommended choice. See Card D's Honest limitations and the debug log. |
| canada-quant/DeepSeek-V4-Flash-W4A16-FP8 | predecessor (no-MTP baseline) | W4A16 + FP8 without MTP, broadest hardware compatibility |
| canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP | larger sibling | Same NVFP4 + MTP recipe applied to V4-Pro; B300-only deployment |
| RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 | upstream reference | Same quant math; MTP block dropped by transformers silent-strip (the bug this artifact fixes) |

Why this exists

The HF transformers DSV4 modeling class declares _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"], which silently strips MTP keys during the calibration load path. RedHat's NVFP4-FP8 artifact ran through that path, so their saved weights don't include MTP — and serving cannot use V4-Flash's spec-decode head. This artifact patches the modeling class during calibration so MTP keys (mtp.0.*, 799 tensors) survive at BF16. The result: an NVFP4 artifact that's structurally identical to RedHat's on the math, but loadable with --speculative-config method=mtp for ~2× decode speedup.
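The stripping behaviour described above can be reproduced on a handful of key names. The sketch below is illustrative only: it assumes the loader tests each ignore pattern with re.search, the key names other than the mtp.0. prefix are made up, and both helper functions are hypothetical rather than the patch used for this artifact.

```python
import re

# Pattern quoted from the modeling class in the paragraph above.
MTP_IGNORE = r"(^|\.)mtp\..*"


def silently_dropped(keys, ignore_patterns):
    """Keys matching any ignore pattern (assumed to be tested with re.search)."""
    return [k for k in keys if any(re.search(p, k) for p in ignore_patterns)]


def without_mtp_pattern(ignore_patterns):
    """Hypothetical patch step: keep every ignore pattern except the MTP one."""
    return [p for p in ignore_patterns if p != MTP_IGNORE]


# Illustrative key names; only the "mtp.0." prefix comes from the card.
keys = [
    "mtp.0.attn.wq_a.weight",
    "model.mtp.0.ffn.gate.weight",
    "layers.0.attn.wq_a.weight",
    "lm_head.weight",
]

dropped_before = silently_dropped(keys, [MTP_IGNORE])
dropped_after = silently_dropped(keys, without_mtp_pattern([MTP_IGNORE]))
print("dropped with stock pattern list:", dropped_before)
print("dropped after removing the MTP pattern:", dropped_after)
```

With the stock list, both MTP-prefixed names are dropped; once the pattern is removed, nothing is.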

Architecture & precision

Base model

| Property | Value |
|---|---|
| Total parameters | 284 B (13 B active per token) |
| Decoder layers | 43 |
| Routed experts / layer | 256 (top-K = 6) |
| Hidden size | 4096 |
| Base BF16 size | ~600 GB |
| Quantized size | 172 GB across 35 safetensors shards |

Component precisions

| Component | Format | Method |
|---|---|---|
| Routed FFN experts (w1, w2, w3 per expert) | NVFP4 group=16 | weight static + input dynamic "local" FP4 group=16, nvfp4-pack-quantized |
| Attention path (wq_a, wq_b, wkv, wo_a, wo_b and fused) | FP8_BLOCK 128×128 | weight static + input dynamic FP8 group=128, float-quantized |
| MTP block (mtp.0.*) | BF16 | Preserved verbatim (799 tensors) |
| lm_head, embed_tokens, norms, ffn.gate, ffn.shared_experts, attn compressor, attn indexer, attn_sink, hc_* | BF16 | Unquantized |
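As a rough guide to what these formats cost in storage, the sketch below amortizes each group's scale over the weights it covers. It assumes NVFP4 stores one 8-bit scale per group of 16 four-bit values and, purely for illustration, a 32-bit scale per 128×128 FP8 block; per-tensor global scales and file metadata are ignored, so this is an estimate and not a breakdown of the 172 GB figure.

```python
def effective_bits(weight_bits, scale_bits, elements_per_scale):
    """Storage bits per weight once the per-group scale is amortized."""
    return weight_bits + scale_bits / elements_per_scale


# NVFP4: 4-bit values, one 8-bit scale per group of 16 weights.
nvfp4_bits = effective_bits(4, 8, 16)

# FP8 block: 8-bit values, one scale per 128x128 block.
# The 32-bit scale width is an assumption made for this illustration.
fp8_block_bits = effective_bits(8, 32, 128 * 128)

# BF16 components (MTP block, lm_head, norms, ...) carry no scales.
bf16_bits = 16.0

print(f"NVFP4 experts : {nvfp4_bits:.3f} bits/weight")
print(f"FP8 attention : {fp8_block_bits:.3f} bits/weight")
print(f"BF16 tensors  : {bf16_bits:.3f} bits/weight")
```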

Hardware validated

| Platform | SM | Memory/GPU | Interconnect | TP | Role |
|---|---|---|---|---|---|
| 4× NVIDIA B300 SXM6 AC | 10.3, sm_103a | 288 GB HBM3e | NVLink | 4 (TP=8 for BF16 reference) | Primary: all accuracy + throughput numbers |
| 4× NVIDIA RTX PRO 6000 Blackwell Server Edition | 12.0, sm_120 | 96 GB GDDR7 | PCIe | TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica) | Also validated: both TP configs + GSM8K-50 cross-check, 3 extra patches |

Both platforms serve with CUDA graphs on. The artifact is the same on both SKUs, with no weight changes.

Benchmarks

Quality (hardware-invariant — measured on B300)

Measured 2026-05-21 on 4× B300 SXM6 AC (TP=4 for quant configs, TP=8 for BF16 reference which doesn't fit at TP=4). Greedy, temperature 0. The same artifact serves on RTX PRO 6000 Blackwell with no weight changes; GSM8K-50 cross-check: 88% strict TP=2 / 90% strict TP=4 on RTX PRO 6000 vs 91.81% strict full-set on B300 (within noise).
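The "within noise" reading of the GSM8K-50 cross-check can be sanity-checked with a binomial standard error. This is a back-of-the-envelope sketch that treats the 50 problems as independent draws at the full-set accuracy, which is an assumption and not how the card's numbers were produced.

```python
import math


def binomial_se(p, n):
    """Standard error of an accuracy estimate from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


full_set_strict = 0.9181  # B300, full GSM8K set (from the card)
subset_n = 50             # size of the GSM8K-50 cross-check
se = binomial_se(full_set_strict, subset_n)

z_scores = {}
for label, score in [("TP=2", 0.88), ("TP=4", 0.90)]:
    z_scores[label] = (score - full_set_strict) / se
    print(f"{label}: {score:.2f} is {z_scores[label]:+.2f} SE from {full_set_strict}")
```

A 50-problem subset has a standard error of nearly 4 points at this accuracy, so it can confirm the absence of gross regressions but cannot resolve differences of a point or two.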

| Benchmark | Setting | This artifact | BF16 + MTP reference | RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 (no MTP) |
|---|---|---|---|---|
| AIME 2024 | raw pass@1, thinking=high, max_tokens=65536 | 25/30 = 83.33% | 25/30 = 83.33% | 27/30 = 90.00% |
| AIME 2024 | non-truncated pass@1 | 24/25 = 96.00% | 25/26 = 96.15% | 27/28 = 96.43% |
| AIME 2024 | wall-clock for 30 problems @ bs=8 | 476 s | 490 s | 1405 s |
| GSM8K | 8-shot, strict-match | 0.9181 | 0.9484 / 0.9522 (no-MTP / MTP) | 0.910 (self-reported) |
| GSM8K | 8-shot, flexible-extract | 0.9515 | 0.9477 / 0.9515 | not reported |
| MMLU-Pro | 5-shot, custom-extract | 0.8113 | not measured | not reported |
| HumanEval | pass@1 (EvalPlus) | 0.915 | not measured | 0.896 |
| HumanEval+ | pass@1 (EvalPlus) | 0.848 | not measured | 0.860 |
| IFEval | prompt-strict (B300) | 0.8540 | not measured | 0.8207 |
| IFEval | prompt-strict (RTX PRO 6000 TP=4, 2026-05-24, JSON evidence) | 0.8429 (-1.1pp vs B300) | | |

On raw AIME pass@1, RedHat scores higher (27/30 vs ours 25/30) — but the gap is entirely truncation rate at the 65K max_tokens cap (RedHat truncated 2/30, ours 5/30). On non-truncated pass@1, all three configs are within 0.4 pt of each other (96.0–96.4%). Quantization quality is equivalent on AIME 2024; the differentiator is wall-clock.
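The two AIME views follow directly from the counts in the table. The short sketch below recomputes them; the dictionary layout and the pct helper are illustrative names, and the counts are copied from the table.

```python
# (correct, attempted) pairs copied from the quality table.
results = {
    "this artifact": {"raw": (25, 30), "non_truncated": (24, 25)},
    "BF16 + MTP": {"raw": (25, 30), "non_truncated": (25, 26)},
    "RedHat NVFP4": {"raw": (27, 30), "non_truncated": (27, 28)},
}


def pct(correct, total):
    """Pass@1 as a percentage, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


truncated = {}
for name, r in results.items():
    # Truncated responses = all attempts minus those that finished.
    truncated[name] = r["raw"][1] - r["non_truncated"][1]
    print(
        f"{name:14s} raw={pct(*r['raw']):6.2f}%  "
        f"non-truncated={pct(*r['non_truncated']):6.2f}%  "
        f"truncated={truncated[name]}/30"
    )
```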

Throughput

4× B300 SXM6 (sm_103a, NVLink, TP=4)

Same hardware, same TP=4, same prompts as the quality table.

| Workload | Operating point | This artifact | RedHat NVFP4 (no MTP) | Ratio |
|---|---|---|---|---|
| AIME 2024 reasoning (thinking=high, bs=8) | wall-clock for 30 problems | 476 s | 1405 s | 2.95× |
| AIME 2024 reasoning | per-request median output tok/s | 182.9 | 99.6 | 1.84× |
| Coding (HumanEval chat, bs=1) | output tok/s | 278.68 | 131.06 | 2.13× |
| Coding (HumanEval chat, bs=4) | output tok/s | 649.35 | 417.87 | 1.55× |
| Coding (HumanEval chat, bs=8) | output tok/s | 1104.89 | 673.12 | 1.64× |
| Coding (HumanEval chat, bs=16) | output tok/s | 1577.20 | 1007.78 | 1.56× |

Two ratios to disambiguate:

  • Pure decode throughput: at bs=1 chat coding, 2.13× faster. On AIME reasoning at bs=8, per-request median is 182.9 vs 99.6 tok/s, or 1.84×. The decode ratio is workload-dependent (acceptance % varies) and lands in the 1.8–2.1× range for these two per-request measurements; batched coding at bs=4 to bs=16 measures 1.55–1.64×.
  • AIME batch wall-clock: 1405 s / 476 s = 2.95×. This figure also absorbs the truncation-rate differential at 65K: 5/30 of our responses truncated vs 2/30 of RedHat's, and truncated responses run to the cap, which adds wall-clock to whichever run truncates more (ours, in this case). The 2.95× number is "time to run AIME 2024 end-to-end," not "raw decode speed."
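The ratios quoted in the two bullets above can be recomputed from the table values. The helper name and its higher_is_better flag are illustrative; throughput ratios divide ours by the baseline, while wall-clock ratios divide the baseline by ours.

```python
def ratio(ours, baseline, higher_is_better=True):
    """Speedup of this artifact over the baseline, rounded to two decimals."""
    value = ours / baseline if higher_is_better else baseline / ours
    return round(value, 2)


# Values copied from the B300 throughput table.
decode_bs1 = ratio(278.68, 131.06)           # output tok/s, chat coding
aime_per_request = ratio(182.9, 99.6)        # median output tok/s per request
aime_wall_clock = ratio(476, 1405, higher_is_better=False)  # seconds

print(f"bs=1 decode      : {decode_bs1}x")
print(f"AIME per-request : {aime_per_request}x")
print(f"AIME wall-clock  : {aime_wall_clock}x")
```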

4× RTX PRO 6000 Blackwell (sm_120, PCIe, TP=2 and TP=4)

Validated 2026-05-23 on a Brev familiar-teal-worm instance. Per-replica vllm bench serve random 256-in/256-out, num_speculative_tokens=1 (SM 12.0 caps spec at k=1). MTP-on for all rows.

| Config | bs=1 output tok/s | bs=4 output tok/s | bs=16 output tok/s | bs=1 TPOT median | MTP acceptance | GSM8K-50 strict |
|---|---|---|---|---|---|---|
| TP=2 | 94.6 | 218.5 | 360.5 | 9.05 ms | 70–73% | 88% |
| TP=4 | 101.0 | 254.0 | 440.1 | 8.20 ms | 67–75% | 90% |

At bs=16, TP=4 is 1.22× faster per-replica than TP=2 on this hardware. This is the opposite of B300, where TP=4 beats TP=8 due to NVFP4 tensor-core underutilization. Despite the slower PCIe interconnect, RTX PRO 6000's lower per-GPU compute means extra parallelism still pays off at all batch sizes measured.

For context, on the same RTX PRO 6000 box the W4A16-FP8-MTP sibling measured 98.83 tok/s at TP=2 bs=1. Decode throughput is roughly equivalent: NVFP4 is ~4% slower per replica and ~8% larger on disk (172 GB vs 159 GB).

AIME-2024 deep thinking-mode concurrency sweep (2026-05-25, TP=4)

cuda graphs ON (capture sizes [1,2,4,8]), MTP num_speculative_tokens=1, max-model-len=16384. Bench JSONs at canada-quant/dsv4-flash-nvfp4-fp8-mtp/benchmarks/rtxpro6000/.

| Concurrency | Correct/30 | Stop / Length | Errors | Wall (s) | Problems/min | MTP accept | Speedup vs c=1 |
|---|---|---|---|---|---|---|---|
| c=1 (sequential) | 24/30 (80.0%) | 22 / 8 | 0 | 1453.9 | 1.24 | 90.61% | 1.0× |
| c=2 | 23/30 (76.7%) | 23 / 7 | 0 | 787.6 | 2.29 | 90.75% | 1.85× |
| c=4 | 21/30 (70.0%) | 20 / 10 | 0 | 386.6 | 4.66 | 90.93% | 3.76× |
| c=8 (terminated) | | | | | | | |

Findings:

  • 0 errors and 0 stopped-but-wrong at c=1/2/4. Every wrong answer is length-truncated at max_tokens, not a quality issue — non-truncated pass@1 is essentially 100%.
  • MTP acceptance stable at 90.6–90.9% across c=1/c=2/c=4. The NVFP4 flashinfer_trtllm MoE backend on SM 12.0 is rock-solid under all tested concurrencies (unlike the W4A16 sibling's Marlin MoE path — see Card D for that story).
  • c=8 throughput collapse: TP=4 with no NVLink (PCIe-only) drops combined throughput from 450 t/s @ c=4 to ~38 t/s @ c=8, a ~12× drop in combined throughput (about 24× per request, since twice as many requests share it). MTP itself stayed healthy; the bottleneck is TP-allreduce communication over PCIe at high concurrency. Recommendation for higher aggregate throughput on RTX PRO 6000: run 2 replicas at TP=2 instead of 1 replica at TP=4 c=8.
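The derived columns of the sweep table follow from the wall-clock times alone. The sketch below recomputes them and adds a scaling-efficiency figure (speedup divided by concurrency) that is not in the table; the function names are illustrative.

```python
N_PROBLEMS = 30

# Wall-clock seconds per concurrency level, copied from the sweep table.
wall_s = {1: 1453.9, 2: 787.6, 4: 386.6}


def problems_per_min(seconds):
    return round(N_PROBLEMS / (seconds / 60.0), 2)


def speedup_vs_c1(concurrency):
    return round(wall_s[1] / wall_s[concurrency], 2)


for c in wall_s:
    s = speedup_vs_c1(c)
    # Efficiency of 1.0 would mean perfectly linear scaling with concurrency.
    print(
        f"c={c}: {problems_per_min(wall_s[c])} problems/min, "
        f"{s}x vs c=1, efficiency {s / c:.2f}"
    )
```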

MTP draft-token acceptance per workload (B300, bs=1, k=2)

| Workload | Acceptance |
|---|---|
| Random prompts (1024 in / 512 out) | 10.75% |
| Code, raw completion (HumanEval /v1/completions) | 67.29% |
| Code, chat-templated (HumanEval /v1/chat/completions, bs=1) | 87.96% |
| Code, chat-templated, bs=4 / bs=8 / bs=16 | 88.27% / 87.92% / 88.19% |
| Instruction following (IFEval) | ~58.5% |
| AIME 2024 reasoning (thinking=high) | 81.60% |

Acceptance does not degrade under batching — flat at 88.0% ± 0.4% across bs=1 to bs=16 on chat-templated coding. RTX PRO 6000 acceptance lands in 67–75% on the random-prompt workload (256-in/256-out, not directly comparable to the workload-specific rows above).
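To relate acceptance to decode speed, the sketch below converts an acceptance rate into mean tokens emitted per target-model step. It assumes the reported acceptance is accepted draft tokens divided by proposed draft tokens, and it ignores the cost of running the draft head, so the result is an upper bound on speedup rather than a prediction.

```python
def tokens_per_step(acceptance_rate, k):
    """Mean tokens per target step: one verified token plus accepted drafts.

    Assumes acceptance_rate = accepted draft tokens / proposed draft tokens,
    with k draft tokens proposed per step.
    """
    return 1.0 + k * acceptance_rate


# Acceptance figures copied from the card.
b300_chat_code = tokens_per_step(0.8796, k=2)
b300_random = tokens_per_step(0.1075, k=2)
rtx_random_low = tokens_per_step(0.67, k=1)

print(f"B300 chat-code, k=2 : {b300_chat_code:.3f} tokens/step")
print(f"B300 random, k=2    : {b300_random:.3f} tokens/step")
print(f"RTX random, k=1     : {rtx_random_low:.3f} tokens/step")
```

Under these assumptions the chat-code workload has a ceiling of roughly 2.76 tokens per step, which is consistent with the measured 2.13× at bs=1 once draft and verification overhead are paid.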

Quick start

One-line installer (applies all common patches):

curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash

Serve with MTP spec-decode (B300):

CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Without spec-decode:

CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8

Recommended TP:

  • B300: TP=4. TP=8 is slower than TP=4 at bs≥4 by up to 21.6% — per-rank MoE expert shards at TP=8 underutilize NVFP4 tensor-core kernels.
  • RTX PRO 6000: TP=4 with reduced cudagraph captures + --max-num-seqs 8 --max-num-batched-tokens 2048 to fit memory. TP=2 also works; expect 1.22× lower per-replica throughput at bs=16.
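For scripted deployments, the serve commands above can be assembled programmatically. This is a minimal sketch that uses only the flags already shown in this card; build_serve_argv is a hypothetical helper, and environment variables and the reduced cudagraph capture settings are left out because the card does not spell out their flags here.

```python
import json
import shlex

MODEL = "canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP"


def build_serve_argv(tp, num_speculative_tokens=None, extra=()):
    """Build a `vllm serve` argument list; spec-decode is off when k is None."""
    argv = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(tp),
        "--kv-cache-dtype", "fp8",
    ]
    if num_speculative_tokens is not None:
        spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
        # Compact separators reproduce the JSON string used in the card.
        argv += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    argv += list(extra)
    return argv


# B300: TP=4 with k=2. RTX PRO 6000: TP=4 with k=1 and the memory limits above.
b300 = build_serve_argv(tp=4, num_speculative_tokens=2)
rtx = build_serve_argv(
    tp=4,
    num_speculative_tokens=1,
    extra=["--max-num-seqs", "8", "--max-num-batched-tokens", "2048"],
)

print(shlex.join(b300))
print(shlex.join(rtx))
```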

Quantization recipe

| Property | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k train_sft (V4 chat template) |
| Samples | 64 × max_seq_len 512 × batch_size 1, seed 42 |
| Modifier class | QuantizationModifier (not GPTQ: Hessian-reduce path hangs on multi-rank B300) |
| Hardware | calibration on B300 |

Calibration corpus is 12× smaller than RedHat's reference recipe (64 vs 768 samples). On the benchmarks measured, GSM8K / HumanEval / IFEval / MMLU-Pro / AIME-non-truncated all land within noise of the reference. The visible cost of reduced coverage is AIME truncation rate (5/30 vs RedHat's 2/30 at the 65K max_tokens cap), consistent with looser calibration scales producing less-converging reasoning trajectories. A v0.2 recipe with 768 samples is planned.

| Group | Modules | Scheme | Format |
|---|---|---|---|
| attention | wq_a, wq_b, wkv, wo_a, wo_b (and fused variants) | FP8_BLOCK 128×128, weight static + input dynamic FP8 group=128 | float-quantized |
| experts | w1, w2, w3 per expert | NVFP4 group=16, weight static + input dynamic "local" FP4 group=16 | nvfp4-pack-quantized |
| ignored | lm_head, embed_tokens, norms, ffn.gate, ffn.shared_experts, attn compressor, attn indexer, attn_sink, hc_* | unquantized (BF16) | n/a |
| MTP block (mtp.0.*) | all 799 keys | unquantized (BF16, preserved verbatim) | n/a |

vLLM build

Common patches (all platforms)

| PR | Purpose | Status |
|---|---|---|
| vllm-project/vllm#43248 | bool() wrap on is_static_input_scheme | open |
| vllm-project/vllm#43288 | .get("scale_fmt", "ue8m0") on missing key + BF16 getattr follow-up | open |
| vllm-project/vllm#43290 | weight_scale_inv-or-weight_scale fallback | open |
| vllm-project/vllm#43319 | MTP-quant-detect from safetensors header + BF16 wo_a fallback path | open |

The one-line installer applies all four automatically.
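The first three patches are small defensive patterns. The sketch below illustrates each pattern in isolation, based only on the Purpose column; the function names and the stand-in layer objects are hypothetical, and this is not the code in the PRs.

```python
from types import SimpleNamespace


def resolve_weight_scale(layer):
    """Prefer weight_scale_inv, fall back to weight_scale if it is absent."""
    scale = getattr(layer, "weight_scale_inv", None)
    if scale is None:
        scale = getattr(layer, "weight_scale", None)
    if scale is None:
        raise AttributeError("layer has neither weight_scale_inv nor weight_scale")
    return scale


def resolve_scale_fmt(quant_config):
    """Read scale_fmt, defaulting to "ue8m0" when the key is missing."""
    return quant_config.get("scale_fmt", "ue8m0")


def is_static_input(scheme_flag):
    """Coerce a possibly-None or non-bool flag to a plain bool."""
    return bool(scheme_flag)


# Stand-in objects for illustration; real layers are torch modules.
layer_a = SimpleNamespace(weight_scale_inv=0.5)
layer_b = SimpleNamespace(weight_scale=0.25)

print(resolve_weight_scale(layer_a), resolve_weight_scale(layer_b))
print(resolve_scale_fmt({}), resolve_scale_fmt({"scale_fmt": "fp32"}))
print(is_static_input(None), is_static_input(1))
```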

RTX PRO 6000 Blackwell (SM 12.0) only

Three SM 12.0-specific patches required on top of the four common patches. Diffs in patches/sm120_*.diff in the source repo. Full rationale at docs/RECIPE_RTX6000PRO.md.

  1. VLLM_TEST_FORCE_FP8_MARLIN=1 env var — bypasses the NVFP4 MoE backend selector's swiglu_limit filter (no FLASHINFER_TRTLLM NVFP4 kernel auto-selects on SM 12.0).
  2. weight_scale_inv-or-weight_scale fallback in Marlin's scaled_mm/marlin.py (PR #43290 covers attention.py only; SM 12.0 also hits Marlin's pre-process site).
  3. Skip Marlin pre-processing for layers tagged is_bmm=True — DSV4 wo_a/wo_b/compressor.wkv use the SM 12.0 Triton fp8_einsum kernel directly; Marlin's tile-layout repack breaks the original (N, K) layout the einsum expects.

B300 deployments can skip all three.

Honest limitations

  1. AIME truncation rate at 65K — 5/30 of responses hit the cap on long reasoning traces vs RedHat's 2/30. Consistent with the 12×-smaller calibration corpus producing less-converging reasoning trajectories. Non-truncated pass@1 is at parity with RedHat. v0.2 with 768 samples planned.
  2. NVFP4 MoE backend selector on SM 12.0 — no FLASHINFER_TRTLLM kernel auto-selects, requires the VLLM_TEST_FORCE_FP8_MARLIN=1 env var to route through Marlin. Native NVFP4 SM 12.0 kernels exist in upstream vLLM (csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu) but aren't picked by the backend selector (vllm-project/vllm#31085).
  3. k=1 cap on RTX PRO 6000 — SM 12.0 caps spec-decode at num_speculative_tokens=1; B300 supports k=2.
  4. AIME thinking acceptance @ 81.60% is lower than the chat-code 87.96% headline — workload-dependent, expected, called out for transparency.
  5. IFEval re-bench 2026-05-24 (RTX PRO 6000 TP=4): slightly lower than the published B300 numbers. A fresh lm_eval ifeval --apply_chat_template num_concurrent=16 run on RTX PRO 6000 TP=4 (post-PR-#40923-rebuild) returned prompt_strict 0.8429, prompt_loose 0.8780, inst_strict 0.8945, inst_loose 0.9185, which is 0.6–1.5 pp below the published markdown numbers (0.8540 / 0.8928 / 0.9005 / 0.9293). The published numbers likely came from B300, the primary benchmark platform. Raw JSON evidence is committed at benchmarks/rtxpro6000/ifeval_2026_05_24.json, which closes the earlier "no on-disk JSON evidence" flag.

Reproduction

Full replication recipe at docs/recipes/nvfp4_fp8_mtp_replication.md — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).

Upstream contributions filed during this work

| PR / Issue | Description | Status |
|---|---|---|
| vllm-project/vllm#43248 | bool() wrap on is_static_input_scheme | open |
| vllm-project/vllm#43288 | .get("scale_fmt", "ue8m0") defensive + BF16 follow-up | open |
| vllm-project/vllm#43290 | weight_scale_inv-or-weight_scale fallback | open |
| vllm-project/vllm#43319 | MTP-quant-detect from safetensors + BF16 wo_a fallback | open |
| vllm-project/vllm#43297 | (1,)-shape global_scale loader broadcast (issue) | open |
| vllm-project/vllm#43304 | MTP draft inherits main quant scheme (issue) | partially addressed by #43319 |
| vllm-project/llm-compressor#2745 | MTP inference-mode crash | open |
| vllm-project/compressed-tensors#711 | sharded-module load path | open |

PR vllm-project/vllm#42209 (sychen52, xinli-sw, pavanimajety, zyongye of NVIDIA), which added the DSV4 NVFP4 MoE kernel, merged 2026-05-22; this artifact serves on top of it.

Changes

| Date | Change |
|---|---|
| 2026-05-21 | Initial release on B300: GSM8K 0.9181, HumanEval 0.915, IFEval 0.8540, MTP acceptance 87.96% on chat-code |
| 2026-05-23 | RTX PRO 6000 Blackwell (SM 12.0) validation added. TP=2 and TP=4 confirmed, MTP acceptance 67–75%, GSM8K-50 within noise of B300 |
| 2026-05-24 | Cross-card finding: AIME c=4 thinking-mode on RTX PRO 6000 shows this NVFP4 artifact produces 1/30 token-corrupted generations vs the W4A16-MTP sibling's 14/30 corrupted on the same hardware + vLLM build. The W4A16 sibling has a Marlin MoE decode race on SM 12.0; this NVFP4 artifact via flashinfer_trtllm MoE is the recommended deployment for batched thinking-mode on RTX PRO 6000. Filed upstream: jasl/vllm#12. |

Files in the artifact

  • 35 sharded model-*.safetensors files + model.safetensors.index.json (172 GB total)
  • config.json — vLLM-compatible quantization_config with fused targets + W8A8 input_activations
  • tokenizer.json, tokenizer_config.json, generation_config.json — upstream DSV4-Flash
  • chat_template.jinja — upstream DSV4-Flash (unchanged)
  • recipe.yaml — the llm-compressor calibration recipe
  • README.md — this file
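One way to confirm that a downloaded copy still carries the MTP block is to count MTP entries in model.safetensors.index.json. The sketch below assumes the standard index layout with a top-level weight_map from tensor name to shard file; count_mtp_keys is a hypothetical helper, and per this card the count should come to 799.

```python
import json
import re
from pathlib import Path

# Matches "mtp." at the start of a name or after a dot, as in the
# ignore pattern quoted in "Why this exists".
MTP_KEY = re.compile(r"(^|\.)mtp\.")


def count_mtp_keys(index_path):
    """Return (number of MTP tensors, sorted shard files that hold them)."""
    index = json.loads(Path(index_path).read_text())
    weight_map = index["weight_map"]  # tensor name -> shard filename
    mtp_names = [name for name in weight_map if MTP_KEY.search(name)]
    shards = sorted({weight_map[name] for name in mtp_names})
    return len(mtp_names), shards


# Example usage against a local snapshot directory (path is illustrative):
#   n, shards = count_mtp_keys("DeepSeek-V4-Flash-NVFP4-FP8-MTP/model.safetensors.index.json")
#   print(n, shards)
```

A count of zero would indicate that the MTP keys were stripped somewhere between calibration and upload.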

Citation

@misc{canada-quant-dsv4-flash-nvfp4-fp8-mtp-2026,
  title  = {DeepSeek-V4-Flash NVFP4-FP8 with MTP preserved for vLLM speculative decoding},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.

Acknowledgments

  • DeepSeek for V4-Flash and the MTP architecture.
  • RedHat AI for the NVFP4-FP8 reference recipe.
  • PR #42209 contributors (sychen52, xinli-sw, pavanimajety, zyongye) for the DSV4 NVFP4 MoE kernel work that made serving possible.
  • canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (W4A16 sibling) for the alias-injection pattern and MTP acceptance methodology.
  • vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.