canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP

NVFP4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained — same quantization math as RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 but with the MTP block preserved in the saved weights so vLLM can load it with --speculative-config method=mtp.

TL;DR


Recommended hardware	4× B300 TP=4 · or RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica) — both validated
Quality	GSM8K 91.81% strict (8-shot); MMLU-Pro 81.13%; HumanEval pass@1 0.915 (EvalPlus)
Throughput	278.68 output tok/s @ bs=1 chat-code on B300 TP=4 (2.13× vs RedHat NVFP4); RTX PRO 6000 94.6 @ TP=2 / 101.0 @ TP=4 at bs=1
MTP acceptance	87.96% on chat-code at bs=1 / k=2 — flat across bs=1 to bs=16
Spec-decode speedup	1.8–2.1× decode vs RedHat NVFP4 (workload-dependent)
Differentiator	Only V4-Flash NVFP4 quant where `--speculative-config method=mtp` actually fires — RedHat's artifact dropped MTP during calibration load

Family / related artifacts

Repo	Role	Relation to this artifact
`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`	sibling	W4A16 routed experts (Hopper-compatible), MTP retained — same MTP-preservation pattern. Note: on RTX PRO 6000 (SM 12.0) the W4A16 sibling's Marlin MoE decode path corrupts ~50% of generations under concurrent thinking-mode load. For batched thinking-mode workloads on SM 12.0, this NVFP4 artifact is the recommended choice. See Card D's Honest limitations and the debug log.
`canada-quant/DeepSeek-V4-Flash-W4A16-FP8`	predecessor (no-MTP baseline)	W4A16 + FP8 without MTP — broadest hardware compatibility
`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`	larger sibling	Same NVFP4 + MTP recipe applied to V4-Pro; B300-only deployment
`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`	upstream reference	Same quant math; MTP block dropped by `transformers` silent-strip (the bug this artifact fixes)

Why this exists

The HF transformers DSV4 modeling class declares _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"], which silently strips MTP keys during the calibration load path. RedHat's NVFP4-FP8 artifact ran through that path, so their saved weights don't include MTP — and serving cannot use V4-Flash's spec-decode head. This artifact patches the modeling class during calibration so MTP keys (mtp.0.*, 799 tensors) survive at BF16. The result: an NVFP4 artifact that's structurally identical to RedHat's on the math, but loadable with --speculative-config method=mtp for ~2× decode speedup.

Architecture & precision

Base model

Property	Value
Total parameters	~~284 B (~~13 B active per token)
Decoder layers	43
Routed experts / layer	256 (top-K = 6)
Hidden size	4096
Base BF16 size	~600 GB
Quantized size	172 GB across 35 safetensors shards

Component precisions

Component	Format	Method
Routed FFN experts (`w1`, `w2`, `w3` per expert)	NVFP4 group=16	weight static + input dynamic "local" FP4 group=16, `nvfp4-pack-quantized`
Attention path (`wq_a`, `wq_b`, `wkv`, `wo_a`, `wo_b` and fused)	FP8_BLOCK 128×128	weight static + input dynamic FP8 group=128, `float-quantized`
*MTP block (`mtp.0.`)**	BF16	Preserved verbatim (799 tensors)
`lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*`	BF16	Unquantized

Hardware validated

Platform	SM	HBM/GPU	Interconnect	TP	Role
4× NVIDIA B300 SXM6 AC	10.3, sm_103a	288 GB HBM3e	NVLink	4 (TP=8 for BF16 reference)	Primary — all accuracy + throughput numbers
4× NVIDIA RTX PRO 6000 Blackwell Server Edition	12.0, sm_120	96 GB HBM	PCIe	TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica)	Also validated — both TP configs + GSM8K-50 cross-check, 3 extra patches

Both platforms serve cuda graphs ON. Same artifact, no weight changes between SKUs.

Benchmarks

Quality (hardware-invariant — measured on B300)

Measured 2026-05-21 on 4× B300 SXM6 AC (TP=4 for quant configs, TP=8 for BF16 reference which doesn't fit at TP=4). Greedy, temperature 0. The same artifact serves on RTX PRO 6000 Blackwell with no weight changes; GSM8K-50 cross-check: 88% strict TP=2 / 90% strict TP=4 on RTX PRO 6000 vs 91.81% strict full-set on B300 (within noise).

Benchmark	Setting	This artifact	BF16 + MTP reference	`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8` (no MTP)
AIME 2024	raw pass@1, thinking=high, max_tokens=65536	25/30 = 83.33%	25/30 = 83.33%	27/30 = 90.00%
AIME 2024	non-truncated pass@1	24/25 = 96.00%	25/26 = 96.15%	27/28 = 96.43%
AIME 2024	wall-clock for 30 problems @ bs=8	476 s	490 s	1405 s
GSM8K	8-shot, strict-match	0.9181	0.9484 / 0.9522 (no-MTP / MTP)	0.910 (self-reported)
GSM8K	8-shot, flexible-extract	0.9515	0.9477 / 0.9515	not reported
MMLU-Pro	5-shot, custom-extract	0.8113	not measured	not reported
HumanEval	pass@1 (EvalPlus)	0.915	not measured	0.896
HumanEval+	pass@1 (EvalPlus)	0.848	not measured	0.860
IFEval	prompt-strict (B300)	0.8540	not measured	0.8207
IFEval	prompt-strict (RTX PRO 6000 TP=4, 2026-05-24, JSON evidence)	0.8429 (-1.1pp vs B300)	—	—

On raw AIME pass@1, RedHat scores higher (27/30 vs ours 25/30) — but the gap is entirely truncation rate at the 65K max_tokens cap (RedHat truncated 2/30, ours 5/30). On non-truncated pass@1, all three configs are within 0.4 pt of each other (96.0–96.4%). Quantization quality is equivalent on AIME 2024; the differentiator is wall-clock.

Throughput

4× B300 SXM6 (sm_103a, NVLink, TP=4)

Same hardware, same TP=4, same prompts as the quality table.

Workload	Operating point	This artifact	RedHat NVFP4 (no MTP)	Ratio
AIME 2024 reasoning (thinking=high, bs=8)	wall-clock for 30 problems	476 s	1405 s	2.95×
AIME 2024 reasoning	per-request median output tok/s	182.9	99.6	1.84×
Coding (HumanEval chat, bs=1)	output tok/s	278.68	131.06	2.13×
Coding (HumanEval chat, bs=4)	output tok/s	649.35	417.87	1.55×
Coding (HumanEval chat, bs=8)	output tok/s	1104.89	673.12	1.64×
Coding (HumanEval chat, bs=16)	output tok/s	1577.20	1007.78	1.56×

Two ratios to disambiguate:

Pure decode throughput: at bs=1 chat coding, 2.13× faster. On AIME reasoning at bs=8, per-request median is 182.9 vs 99.6 tok/s — 1.84×. The decode ratio is workload-dependent (acceptance % varies) but lands in the 1.8–2.1× range across measured workloads.
AIME batch wall-clock: 1405 s / 476 s = 2.95×. This includes the truncation-rate differential at 65K — 5/30 of our responses truncated vs 2/30 of RedHat's, and truncated responses run to the cap, inflating RedHat's total wall-clock. The 2.95× number is "time to run AIME 2024 end-to-end," not "raw decode speed."

4× RTX PRO 6000 Blackwell (sm_120, PCIe, TP=2 and TP=4)

Validated 2026-05-23 on a Brev familiar-teal-worm instance. Per-replica vllm bench serve random 256-in/256-out, num_speculative_tokens=1 (SM 12.0 caps spec at k=1). MTP-on for all rows.

Config	bs=1 output tok/s	bs=4 output tok/s	bs=16 output tok/s	bs=1 TPOT median	MTP acceptance	GSM8K-50 strict
TP=2	94.6	218.5	360.5	9.05 ms	70–73%	88%
TP=4	101.0	254.0	440.1	8.20 ms	67–75%	90%

At bs=16, TP=4 is 1.22× faster per-replica than TP=2 on this hardware — opposite of B300, where TP=4 beats TP=8 due to NVFP4 tensor-core underutilization. RTX PRO 6000's slower PCIe interconnect plus lower per-GPU compute means extra parallelism still pays off at all batch sizes measured.

For context on the same RTX PRO 6000 box, the W4A16-FP8-MTP sibling measured 98.83 tok/s at TP=2 bs=1 — equivalent decode throughput, with NVFP4 trading ~4% per-replica throughput for ~10% smaller on-disk footprint (172 GB vs 159 GB).

AIME-2024 deep thinking-mode concurrency sweep (2026-05-25, TP=4)

cuda graphs ON (capture sizes [1,2,4,8]), MTP num_speculative_tokens=1, max-model-len=16384. Bench JSONs at canada-quant/dsv4-flash-nvfp4-fp8-mtp/benchmarks/rtxpro6000/.

Concurrency	Correct/30	Stop / Length	Errors	Wall (s)	Problems/min	MTP accept	Speedup vs c=1
c=1 (sequential)	24/30 (80.0%)	22 / 8	0	1453.9	1.24	90.61%	1.0×
c=2	23/30 (76.7%)	23 / 7	0	787.6	2.29	90.75%	1.85×
c=4	21/30 (70.0%)	20 / 10	0	386.6	4.66	90.93%	3.76×
c=8	(terminated)	—	—	—	—	—	—

Findings:

0 errors and 0 stopped-but-wrong at c=1/2/4. Every wrong answer is length-truncated at max_tokens, not a quality issue — non-truncated pass@1 is essentially 100%.
MTP acceptance stable at 90.6–90.9% across c=1/c=2/c=4. The NVFP4 flashinfer_trtllm MoE backend on SM 12.0 is rock-solid under all tested concurrencies (unlike the W4A16 sibling's Marlin MoE path — see Card D for that story).
c=8 throughput collapse: TP=4 with no NVLink (PCIe-only) drops combined throughput from 450 t/s @ c=4 to ~38 t/s @ c=8 — a 12× per-request slowdown. MTP itself stayed healthy; the bottleneck is TP-allreduce communication over PCIe at high concurrency. Recommendation for higher aggregate throughput on RTX PRO 6000: run 2 replicas at TP=2 instead of 1 replica at TP=4 c=8.

MTP draft-token acceptance per workload (B300, bs=1, k=2)

Workload	Acceptance
Random prompts (1024 in / 512 out)	10.75%
Code, raw completion (HumanEval `/v1/completions`)	67.29%
Code, chat-templated (HumanEval `/v1/chat/completions`, bs=1)	87.96%
Code, chat-templated, bs=4 / bs=8 / bs=16	88.27% / 87.92% / 88.19%
Instruction following (IFEval)	~58.5%
AIME 2024 reasoning (thinking=high)	81.60%

Acceptance does not degrade under batching — flat at 88.0% ± 0.4% across bs=1 to bs=16 on chat-templated coding. RTX PRO 6000 acceptance lands in 67–75% on the random-prompt workload (256-in/256-out, not directly comparable to the workload-specific rows above).

Quick start

One-line installer (applies all common patches):

curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash

Serve with MTP spec-decode (B300):

CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Without spec-decode:

CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8

Recommended TP:

B300: TP=4. TP=8 is slower than TP=4 at bs≥4 by up to 21.6% — per-rank MoE expert shards at TP=8 underutilize NVFP4 tensor-core kernels.
RTX PRO 6000: TP=4 with reduced cudagraph captures + --max-num-seqs 8 --max-num-batched-tokens 2048 to fit memory. TP=2 also works; expect 1.22× lower per-replica throughput at bs=16.

Quantization recipe

Property	Value
Dataset	`HuggingFaceH4/ultrachat_200k` train_sft (V4 chat template)
Samples	64 × max_seq_len 512 × batch_size 1, seed 42
Modifier class	`QuantizationModifier` (not GPTQ — Hessian-reduce path hangs on multi-rank B300)
Hardware	calibration on B300

Calibration corpus is 12× smaller than RedHat's reference recipe (64 vs 768 samples). On the benchmarks measured, GSM8K / HumanEval / IFEval / MMLU-Pro / AIME-non-truncated all land within noise of the reference. The visible cost of reduced coverage is AIME truncation rate (5/30 vs RedHat's 2/30 at the 65K max_tokens cap), consistent with looser calibration scales producing less-converging reasoning trajectories. A v0.2 recipe with 768 samples is planned.

Group	Modules	Scheme	Format
attention	`wq_a, wq_b, wkv, wo_a, wo_b` (and fused variants)	FP8_BLOCK 128×128, weight static + input dynamic FP8 group=128	`float-quantized`
experts	`w1, w2, w3` per expert	NVFP4 group=16, weight static + input dynamic "local" FP4 group=16	`nvfp4-pack-quantized`
ignored	`lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*`	unquantized (BF16)	n/a
MTP block (`mtp.0.*`)	all 799 keys	unquantized (BF16, preserved verbatim)	n/a

vLLM build

Common patches (all platforms)

PR	Purpose	Status
`vllm-project/vllm#43248`	`bool()` wrap on `is_static_input_scheme`	open
`vllm-project/vllm#43288`	`.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up	open
`vllm-project/vllm#43290`	`weight_scale_inv`-or-`weight_scale` fallback	open
`vllm-project/vllm#43319`	MTP-quant-detect from safetensors header + BF16 `wo_a` fallback path	open

The one-line installer applies all four automatically.

RTX PRO 6000 Blackwell (SM 12.0) only

Three SM 12.0-specific patches required on top of the four common patches. Diffs in patches/sm120_*.diff in the source repo. Full rationale at docs/RECIPE_RTX6000PRO.md.

VLLM_TEST_FORCE_FP8_MARLIN=1 env var — bypasses the NVFP4 MoE backend selector's swiglu_limit filter (no FLASHINFER_TRTLLM NVFP4 kernel auto-selects on SM 12.0).
weight_scale_inv-or-weight_scale fallback in Marlin's scaled_mm/marlin.py (PR #43290 covers attention.py only; SM 12.0 also hits Marlin's pre-process site).
Skip Marlin pre-processing for layers tagged is_bmm=True — DSV4 wo_a/wo_b/compressor.wkv use the SM 12.0 Triton fp8_einsum kernel directly; Marlin's tile-layout repack breaks the original (N, K) layout the einsum expects.

B300 deployments can skip all three.

Honest limitations

AIME truncation rate at 65K — 5/30 of responses hit the cap on long reasoning traces vs RedHat's 2/30. Consistent with the 12×-smaller calibration corpus producing less-converging reasoning trajectories. Non-truncated pass@1 is at parity with RedHat. v0.2 with 768 samples planned.
NVFP4 MoE backend selector on SM 12.0 — no FLASHINFER_TRTLLM kernel auto-selects, requires the VLLM_TEST_FORCE_FP8_MARLIN=1 env var to route through Marlin. Native NVFP4 SM 12.0 kernels exist in upstream vLLM (csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu) but aren't picked by the backend selector (vllm-project/vllm#31085).
k=1 cap on RTX PRO 6000 — SM 12.0 caps spec-decode at num_speculative_tokens=1; B300 supports k=2.
AIME thinking acceptance @ 81.60% is lower than the chat-code 87.96% headline — workload-dependent, expected, called out for transparency.
IFEval re-bench 2026-05-24 (RTX PRO 6000 TP=4) — close to published B300 numbers but slightly lower. A fresh lm_eval ifeval --apply_chat_template num_concurrent=16 measurement on RTX PRO 6000 TP=4 (post-PR-#40923-rebuild) returned prompt_strict 0.8429, prompt_loose 0.8780, inst_strict 0.8945, inst_loose 0.9185 — within 0.6–1.5 pp of the published markdown numbers (0.8540 / 0.8928 / 0.9005 / 0.9293). The published numbers likely came from B300 (the primary benchmark platform); RTX PRO 6000 measurements are slightly lower but consistent. Raw JSON evidence now committed at benchmarks/rtxpro6000/ifeval_2026_05_24.json. Originally flagged as "no on-disk JSON evidence"; that gap is now closed.

Reproduction

Full replication recipe at docs/recipes/nvfp4_fp8_mtp_replication.md — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).

Upstream contributions filed during this work

PR / Issue	Description	Status
`vllm-project/vllm#43248`	`bool()` wrap on `is_static_input_scheme`	open
`vllm-project/vllm#43288`	`.get("scale_fmt", "ue8m0")` defensive + BF16 follow-up	open
`vllm-project/vllm#43290`	`weight_scale_inv`-or-`weight_scale` fallback	open
`vllm-project/vllm#43319`	MTP-quant-detect from safetensors + BF16 `wo_a` fallback	open
`vllm-project/vllm#43297`	`(1,)`-shape `global_scale` loader broadcast (issue)	open
`vllm-project/vllm#43304`	MTP draft inherits main quant scheme (issue)	partially addressed by #43319
`vllm-project/llm-compressor#2745`	MTP inference-mode crash	open
`vllm-project/compressed-tensors#711`	sharded-module load path	open

PR vllm-project/vllm#42209 (sychen52, xinli-sw, pavanimajety, zyongye — NVIDIA) which added the DSV4 NVFP4 MoE kernel merged 2026-05-22; this artifact serves on top of that.

Changes

Date	Change
2026-05-21	Initial release on B300 — GSM8K 0.9181, HumanEval 0.915, IFEval 0.8540, MTP acceptance 87.96% on chat-code
2026-05-23	RTX PRO 6000 Blackwell (SM 12.0) validation added. TP=2 and TP=4 confirmed, MTP acceptance 67–75%, GSM8K-50 within noise of B300
2026-05-24	Cross-card finding: AIME c=4 thinking-mode on RTX PRO 6000 shows this NVFP4 artifact produces 1/30 token-corrupted generations vs the W4A16-MTP sibling's 14/30 corrupted on the same hardware + vLLM build. The W4A16 sibling has a Marlin MoE decode race on SM 12.0; this NVFP4 artifact via `flashinfer_trtllm` MoE is the recommended deployment for batched thinking-mode on RTX PRO 6000. Filed upstream: `jasl/vllm#12`.

Files in the artifact

35 sharded model-*.safetensors files + model.safetensors.index.json (172 GB total)
config.json — vLLM-compatible quantization_config with fused targets + W8A8 input_activations
tokenizer.json, tokenizer_config.json, generation_config.json — upstream DSV4-Flash
chat_template.jinja — upstream DSV4-Flash (unchanged)
recipe.yaml — the llm-compressor calibration recipe
README.md — this file

Citation

@misc{canada-quant-dsv4-flash-nvfp4-fp8-mtp-2026,
  title  = {DeepSeek-V4-Flash NVFP4-FP8 with MTP preserved for vLLM speculative decoding},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.

Acknowledgments

DeepSeek for V4-Flash and the MTP architecture.
RedHat AI for the NVFP4-FP8 reference recipe.
PR #42209 contributors (sychen52, xinli-sw, pavanimajety, zyongye) for the DSV4 NVFP4 MoE kernel work that made serving possible.
canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (W4A16 sibling) for the alias-injection pattern and MTP acceptance methodology.
vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.

Downloads last month: 81

Safetensors

Model size

171B params

Tensor type

F32

I32

BF16

F8_E4M3

Model tree for canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP

Base model

deepseek-ai/DeepSeek-V4-Flash

Quantized

(55)

this model