canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP
NVFP4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained — same quantization math as RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 but with the MTP block preserved in the saved weights so vLLM can load it with --speculative-config method=mtp.
TL;DR
| Property | Value |
|---|---|
| Recommended hardware | 4× B300 TP=4 · or RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica); both validated |
| Quality | GSM8K 91.81% strict (8-shot); MMLU-Pro 81.13%; HumanEval pass@1 0.915 (EvalPlus) |
| Throughput | 278.68 output tok/s @ bs=1 chat-code on B300 TP=4 (2.13× vs RedHat NVFP4); RTX PRO 6000 94.6 @ TP=2 / 101.0 @ TP=4 at bs=1 |
| MTP acceptance | 87.96% on chat-code at bs=1 / k=2 — flat across bs=1 to bs=16 |
| Spec-decode speedup | 1.8–2.1× decode vs RedHat NVFP4 (workload-dependent) |
| Differentiator | Only V4-Flash NVFP4 quant where --speculative-config method=mtp actually fires — RedHat's artifact dropped MTP during calibration load |
Family / related artifacts
| Repo | Role | Relation to this artifact |
|---|---|---|
| canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP | sibling | W4A16 routed experts (Hopper-compatible), MTP retained; same MTP-preservation pattern. Note: on RTX PRO 6000 (SM 12.0) the W4A16 sibling's Marlin MoE decode path corrupts ~50% of generations under concurrent thinking-mode load. For batched thinking-mode workloads on SM 12.0, this NVFP4 artifact is the recommended choice. See Card D's Honest limitations and the debug log. |
| canada-quant/DeepSeek-V4-Flash-W4A16-FP8 | predecessor (no-MTP baseline) | W4A16 + FP8 without MTP; broadest hardware compatibility |
| canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP | larger sibling | Same NVFP4 + MTP recipe applied to V4-Pro; B300-only deployment |
| RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 | upstream reference | Same quant math; MTP block dropped by transformers silent-strip (the bug this artifact fixes) |
Why this exists
The HF transformers DSV4 modeling class declares _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"], which silently strips MTP keys during the calibration load path. RedHat's NVFP4-FP8 artifact ran through that path, so their saved weights don't include MTP — and serving cannot use V4-Flash's spec-decode head. This artifact patches the modeling class during calibration so MTP keys (mtp.0.*, 799 tensors) survive at BF16. The result: an NVFP4 artifact that's structurally identical to RedHat's on the math, but loadable with --speculative-config method=mtp for ~2× decode speedup.
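The effect of that ignore pattern can be illustrated with a small sketch. It is a simplified model of the filter, assuming search-style regex matching against each checkpoint key; the key names below are hypothetical, and `split_checkpoint_keys` is an illustrative helper, not a transformers API.

```python
import re

# Pattern quoted above for the DSV4 modeling class.
IGNORE_UNEXPECTED = [r"(^|\.)mtp\..*"]


def split_checkpoint_keys(keys, ignore_patterns):
    """Partition keys into (kept, dropped) the way an ignore-on-load filter would."""
    compiled = [re.compile(p) for p in ignore_patterns]
    kept, dropped = [], []
    for key in keys:
        if any(rx.search(key) for rx in compiled):
            dropped.append(key)
        else:
            kept.append(key)
    return kept, dropped


# Hypothetical key names, for illustration only.
sample_keys = [
    "mtp.0.attn.wq_a.weight",
    "model.mtp.0.ffn.gate.weight",
    "layers.0.attn.wq_a.weight",
    "layers.0.ffn.experts.3.w1.weight",
]

# Default behaviour: MTP keys match the pattern and disappear silently.
kept_default, dropped_default = split_checkpoint_keys(sample_keys, IGNORE_UNEXPECTED)

# Patched behaviour (empty ignore list): every key survives.
kept_patched, dropped_patched = split_checkpoint_keys(sample_keys, [])

print("dropped by default:", dropped_default)
print("dropped when patched:", dropped_patched)
```

Clearing the ignore list is only one way to model the patch; the actual calibration patch may differ in detail.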
Architecture & precision
Base model
| Property | Value |
|---|---|
| Total parameters | |
| Decoder layers | 43 |
| Routed experts / layer | 256 (top-K = 6) |
| Hidden size | 4096 |
| Base BF16 size | ~600 GB |
| Quantized size | 172 GB across 35 safetensors shards |
Component precisions
| Component | Format | Method |
|---|---|---|
| Routed FFN experts (w1, w2, w3 per expert) | NVFP4 group=16 | weight static + input dynamic "local" FP4 group=16, nvfp4-pack-quantized |
| Attention path (wq_a, wq_b, wkv, wo_a, wo_b and fused) | FP8_BLOCK 128×128 | weight static + input dynamic FP8 group=128, float-quantized |
| MTP block (mtp.0.*) | BF16 | Preserved verbatim (799 tensors) |
| lm_head, embed_tokens, norms, ffn.gate, ffn.shared_experts, attn compressor, attn indexer, attn_sink, hc_* | BF16 | Unquantized |
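To make "NVFP4 group=16" concrete, here is a minimal fake-quantization sketch: each group of 16 values shares one scale, and every value is rounded to the nearest 4-bit E2M1 magnitude. It is a simplification, not the llm-compressor implementation: the per-group scale stays a plain Python float here, whereas real NVFP4 stores it in FP8 and adds a per-tensor global scale. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def quantize_group(values):
    """Fake-quantize one group; returns (scale, dequantized values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for v in values:
        # Round the scaled magnitude to the nearest representable grid point.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return scale, out


def fake_quantize_nvfp4(row):
    """Quantize and dequantize a flat list whose length is a multiple of 16."""
    if len(row) % GROUP_SIZE != 0:
        raise ValueError("row length must be a multiple of the group size")
    result = []
    for start in range(0, len(row), GROUP_SIZE):
        _, deq = quantize_group(row[start:start + GROUP_SIZE])
        result.extend(deq)
    return result


row = [0.01 * i for i in range(-16, 16)]  # 32 values = 2 groups
deq = fake_quantize_nvfp4(row)
max_err = max(abs(a - b) for a, b in zip(row, deq))
print(f"max abs round-trip error: {max_err:.4f}")
```

Because each group is scaled by its own maximum, an outlier only degrades resolution for the 15 values that share its group.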
Hardware validated
| Platform | SM | Memory/GPU | Interconnect | TP | Role |
|---|---|---|---|---|---|
| 4× NVIDIA B300 SXM6 AC | 10.3, sm_103a | 288 GB HBM3e | NVLink | 4 (TP=8 for BF16 reference) | Primary: all accuracy + throughput numbers |
| 4× NVIDIA RTX PRO 6000 Blackwell Server Edition | 12.0, sm_120 | 96 GB GDDR7 | PCIe | TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica) | Also validated: both TP configs + GSM8K-50 cross-check, 3 extra patches |
Both platforms serve with CUDA graphs on. Same artifact, no weight changes between SKUs.
Benchmarks
Quality (hardware-invariant — measured on B300)
Measured 2026-05-21 on 4× B300 SXM6 AC (TP=4 for quant configs, TP=8 for BF16 reference which doesn't fit at TP=4). Greedy, temperature 0. The same artifact serves on RTX PRO 6000 Blackwell with no weight changes; GSM8K-50 cross-check: 88% strict TP=2 / 90% strict TP=4 on RTX PRO 6000 vs 91.81% strict full-set on B300 (within noise).
| Benchmark | Setting | This artifact | BF16 + MTP reference | RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 (no MTP) |
|---|---|---|---|---|
| AIME 2024 | raw pass@1, thinking=high, max_tokens=65536 | 25/30 = 83.33% | 25/30 = 83.33% | 27/30 = 90.00% |
| AIME 2024 | non-truncated pass@1 | 24/25 = 96.00% | 25/26 = 96.15% | 27/28 = 96.43% |
| AIME 2024 | wall-clock for 30 problems @ bs=8 | 476 s | 490 s | 1405 s |
| GSM8K | 8-shot, strict-match | 0.9181 | 0.9484 / 0.9522 (no-MTP / MTP) | 0.910 (self-reported) |
| GSM8K | 8-shot, flexible-extract | 0.9515 | 0.9477 / 0.9515 | not reported |
| MMLU-Pro | 5-shot, custom-extract | 0.8113 | not measured | not reported |
| HumanEval | pass@1 (EvalPlus) | 0.915 | not measured | 0.896 |
| HumanEval+ | pass@1 (EvalPlus) | 0.848 | not measured | 0.860 |
| IFEval | prompt-strict (B300) | 0.8540 | not measured | 0.8207 |
| IFEval | prompt-strict (RTX PRO 6000 TP=4, 2026-05-24, JSON evidence) | 0.8429 (-1.1pp vs B300) | — | — |
On raw AIME pass@1, RedHat scores higher (27/30 vs ours 25/30) — but the gap is entirely truncation rate at the 65K max_tokens cap (RedHat truncated 2/30, ours 5/30). On non-truncated pass@1, all three configs are within 0.4 pt of each other (96.0–96.4%). Quantization quality is equivalent on AIME 2024; the differentiator is wall-clock.
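The two AIME metrics differ only in their denominator, as this short sketch shows. The per-problem records are reconstructed from the table's fractions for this artifact (25/30 raw, 24/25 non-truncated), which imply that one truncated response was still scored correct; the card does not publish the per-problem breakdown, so treat the record list as an assumption.

```python
def pass_at_1(records):
    """records: list of (correct, truncated) booleans, one pair per problem.

    Returns (raw pass@1, non-truncated pass@1).
    """
    total = len(records)
    raw = sum(1 for correct, _ in records if correct) / total
    finished = [correct for correct, truncated in records if not truncated]
    non_truncated = sum(finished) / len(finished) if finished else float("nan")
    return raw, non_truncated


# Reconstructed counts for this artifact (assumed breakdown, see lead-in).
records = (
    [(True, False)] * 24    # finished and correct
    + [(False, False)] * 1  # finished but wrong
    + [(True, True)] * 1    # truncated yet scored correct
    + [(False, True)] * 4   # truncated and wrong
)

raw, non_truncated = pass_at_1(records)
print(f"raw pass@1: {raw:.4f}, non-truncated pass@1: {non_truncated:.4f}")
```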
Throughput
4× B300 SXM6 (sm_103a, NVLink, TP=4)
Same hardware, same TP=4, same prompts as the quality table.
| Workload | Operating point | This artifact | RedHat NVFP4 (no MTP) | Ratio |
|---|---|---|---|---|
| AIME 2024 reasoning (thinking=high, bs=8) | wall-clock for 30 problems | 476 s | 1405 s | 2.95× |
| AIME 2024 reasoning | per-request median output tok/s | 182.9 | 99.6 | 1.84× |
| Coding (HumanEval chat, bs=1) | output tok/s | 278.68 | 131.06 | 2.13× |
| Coding (HumanEval chat, bs=4) | output tok/s | 649.35 | 417.87 | 1.55× |
| Coding (HumanEval chat, bs=8) | output tok/s | 1104.89 | 673.12 | 1.64× |
| Coding (HumanEval chat, bs=16) | output tok/s | 1577.20 | 1007.78 | 1.56× |
Two ratios to disambiguate:
- Pure decode throughput: at bs=1 chat coding, 2.13× faster. On AIME reasoning at bs=8, per-request median is 182.9 vs 99.6 tok/s, or 1.84×. The decode ratio is workload-dependent (acceptance % varies): it lands in the 1.8–2.1× range for these two operating points, and in the 1.55–1.64× range for batched chat coding at bs=4 to bs=16.
- AIME batch wall-clock: 1405 s / 476 s = 2.95×. This figure also reflects the truncation-rate differential at 65K: 5/30 of our responses truncated vs 2/30 of RedHat's, and truncated responses run to the cap, so total wall-clock depends on response lengths as well as decode speed. The 2.95× number is "time to run AIME 2024 end-to-end," not "raw decode speed."
4× RTX PRO 6000 Blackwell (sm_120, PCIe, TP=2 and TP=4)
Validated 2026-05-23 on a Brev familiar-teal-worm instance. Per-replica vllm bench serve random 256-in/256-out, num_speculative_tokens=1 (SM 12.0 caps spec at k=1). MTP-on for all rows.
| Config | bs=1 output tok/s | bs=4 output tok/s | bs=16 output tok/s | bs=1 TPOT median | MTP acceptance | GSM8K-50 strict |
|---|---|---|---|---|---|---|
| TP=2 | 94.6 | 218.5 | 360.5 | 9.05 ms | 70–73% | 88% |
| TP=4 | 101.0 | 254.0 | 440.1 | 8.20 ms | 67–75% | 90% |
At bs=16, TP=4 is 1.22× faster per-replica than TP=2 on this hardware — opposite of B300, where TP=4 beats TP=8 due to NVFP4 tensor-core underutilization. RTX PRO 6000's slower PCIe interconnect plus lower per-GPU compute means extra parallelism still pays off at all batch sizes measured.
For context on the same RTX PRO 6000 box, the W4A16-FP8-MTP sibling measured 98.83 tok/s at TP=2 bs=1, which is roughly equivalent decode throughput. This NVFP4 artifact is about 4% slower per replica (94.6 vs 98.83 tok/s) and, at 172 GB vs 159 GB on disk, about 8% larger, not smaller.
AIME-2024 deep thinking-mode concurrency sweep (2026-05-25, TP=4)
cuda graphs ON (capture sizes [1,2,4,8]), MTP num_speculative_tokens=1, max-model-len=16384. Bench JSONs at canada-quant/dsv4-flash-nvfp4-fp8-mtp/benchmarks/rtxpro6000/.
| Concurrency | Correct/30 | Stop / Length | Errors | Wall (s) | Problems/min | MTP accept | Speedup vs c=1 |
|---|---|---|---|---|---|---|---|
| c=1 (sequential) | 24/30 (80.0%) | 22 / 8 | 0 | 1453.9 | 1.24 | 90.61% | 1.0× |
| c=2 | 23/30 (76.7%) | 23 / 7 | 0 | 787.6 | 2.29 | 90.75% | 1.85× |
| c=4 | 21/30 (70.0%) | 20 / 10 | 0 | 386.6 | 4.66 | 90.93% | 3.76× |
| c=8 | (terminated) | — | — | — | — | — | — |
Findings:
- 0 errors and 0 stopped-but-wrong at c=1/2/4. Every wrong answer is length-truncated at max_tokens, not a quality issue; non-truncated pass@1 is essentially 100%.
- MTP acceptance stable at 90.6–90.9% across c=1/c=2/c=4. The NVFP4 flashinfer_trtllm MoE backend on SM 12.0 is rock-solid under all tested concurrencies (unlike the W4A16 sibling's Marlin MoE path; see Card D for that story).
- c=8 throughput collapse: TP=4 with no NVLink (PCIe-only) drops combined throughput from 450 t/s @ c=4 to ~38 t/s @ c=8, a 12× per-request slowdown. MTP itself stayed healthy; the bottleneck is TP-allreduce communication over PCIe at high concurrency. Recommendation for higher aggregate throughput on RTX PRO 6000: run 2 replicas at TP=2 instead of 1 replica at TP=4 c=8.
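The Problems/min and Speedup columns of the sweep follow directly from the wall-clock times, and dividing speedup by concurrency gives a scaling efficiency that the table does not list. The sketch below recomputes them from the three completed rows; `sweep_summary` is a hypothetical helper name.

```python
# Concurrency -> wall-clock seconds for 30 problems, copied from the sweep table.
SWEEP = {1: 1453.9, 2: 787.6, 4: 386.6}
PROBLEMS = 30


def sweep_summary(wall_by_concurrency, problems=PROBLEMS):
    """Derive problems/min, speedup vs c=1, and scaling efficiency per row."""
    base = wall_by_concurrency[1]
    rows = []
    for c in sorted(wall_by_concurrency):
        wall = wall_by_concurrency[c]
        speedup = base / wall
        rows.append({
            "concurrency": c,
            "problems_per_min": problems / (wall / 60.0),
            "speedup": speedup,
            "efficiency": speedup / c,  # 1.0 would be perfect linear scaling
        })
    return rows


rows = sweep_summary(SWEEP)
for r in rows:
    print(f"c={r['concurrency']}: {r['problems_per_min']:.2f} problems/min, "
          f"{r['speedup']:.2f}x, efficiency {r['efficiency']:.0%}")
```

Efficiency stays above 90% through c=4, which makes the collapse at c=8 stand out as a change in regime rather than a gradual decline.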
MTP draft-token acceptance per workload (B300, bs=1, k=2)
| Workload | Acceptance |
|---|---|
| Random prompts (1024 in / 512 out) | 10.75% |
| Code, raw completion (HumanEval /v1/completions) | 67.29% |
| Code, chat-templated (HumanEval /v1/chat/completions, bs=1) | 87.96% |
| Code, chat-templated, bs=4 / bs=8 / bs=16 | 88.27% / 87.92% / 88.19% |
| Instruction following (IFEval) | ~58.5% |
| AIME 2024 reasoning (thinking=high) | 81.60% |
Acceptance does not degrade under batching — flat at 88.0% ± 0.4% across bs=1 to bs=16 on chat-templated coding. RTX PRO 6000 acceptance lands in 67–75% on the random-prompt workload (256-in/256-out, not directly comparable to the workload-specific rows above).
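As a rough guide to what an acceptance rate buys, the sketch below uses the standard speculative-decoding estimate: if each draft position is accepted independently with probability a, a verify step with k draft tokens emits 1 + a + ... + a^k tokens on average. This is an idealized model under two assumptions not confirmed by the card: that the reported acceptance figure can be read as a per-position probability, and that drafting costs nothing. Treat the result as an upper bound on decode speedup.

```python
def expected_tokens_per_step(acceptance, k):
    """Expected tokens emitted per verify step: 1 + a + a^2 + ... + a^k.

    Assumes independent, identical acceptance probability at every draft position.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return sum(acceptance ** i for i in range(k + 1))


# Chat-templated coding on B300 (k=2) and the k=1 cap on SM 12.0.
b300_chat_code = expected_tokens_per_step(0.8796, k=2)
sm120_low = expected_tokens_per_step(0.67, k=1)
sm120_high = expected_tokens_per_step(0.75, k=1)

print(f"B300 chat-code, k=2: {b300_chat_code:.2f} tokens/step")
print(f"RTX PRO 6000, k=1: {sm120_low:.2f} to {sm120_high:.2f} tokens/step")
```

The measured 2.13× speedup at bs=1 chat coding sits below the idealized B300 figure, as expected once the cost of running the MTP head is included.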
Quick start
One-line installer (applies all common patches):
curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash
Serve with MTP spec-decode (B300):
CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
Without spec-decode:
CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8
Recommended TP:
- B300: TP=4. TP=8 is slower than TP=4 at bs≥4 by up to 21.6% — per-rank MoE expert shards at TP=8 underutilize NVFP4 tensor-core kernels.
- RTX PRO 6000: TP=4 with reduced cudagraph captures + --max-num-seqs 8 --max-num-batched-tokens 2048 to fit memory. TP=2 also works; expect 1.22× lower per-replica throughput at bs=16.
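For scripted launches, the per-platform settings above can be assembled into a command string. The sketch below is illustrative: `build_serve_command` is a hypothetical helper, it uses only flags that appear in this card, and it leaves out the environment variables and the reduced cudagraph-capture setting (the card does not name a flag for the latter).

```python
import json
import shlex

MODEL = "canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP"


def build_serve_command(tp, num_speculative_tokens=None, extra_args=()):
    """Return a shell-quoted `vllm serve` command line for this artifact."""
    argv = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(tp),
        "--kv-cache-dtype", "fp8",
    ]
    if num_speculative_tokens is not None:
        spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
        # Compact separators reproduce the JSON layout used in the commands above.
        argv += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    argv += list(extra_args)
    return shlex.join(argv)


# B300: TP=4 with k=2.
b300_cmd = build_serve_command(tp=4, num_speculative_tokens=2)

# RTX PRO 6000: TP=4 with k=1 (the SM 12.0 cap) plus the memory-fitting flags.
rtx_cmd = build_serve_command(
    tp=4,
    num_speculative_tokens=1,
    extra_args=["--max-num-seqs", "8", "--max-num-batched-tokens", "2048"],
)

print(b300_cmd)
print(rtx_cmd)
```

Building the argument list first and quoting it with `shlex.join` avoids hand-escaping the JSON value of --speculative-config.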
Quantization recipe
| Property | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k train_sft (V4 chat template) |
| Samples | 64 × max_seq_len 512 × batch_size 1, seed 42 |
| Modifier class | QuantizationModifier (not GPTQ — Hessian-reduce path hangs on multi-rank B300) |
| Hardware | calibration on B300 |
Calibration corpus is 12× smaller than RedHat's reference recipe (64 vs 768 samples). On the benchmarks measured, GSM8K / HumanEval / IFEval / MMLU-Pro / AIME-non-truncated all land within noise of the reference. The visible cost of reduced coverage is AIME truncation rate (5/30 vs RedHat's 2/30 at the 65K max_tokens cap), consistent with looser calibration scales producing less-converging reasoning trajectories. A v0.2 recipe with 768 samples is planned.
| Group | Modules | Scheme | Format |
|---|---|---|---|
| attention | wq_a, wq_b, wkv, wo_a, wo_b (and fused variants) | FP8_BLOCK 128×128, weight static + input dynamic FP8 group=128 | float-quantized |
| experts | w1, w2, w3 per expert | NVFP4 group=16, weight static + input dynamic "local" FP4 group=16 | nvfp4-pack-quantized |
| ignored | lm_head, embed_tokens, norms, ffn.gate, ffn.shared_experts, attn compressor, attn indexer, attn_sink, hc_* | unquantized (BF16) | n/a |
| MTP block (mtp.0.*) | all 799 keys | unquantized (BF16, preserved verbatim) | n/a |
vLLM build
Common patches (all platforms)
| PR | Purpose | Status |
|---|---|---|
| vllm-project/vllm#43248 | bool() wrap on is_static_input_scheme | open |
| vllm-project/vllm#43288 | .get("scale_fmt", "ue8m0") on missing key + BF16 getattr follow-up | open |
| vllm-project/vllm#43290 | weight_scale_inv-or-weight_scale fallback | open |
| vllm-project/vllm#43319 | MTP-quant-detect from safetensors header + BF16 wo_a fallback path | open |
The one-line installer applies all four automatically.
RTX PRO 6000 Blackwell (SM 12.0) only
Three SM 12.0-specific patches required on top of the four common patches. Diffs in patches/sm120_*.diff in the source repo. Full rationale at docs/RECIPE_RTX6000PRO.md.
- VLLM_TEST_FORCE_FP8_MARLIN=1 env var: bypasses the NVFP4 MoE backend selector's swiglu_limit filter (no FLASHINFER_TRTLLM NVFP4 kernel auto-selects on SM 12.0).
- weight_scale_inv-or-weight_scale fallback in Marlin's scaled_mm/marlin.py (PR #43290 covers attention.py only; SM 12.0 also hits Marlin's pre-process site).
- Skip Marlin pre-processing for layers tagged is_bmm=True: DSV4 wo_a / wo_b / compressor.wkv use the SM 12.0 Triton fp8_einsum kernel directly; Marlin's tile-layout repack breaks the original (N, K) layout the einsum expects.
B300 deployments can skip all three.
Honest limitations
- AIME truncation rate at 65K — 5/30 of responses hit the cap on long reasoning traces vs RedHat's 2/30. Consistent with the 12×-smaller calibration corpus producing less-converging reasoning trajectories. Non-truncated pass@1 is at parity with RedHat. v0.2 with 768 samples planned.
- NVFP4 MoE backend selector on SM 12.0: no FLASHINFER_TRTLLM kernel auto-selects, so the VLLM_TEST_FORCE_FP8_MARLIN=1 env var is required to route through Marlin. Native NVFP4 SM 12.0 kernels exist in upstream vLLM (csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu) but aren't picked by the backend selector (vllm-project/vllm#31085).
- k=1 cap on RTX PRO 6000: SM 12.0 caps spec-decode at num_speculative_tokens=1; B300 supports k=2.
- AIME thinking acceptance @ 81.60% is lower than the chat-code 87.96% headline: workload-dependent, expected, called out for transparency.
- IFEval re-bench 2026-05-24 (RTX PRO 6000 TP=4): close to published B300 numbers but slightly lower. A fresh lm_eval ifeval --apply_chat_template num_concurrent=16 measurement on RTX PRO 6000 TP=4 (post-PR-#40923-rebuild) returned prompt_strict 0.8429, prompt_loose 0.8780, inst_strict 0.8945, inst_loose 0.9185, within 0.6–1.5 pp of the published markdown numbers (0.8540 / 0.8928 / 0.9005 / 0.9293). The published numbers likely came from B300 (the primary benchmark platform); RTX PRO 6000 measurements are slightly lower but consistent. Raw JSON evidence now committed at benchmarks/rtxpro6000/ifeval_2026_05_24.json. Originally flagged as "no on-disk JSON evidence"; that gap is now closed.
Reproduction
Full replication recipe at docs/recipes/nvfp4_fp8_mtp_replication.md — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).
Upstream contributions filed during this work
| PR / Issue | Description | Status |
|---|---|---|
| vllm-project/vllm#43248 | bool() wrap on is_static_input_scheme | open |
| vllm-project/vllm#43288 | .get("scale_fmt", "ue8m0") defensive + BF16 follow-up | open |
| vllm-project/vllm#43290 | weight_scale_inv-or-weight_scale fallback | open |
| vllm-project/vllm#43319 | MTP-quant-detect from safetensors + BF16 wo_a fallback | open |
| vllm-project/vllm#43297 | (1,)-shape global_scale loader broadcast (issue) | open |
| vllm-project/vllm#43304 | MTP draft inherits main quant scheme (issue) | partially addressed by #43319 |
| vllm-project/llm-compressor#2745 | MTP inference-mode crash | open |
| vllm-project/compressed-tensors#711 | sharded-module load path | open |
PR vllm-project/vllm#42209 (sychen52, xinli-sw, pavanimajety, zyongye, NVIDIA), which added the DSV4 NVFP4 MoE kernel, was merged 2026-05-22; this artifact serves on top of that.
Changes
| Date | Change |
|---|---|
| 2026-05-21 | Initial release on B300 — GSM8K 0.9181, HumanEval 0.915, IFEval 0.8540, MTP acceptance 87.96% on chat-code |
| 2026-05-23 | RTX PRO 6000 Blackwell (SM 12.0) validation added. TP=2 and TP=4 confirmed, MTP acceptance 67–75%, GSM8K-50 within noise of B300 |
| 2026-05-24 | Cross-card finding: AIME c=4 thinking-mode on RTX PRO 6000 shows this NVFP4 artifact produces 1/30 token-corrupted generations vs the W4A16-MTP sibling's 14/30 corrupted on the same hardware + vLLM build. The W4A16 sibling has a Marlin MoE decode race on SM 12.0; this NVFP4 artifact via flashinfer_trtllm MoE is the recommended deployment for batched thinking-mode on RTX PRO 6000. Filed upstream: jasl/vllm#12. |
Files in the artifact
- 35 sharded model-*.safetensors files + model.safetensors.index.json (172 GB total)
- config.json: vLLM-compatible quantization_config with fused targets + W8A8 input_activations
- tokenizer.json, tokenizer_config.json, generation_config.json: upstream DSV4-Flash
- chat_template.jinja: upstream DSV4-Flash (unchanged)
- recipe.yaml: the llm-compressor calibration recipe
- README.md: this file
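One quick way to confirm that a downloaded copy still carries the MTP block is to count the mtp.0.* entries in model.safetensors.index.json, whose weight_map maps tensor names to shard files. The sketch below assumes the keys start with the mtp.0. prefix as written in this card; the toy index and shard names are hypothetical.

```python
import json


def load_index(path):
    """Read a safetensors index file (JSON with a 'weight_map' dict)."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def count_keys_with_prefix(index, prefix="mtp.0."):
    """Count tensor names in the index whose name starts with the prefix."""
    return sum(1 for name in index["weight_map"] if name.startswith(prefix))


# Toy index with hypothetical tensor and shard names, for illustration only.
toy_index = {
    "metadata": {},
    "weight_map": {
        "mtp.0.attn.wq_a.weight": "model-00035-of-00035.safetensors",
        "mtp.0.ffn.gate.weight": "model-00035-of-00035.safetensors",
        "layers.0.attn.wq_a.weight": "model-00001-of-00035.safetensors",
    },
}

print(count_keys_with_prefix(toy_index))

# Against the real artifact, the card states the count should be 799:
# index = load_index("model.safetensors.index.json")
# print(count_keys_with_prefix(index))
```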
Citation
@misc{canada-quant-dsv4-flash-nvfp4-fp8-mtp-2026,
title = {DeepSeek-V4-Flash NVFP4-FP8 with MTP preserved for vLLM speculative decoding},
author = {Canada Quant},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP}
}
License
MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.
Acknowledgments
- DeepSeek for V4-Flash and the MTP architecture.
- RedHat AI for the NVFP4-FP8 reference recipe.
- PR #42209 contributors (sychen52, xinli-sw, pavanimajety, zyongye) for the DSV4 NVFP4 MoE kernel work that made serving possible.
- canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (W4A16 sibling) for the alias-injection pattern and MTP acceptance methodology.
- vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.