canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

W4A16 INT4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained — the first DeepSeek-V4-Flash quantization that ships a working MTP block, giving ~1.5× speculative decoding (spec-decode) speedup at bs=1 with no quality cost. Extends the W4A16-FP8 predecessor by patching the transformers calibration path so the MTP block survives the load.

TL;DR

Recommended hardware RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica) — both validated · or 8× H200 TP=2
Quality GSM8K 93.71% (8-shot strict); HumanEval 84.76% pass@1; MMLU 86.88%
Throughput RTX PRO 6000 98.83 @ TP=2 / 107.32 @ TP=4 at bs=1; 88.35 on H200 TP=2
MTP acceptance 89% calibrated workload / 70% on random prompts at bs=1 k=1
Spec-decode speedup 1.49× at bs=1, k=1 (TPOT 6.02 ms vs 8.93 ms, same artifact)
Differentiator First V4-Flash W4A16 quant where MTP survives the calibration load; transformers 5.8.1 silently strips MTP keys by default

Family / related artifacts

Repo Role Relation to this artifact
canada-quant/DeepSeek-V4-Flash-W4A16-FP8 predecessor Same W4A16 + FP8 recipe; MTP dropped at load (the bug this artifact fixes)
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP sibling Same MTP-retention pattern; NVFP4 routed experts instead of W4A16 (Blackwell-native)
canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP larger sibling V4-Pro at NVFP4 + MTP, B300-only deployment
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 upstream reference Original NVFP4 recipe (no MTP — same silent-drop bug)

Why this exists

The W4A16-FP8 predecessor and RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 both drop the MTP block because transformers 5.8.1's DeepseekV4PreTrainedModel declares:

_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]

which silently filters every mtp.* tensor at from_pretrained time — without warning, without error. Calibration pipelines that go through from_pretrained produce quantized main weights paired with an absent MTP block; serving falls back to plain decode, losing the ~1.5–2× spec-decode speedup that V4-Flash's architecture provides.

This artifact bypasses the silent drop, runs the full 8-rank GPTQ calibration on a 768-sample corpus against the main routed experts, preserves the MTP block unquantized in BF16, and produces a serving artifact where speculative decoding actually fires.

Architecture & precision

Base model

Property Value
Total parameters 284 B (13 B active per token)
Decoder layers 43
Routed experts / layer 256 (top-K = 6)
Hidden size 4096
Base BF16 size ~543 GB
Quantized size 159 GB (+3 GB vs predecessor for the BF16 MTP block)

Component precisions

Component Format Method
Routed experts (256 × 43 layers × 3 projections) W4A16 INT4, group_size=128, symmetric GPTQ via llm-compressor, 768 calibration samples
Attention path (wq_a, wq_b, wkv, wo_a, wo_b, indexer, compressor) FP8_BLOCK 128×128 Dynamic scales, scale_fmt=ue8m0
MTP block (mtp.0.*) BF16 Excluded from quantization, preserved verbatim
HC plumbing (hc_attn_*, hc_ffn_*, hc_head_*), attn_sink, ffn.gate.bias, indexer/compressor ape FP32 Restored post-save from BF16 source (see Upstream contributions)
head.weight (LM head) FP32 Upcast from BF16 to match sibling artifact's MTP loader path
Embeddings (embed.weight, mtp.0.emb.tok_emb.weight) BF16 Source dtype preserved

Hardware validated

Platform SM HBM/GPU Interconnect TP Role
8× NVIDIA H200 SXM5 9.0a 141 GB HBM3e NVLink 2 (4× replicas) Calibration + initial benchmarks (p5en.48xlarge)
4× NVIDIA RTX PRO 6000 Blackwell Server Edition 12.0, sm_120 96 GB HBM PCIe TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica) Workstation Blackwell deployment + $/token sweet spot

Same artifact, no weight changes between SKUs. Both validated cuda graphs ON.

Benchmarks

All numbers from the same artifact, vLLM HEAD 50d9dd902 + 4 patches cherry-picked (PRs #43248 / #43288 / #43290 / #43319).

Quality

Sampling: greedy, temperature 0. Methodology disclosed per row.

Benchmark Setting This artifact Predecessor (W4A16-FP8, no MTP) RedHat (NVFP4-FP8, no MTP) Delta
GSM8K 8-shot, strict-match 93.71% ± 0.67 95.07% (RTX PRO 6000) / 95.45% (Spark) 91.0% (self-reported) -1.28 pts vs predecessor (within 1 SE)
GSM8K 8-shot, flexible-extract 93.63% ± 0.67 95.37% (Spark) within SE
MMLU 5-shot 86.88% ± 0.27 87.27% (H200) -0.39 pts (within SE)
MMLU-Pro 5-shot, 12k prompts, custom-extract 71.28% ± 0.40 sibling NVFP4-FP8-MTP scored 81.13% on B300 — expected gap given W4A16 has more quant noise than NVFP4 on knowledge-heavy harder benchmarks
HumanEval 0-shot pass@1, --confirm_run_unsafe_code 84.76% ± 2.82 80.49% (corrected, see predecessor card "Changes") +4.27 pts vs corrected predecessor number
AIME 2024 30 problems, thinking=high 30.0% exact-match ± 8.51 high-difficulty competition math
chat-smoke (quick / quality / coding) harness 4/4 · 4/4 · 2/2 4/4 · 4/4 · 2/2 match
toolcall15 1 round, 30 points 24/30 (80%) 26/30 (87%) -2 pts — see Honest limitations

Throughput

vllm bench serve random 256-in / 256-out, MTP-spec num_speculative_tokens=1 (k=1 cap on this build — see Honest limitations), cuda graphs ON.

Hardware TP bs=1 output tok/s bs=1 TPOT median bs=4 output tok/s bs=16 output tok/s MTP acceptance @ bs=1
8× H200 2 (per replica) 88.35 6.02 ms 138.80 367.13 89% calibrated / 70% random
4× RTX PRO 6000 box TP=2 (per replica, 2 replicas fit) 98.83 8.55 ms 219.53 482.61 71%
4× RTX PRO 6000 box TP=4 (single replica) 107.32 7.77 ms 221.52 584.04 68%

Per-replica, RTX PRO 6000 wins output throughput at every batch size; H200 still wins per-token TPOT median.

MTP draft-token acceptance per workload

Same artifact, bs=1, k=1.

Workload Prompts Accepted / emitted Acceptance
Random 256-token prompts (200 samples) random 21024 / 30058 69.94%
Code, raw completion (15 short signature+docstring prompts) code-raw 1847 / 1988 92.91%
Chat-templated prose (15 prompts) chat-prose 1946 / 2376 81.90%
Raw natural language (15 continuation prompts) nl-raw 1745 / 2086 83.65%

Spec-decode wins at low concurrency (single-user interactive). At bs≥4 the verifier is already filling its batch lane, so extra verifier passes add overhead without saving wall-clock — matches the sibling artifact's framing of bs=1 as the headline operating point.

Cost per output token (node-level)

Boxes priced for cloud-rented hardware. Single-replica numbers measured; multi-replica totals are linear extrapolation.

Box Replicas bs=1 total tok/s bs=16 total tok/s $/h $/(1000 tok/h) at bs=1
p5en.48xlarge (8× H200) 4× TP=2 ~353 ~1468 $98 $278
g7e.24xlarge (4× RTX PRO 6000) 2× TP=2 ~198 ~965 $19.92 $101
g7e.24xlarge (4× RTX PRO 6000) 1× TP=4 107.32 584.04 $19.92 $186

At bs=1 (interactive), RTX PRO 6000 2×TP=2 is ~2.7× cheaper than H200 4×TP=2. At bs=16 the gap narrows because H200's per-replica throughput scales better with batch — H200 wins absolute throughput when you can fill it; RTX wins on $/token unless you genuinely need >1500 tok/s aggregate output.

Quick start

RTX PRO 6000 Blackwell (recommended)

# 1. Bootstrap vLLM (~25 min for source build)
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp
bash scripts/bootstrap_rtx6000pro.sh

# 2. Extra pins
source ~/venv-serve/bin/activate
pip install --quiet "flashinfer-python==0.6.8.post1" "flashinfer-cubin==0.6.8.post1" \
    "numba==0.65.0" "tilelang==0.1.9" "apache-tvm-ffi==0.1.9" "fastsafetensors>=0.2.2"

# 3. Apply patches
python scripts/patch_v4_forcausal_packed_mapping.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_mtp_packed_mapping.py        "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_nvidia_attn_scale.py         "$(python -c 'import vllm; print(vllm.__path__[0])')"
bash   scripts/patch_wo_a_bf16_path.sh             "$(python -c 'import vllm; print(vllm.__path__[0])')"

# 4. Download artifact (159 GiB)
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
    --local-dir /scratch/weights/w4a16-fp8-mtp-gptq

# 5. One-time dequant (~1.5 min)
python scripts/dequant_compressor.py /scratch/weights/w4a16-fp8-mtp-gptq

# 6. Serve TP=2
CUDA_VISIBLE_DEVICES=0,1 bash scripts/serve_rtx6000pro.sh \
    /scratch/weights/w4a16-fp8-mtp-gptq 8000 2

H200

vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
    --tensor-parallel-size 2 \
    --kv-cache-dtype fp8 --block-size 256 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.80 \
    --no-enable-prefix-caching \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
    --trust-remote-code

Quantization recipe

Property Value
Dataset HuggingFaceH4/ultrachat_200k (V4 chat template)
Samples 768
Max sequence length 512
Per-rank batch size 4
Calibration hardware 8× NVIDIA H200 (p5en.48xlarge)
Walltime ~15.4h (15.09h oneshot + ~16 min save)
Per-subgraph cadence ~20 min/subgraph × 44 subgraphs (43 MoE + 1 MTP no-op)

Calibration recipe identical to the W4A16-FP8 predecessor with one change: the modeling class is patched to remove mtp.* from _keys_to_ignore_on_load_unexpected before from_pretrained, so the MTP block survives the load and is written back to the artifact at BF16.

vLLM build

Common patches (all platforms)

PR Purpose Status
vllm-project/vllm#43248 bool() wrap on is_static_input_scheme open
vllm-project/vllm#43288 .get("scale_fmt", "ue8m0") on missing key + BF16 getattr follow-up open
vllm-project/vllm#43290 weight_scale_inv-or-weight_scale fallback (attention) open
vllm-project/vllm#43319 MTP-quant-detect from safetensors header + BF16 wo_a fallback path open

RTX PRO 6000 Blackwell (SM 12.0) only

Patch Purpose
packed_modules_mapping on DeepseekV4ForCausalLM + DeepSeekV4MTP Required as of ds4-sm120-experimental@abad5dc71
BF16 wo_a path for MTP block Static weight.dtype == bfloat16 check (dynamo-safe)
Compressor/indexer FP8 → BF16 dequant preprocess One-time, ~1.5 min
--disable-custom-all-reduce No NVLink between RTX PRO 6000 boards
CMakeLists USE_SABI 3.11 removal For Python 3.10

H200 deployments need only the four common patches.

Honest limitations

  1. k=1 cap on spec-decode — current vLLM build limits num_speculative_tokens to 1 due to DeepGemm kernel assertion next_n == 1 or next_n == 2 in smxx_fp8_fp4_paged_mqa_logits.hpp:233. vLLM passes next_n = num_speculative_tokens + 1, so practical k is 1. The FLASHINFER_MLA_SPARSE attention backend hits the same kernel-side assertion. With the assertion relaxed, expect bs=1 speedup to rise from 1.49× to ~1.85× (matching sibling NVFP4 artifact's k=2 published number).
  2. toolcall15 -2 pts vs predecessor — model-routing regressions on chain-completion (TC-07 stopped mid-chain to ask a clarifying question) and multi-tool extraction (TC-06 returned both translations as content text instead of routing two translate calls). Quality-wise the model completes the underlying intent; the harness scores tool-call-protocol fidelity, not task completion. Not a parser issue (confirmed by replay through --tool-call-parser deepseek_v4).
  3. GSM8K -1.3 pts vs predecessor's 8-shot strict-match — within one SE, but technically below. Likely calibration-set sensitivity rather than recipe drift (recipe is identical, hardware differs).
  4. NVFP4 native kernels on RTX PRO 6000 not auto-selected — even though csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu exists in upstream vLLM, the backend selector doesn't pick it (vllm-project/vllm#31085). Until that lands, the sibling NVFP4 artifact on this hardware would route through Marlin too. This artifact's W4A16 path is the tested choice for RTX PRO 6000.

Reproduction

Full pipeline at canada-quant/dsv4-flash-w4a16-fp8-mtp. From a fresh 8× H200 box:

# Phase 0 — bootstrap (venv-calib + venv-serve + vendor + apply patches)
bash scripts/bootstrap_p5en_h200.sh

# Phase 1 — download upstream + dequant to BF16-MTP source (~30 min, ~660 GB)
bash scripts/phase1_dequant.sh

# Phase 2 — GPTQ calibration (8 ranks, ~15h wall)
bash scripts/run_phase2.sh

# Phase 3 — postprocess (rename + config patch + FP32 restore + MTP aliases)
bash scripts/postprocess_phase2.sh

# Phase 4 — verify
python scripts/verify_option_y.py /scratch/weights/w4a16-fp8-mtp-gptq

# Phase 5 — serve (see Quick start above for serve command)

Upstream contributions filed during this work

Contribution Description Status
transformers — save_pretrained silent FP32 → BF16 downcast 417 tensors specified as FP32 in DeepSeek's release spec (HC plumbing, gate bias, attn_sink, indexer/compressor ape) are silently written as BF16 by save_pretrained when model torch_dtype is BF16. Workaround: postprocess restore from BF16 source via scripts/fixup_artifact.py. Upstream filing pending local
vLLM — MTP loader silently skips top-level head.weight + embed.weight DeepSeekV4MTP.load_weights calls name.replace("mtp.0.", "") which no-ops on non-mtp.0.* keys; get_spec_layer_idx returns None → loop skips. head.weight and embed.weight never reach shared_head.head / embed_tokens → uninitialized → 0% MTP acceptance with no load-time error. Workaround: postprocess injects mtp.0.head.weight and mtp.0.emb.tok_emb.weight as duplicates. Upstream filing pending local
vLLM — DeepGemm paged_mqa_logits asserts on num_speculative_tokens > 1 smxx_fp8_fp4_paged_mqa_logits.hpp:233 enforces next_n == 1 or next_n == 2. With next_n = k+1, practical k cap is 1. Caps spec-decode speedup at 1.49× vs sibling's published 2.03× at k=2 upstream (DeepGemm) — filing pending
vllm-project/vllm#43248 bool() wrap on is_static_input_scheme open
vllm-project/vllm#43288 scale_fmt defensive .get() + BF16 getattr wrap open
vllm-project/vllm#43290 weight_scale_inv-or-weight_scale fallback open
vllm-project/vllm#43319 MTP-quant-detect from safetensors header + BF16 wo_a fallback path open

Changes

Date Change
2026-05-22 Initial release on H200. GSM8K 93.71% strict, MMLU 86.88%, HumanEval 84.76%, MTP acceptance 89% on calibrated workload / 70% on random prompts
2026-05-24 RTX PRO 6000 Blackwell (SM 12.0) added. TP=2 and TP=4 both validated, chat-smoke 4/4 PASS, MTP acceptance 68-72%, MTP-on per-replica throughput 98.83 tok/s @ TP=2 / 107.32 @ TP=4. Per-replica throughput beats H200 at every batch size. vllm-project/vllm#41511 (Marlin TP > 2 bug) did not fire on this build

Files in the artifact

  • 4 sharded model-*.safetensors files + model.safetensors.index.json (159 GB total)
  • config.json — vLLM-compatible quantization_config with MTP block excluded
  • tokenizer.json, tokenizer_config.json, generation_config.json, chat_template.jinja — upstream DSV4-Flash
  • recipe.yaml — the llm-compressor GPTQ recipe
  • README.md — this file

Citation

@misc{canada-quant-dsv4-flash-w4a16-fp8-mtp-2026,
  title  = {DeepSeek-V4-Flash W4A16-FP8 with BF16 MTP retained for vLLM speculative decoding},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash. Review at the upstream repo before commercial deployment.

Acknowledgments

Downloads last month
78
Safetensors
Model size
51B params
Tensor type
I64
·
F32
·
I32
·
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

Quantized
(54)
this model