canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

W4A16 INT4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained. This is the first DeepSeek-V4-Flash quantization that ships a working MTP block, giving ~1.5× speculative decoding (spec-decode) speedup at bs=1 with quality close to the predecessor (per-benchmark deltas are listed under Quality). It extends the W4A16-FP8 predecessor by patching the transformers calibration path so the MTP block survives the load.

TL;DR


Recommended hardware	RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica) — both validated · or 8× H200 TP=2
Quality	GSM8K 93.71% (8-shot strict); HumanEval 84.76% pass@1; MMLU 86.88%
Throughput	RTX PRO 6000: 98.83 tok/s @ TP=2 / 107.32 tok/s @ TP=4 at bs=1; H200: 88.35 tok/s @ TP=2 at bs=1
MTP acceptance	89% calibrated workload / 70% on random prompts at bs=1 k=1
Spec-decode speedup	1.49× at bs=1, k=1 (TPOT 6.02 ms with MTP vs 8.93 ms without, same artifact)
Differentiator	First V4-Flash W4A16 quant where MTP survives the calibration load; `transformers` 5.8.1 silently strips MTP keys by default

Family / related artifacts

Repo	Role	Relation to this artifact
`canada-quant/DeepSeek-V4-Flash-W4A16-FP8`	predecessor	Same W4A16 + FP8 recipe; MTP dropped at load (the bug this artifact fixes)
`canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP`	sibling	Same MTP-retention pattern; NVFP4 routed experts instead of W4A16 (Blackwell-native)
`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`	larger sibling	V4-Pro at NVFP4 + MTP, B300-only deployment
`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`	upstream reference	Original NVFP4 recipe (no MTP — same silent-drop bug)

Why this exists

The W4A16-FP8 predecessor and RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 both drop the MTP block because transformers 5.8.1's DeepseekV4PreTrainedModel declares:

_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]

which silently filters every mtp.* tensor at from_pretrained time — without warning, without error. Calibration pipelines that go through from_pretrained produce quantized main weights paired with an absent MTP block; serving falls back to plain decode, losing the ~1.5–2× spec-decode speedup that V4-Flash's architecture provides.
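A minimal sketch of the filtering effect, using only Python's `re` module. The pattern is the one quoted above; the key names `layers.0.attn.wq_a.weight` and `model.mtp.0.emb.tok_emb.weight` are hypothetical, and applying the pattern with `re.search` is an assumption about how the loader matches it.

```python
import re

# Pattern quoted above from DeepseekV4PreTrainedModel.
IGNORE_PATTERN = re.compile(r"(^|\.)mtp\..*")

# Illustrative checkpoint keys. The layer-level and "model."-prefixed
# names are hypothetical and only exercise both branches of the pattern.
checkpoint_keys = [
    "embed.weight",
    "head.weight",
    "layers.0.attn.wq_a.weight",
    "mtp.0.emb.tok_emb.weight",
    "model.mtp.0.emb.tok_emb.weight",
]


def split_keys(keys):
    """Return (kept, dropped) key lists under the ignore pattern."""
    kept = [k for k in keys if not IGNORE_PATTERN.search(k)]
    dropped = [k for k in keys if IGNORE_PATTERN.search(k)]
    return kept, dropped


kept, dropped = split_keys(checkpoint_keys)
print("kept:   ", kept)
print("dropped:", dropped)
```

Every key that starts with `mtp.` or contains `.mtp.` lands in `dropped`, which is why a calibration run built on the default class never sees the draft head.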

This artifact bypasses the silent drop, runs the full 8-rank GPTQ calibration on a 768-sample corpus against the main routed experts, preserves the MTP block unquantized in BF16, and produces a serving artifact where speculative decoding actually fires.

Architecture & precision

Base model

Property	Value
Total parameters	~284 B (~13 B active per token)
Decoder layers	43
Routed experts / layer	256 (top-K = 6)
Hidden size	4096
Base BF16 size	~543 GB
Quantized size	159 GB (+3 GB vs predecessor for the BF16 MTP block)

Component precisions

Component	Format	Method
Routed experts (256 × 43 layers × 3 projections)	W4A16 INT4, group_size=128, symmetric	GPTQ via llm-compressor, 768 calibration samples
Attention path (`wq_a`, `wq_b`, `wkv`, `wo_a`, `wo_b`, indexer, compressor)	FP8_BLOCK 128×128	Dynamic scales, `scale_fmt=ue8m0`
MTP block (`mtp.0.*`)	BF16	Excluded from quantization, preserved verbatim
HC plumbing (`hc_attn_*`, `hc_ffn_*`, `hc_head_*`), `attn_sink`, `ffn.gate.bias`, indexer/compressor `ape`	FP32	Restored post-save from BF16 source (see Upstream contributions)
`head.weight` (LM head)	FP32	Upcast from BF16 to match sibling artifact's MTP loader path
Embeddings (`embed.weight`, `mtp.0.emb.tok_emb.weight`)	BF16	Source dtype preserved
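To make the "W4A16 INT4, group_size=128, symmetric" row concrete, here is a small sketch of symmetric 4-bit group quantization in plain Python. It uses simple round-to-nearest, not GPTQ's error-compensated update, and both the scale choice (`max_abs / 7`) and the 16-bit scale storage are assumptions for illustration rather than what llm-compressor does internally.

```python
GROUP_SIZE = 128  # matches group_size=128 in the table above


def quantize_group_symmetric_int4(weights):
    """Quantize one group to signed 4-bit integers sharing a single scale.

    Symmetric means there is no zero point: w is approximated by q * scale.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # assumed scale choice
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Reconstruct approximate weights from integers and the group scale."""
    return [v * scale for v in q]


# Synthetic group of 128 weights in [-0.08, 0.08].
group = [((i % 17) - 8) / 100.0 for i in range(GROUP_SIZE)]
q, scale = quantize_group_symmetric_int4(group)
restored = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

# 4 bits per weight plus one scale per group (16-bit scale is an assumption).
bits_per_weight = 4 + 16 / GROUP_SIZE
print(f"scale={scale:.6f} max_err={max_err:.6f} bits/weight={bits_per_weight}")
```

The per-element error is bounded by half a quantization step, which is the noise that GPTQ calibration then tries to compensate for across each weight matrix.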

Hardware validated

Platform	SM	Memory/GPU	Interconnect	TP	Role
8× NVIDIA H200 SXM5	9.0a (sm_90a)	141 GB HBM3e	NVLink	TP=2 (4 replicas)	Calibration + initial benchmarks (`p5en.48xlarge`)
4× NVIDIA RTX PRO 6000 Blackwell Server Edition	12.0 (sm_120)	96 GB GDDR7	PCIe	TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica)	Workstation Blackwell deployment + $/token sweet spot

The same artifact is used on both SKUs, with no weight changes. Both were validated with CUDA graphs on.

Benchmarks

All numbers from the same artifact, vLLM HEAD 50d9dd902 + 4 patches cherry-picked (PRs #43248 / #43288 / #43290 / #43319).

Quality

Sampling: greedy, temperature 0. Methodology disclosed per row.

Benchmark	Setting	This artifact	Predecessor (W4A16-FP8, no MTP)	RedHat (NVFP4-FP8, no MTP)	Delta
GSM8K	8-shot, strict-match	93.71% ± 0.67	95.07% (RTX PRO 6000) / 95.45% (Spark)	91.0% (self-reported)	-1.36 pts vs predecessor on RTX PRO 6000 (about 2× this artifact's SE)
GSM8K	8-shot, flexible-extract	93.63% ± 0.67	95.37% (Spark)	—	-1.74 pts vs predecessor (about 2.6× this artifact's SE)
MMLU	5-shot	86.88% ± 0.27	87.27% (H200)	—	-0.39 pts (about 1.4× this artifact's SE)
MMLU-Pro	5-shot, 12k prompts, custom-extract	71.28% ± 0.40	—	—	sibling NVFP4-FP8-MTP scored 81.13% on B300 — expected gap given W4A16 has more quant noise than NVFP4 on knowledge-heavy harder benchmarks
HumanEval	0-shot pass@1, `--confirm_run_unsafe_code`	84.76% ± 2.82	80.49% (corrected, see predecessor card "Changes")	—	+4.27 pts vs corrected predecessor number
AIME 2024	30 problems, thinking=high	30.0% exact-match ± 8.51	—	—	high-difficulty competition math
chat-smoke (quick / quality / coding)	harness	4/4 · 4/4 · 2/2	4/4 · 4/4 · 2/2	—	match
toolcall15	1 round, 30 points	24/30 (80%)	26/30 (87%)	—	-2 pts — see Honest limitations

Throughput

Measured with `vllm bench serve` on random 256-in / 256-out prompts, MTP spec-decode with num_speculative_tokens=1 (k=1 cap on this build; see Honest limitations), CUDA graphs on.

Hardware	TP	bs=1 output tok/s	bs=1 TPOT median	bs=4 output tok/s	bs=16 output tok/s	MTP acceptance @ bs=1
8× H200	2 (per replica)	88.35	6.02 ms	138.80	367.13	89% calibrated / 70% random
4× RTX PRO 6000 box	TP=2 (per replica, 2 replicas fit)	98.83	8.55 ms	219.53	482.61	71%
4× RTX PRO 6000 box	TP=4 (single replica)	107.32	7.77 ms	221.52	584.04	68%

Per-replica, RTX PRO 6000 wins output throughput at every batch size; H200 still wins per-token TPOT median.

MTP draft-token acceptance per workload

Same artifact, bs=1, k=1.

Workload	Prompts	Accepted / emitted	Acceptance
Random 256-token prompts (200 samples)	random	21024 / 30058	69.94%
Code, raw completion (15 short signature+docstring prompts)	code-raw	1847 / 1988	92.91%
Chat-templated prose (15 prompts)	chat-prose	1946 / 2376	81.90%
Raw natural language (15 continuation prompts)	nl-raw	1745 / 2086	83.65%

Spec-decode wins at low concurrency (single-user interactive). At bs≥4 the verifier is already filling its batch lane, so extra verifier passes add overhead without saving wall-clock — matches the sibling artifact's framing of bs=1 as the headline operating point.
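The relationship between acceptance and tokens produced per verifier pass can be sketched as below. The counts come from the table above; treating each draft token as accepted independently with the same probability is a simplifying assumption, and the result is an upper bound on speedup because it ignores the cost of the draft head and of verification.

```python
def acceptance_rate(accepted, emitted):
    """Fraction of emitted draft tokens that the verifier accepted."""
    return accepted / emitted


def expected_tokens_per_step(acceptance, k=1):
    """Expected tokens per verifier pass with k draft tokens.

    Assumes independent acceptance: draft i only counts if drafts 1..i-1
    were accepted too, so the expectation is 1 + a + a^2 + ... + a^k.
    """
    return sum(acceptance ** i for i in range(k + 1))


# (accepted, emitted) pairs copied from the acceptance table.
workloads = {
    "random": (21024, 30058),
    "code-raw": (1847, 1988),
    "chat-prose": (1946, 2376),
    "nl-raw": (1745, 2086),
}

for name, (accepted, emitted) in workloads.items():
    a = acceptance_rate(accepted, emitted)
    print(f"{name:10s} acceptance={a:.4f} "
          f"tokens/step (k=1)={expected_tokens_per_step(a, k=1):.3f}")

# Measured speedup from the TPOT figures in the TL;DR, for comparison.
measured_speedup = 8.93 / 6.02
print(f"measured bs=1 speedup: {measured_speedup:.2f}x")
```

For the random workload this bound is about 1.70 tokens per pass at k=1, so a measured speedup somewhat below that is what one would expect once draft and verification overhead are paid.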

Cost per output token (node-level)

Boxes priced for cloud-rented hardware. Single-replica numbers measured; multi-replica totals are linear extrapolation.

Box	Replicas	bs=1 total tok/s	bs=16 total tok/s	$/h	$/h per 1000 tok/s at bs=1
`p5en.48xlarge` (8× H200)	4× TP=2	~353	~1468	$98	$278
`g7e.24xlarge` (4× RTX PRO 6000)	2× TP=2	~198	~965	$19.92	$101
`g7e.24xlarge` (4× RTX PRO 6000)	1× TP=4	107.32	584.04	$19.92	$186

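At bs=1 (interactive), RTX PRO 6000 2×TP=2 is ~2.7× cheaper per unit of throughput than H200 4×TP=2. At bs=16 the totals in the table give roughly $67 vs $21 per hour per 1000 tok/s, so the gap widens to ~3.2× rather than narrowing, because the RTX replicas scale slightly better with batch in these measurements. The H200 box still delivers more absolute throughput per node (~1468 vs ~965 tok/s at bs=16), so RTX wins on $/token unless you need that aggregate output from a single node.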
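The last column of the cost table works out to the hourly price divided by throughput in thousands of tokens per second. A short sketch of that arithmetic follows, using the prices and per-replica throughputs listed above; the per-million-token figure is an added convenience derived from the same inputs, not a number from the table.

```python
def price_per_ktok_per_s(price_per_hour, total_tok_per_s):
    """Hourly price divided by throughput in thousands of tokens/second."""
    return price_per_hour / (total_tok_per_s / 1000.0)


def price_per_million_tokens(price_per_hour, total_tok_per_s):
    """Dollars per one million output tokens at sustained throughput."""
    tokens_per_hour = total_tok_per_s * 3600.0
    return price_per_hour / tokens_per_hour * 1_000_000


# (label, $/h, bs=1 total tok/s, bs=16 total tok/s); multi-replica totals
# are linear extrapolations from single-replica measurements.
BOXES = [
    ("p5en.48xlarge 4x TP=2", 98.00, 4 * 88.35, 4 * 367.13),
    ("g7e.24xlarge  2x TP=2", 19.92, 2 * 98.83, 2 * 482.61),
    ("g7e.24xlarge  1x TP=4", 19.92, 107.32, 584.04),
]

for label, price, bs1, bs16 in BOXES:
    print(f"{label}: "
          f"bs=1 ${price_per_ktok_per_s(price, bs1):.0f} per 1000 tok/s, "
          f"bs=16 ${price_per_ktok_per_s(price, bs16):.0f} per 1000 tok/s, "
          f"bs=1 ${price_per_million_tokens(price, bs1):.2f} per 1M tokens")
```

Swapping in your own hourly price or measured throughput is enough to re-derive the comparison for a different provider.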

Quick start

RTX PRO 6000 Blackwell (recommended)

# 1. Bootstrap vLLM (~25 min for source build)
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp
bash scripts/bootstrap_rtx6000pro.sh

# 2. Extra pins
source ~/venv-serve/bin/activate
pip install --quiet "flashinfer-python==0.6.8.post1" "flashinfer-cubin==0.6.8.post1" \
    "numba==0.65.0" "tilelang==0.1.9" "apache-tvm-ffi==0.1.9" "fastsafetensors>=0.2.2"

# 3. Apply patches (resolve the installed vLLM package path once)
VLLM_DIR="$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_v4_forcausal_packed_mapping.py "$VLLM_DIR"
python scripts/patch_mtp_packed_mapping.py        "$VLLM_DIR"
python scripts/patch_nvidia_attn_scale.py         "$VLLM_DIR"
bash   scripts/patch_wo_a_bf16_path.sh             "$VLLM_DIR"

# 4. Download artifact (159 GB)
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
    --local-dir /scratch/weights/w4a16-fp8-mtp-gptq

# 5. One-time dequant (~1.5 min)
python scripts/dequant_compressor.py /scratch/weights/w4a16-fp8-mtp-gptq

# 6. Serve TP=2
CUDA_VISIBLE_DEVICES=0,1 bash scripts/serve_rtx6000pro.sh \
    /scratch/weights/w4a16-fp8-mtp-gptq 8000 2

H200

vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
    --tensor-parallel-size 2 \
    --kv-cache-dtype fp8 --block-size 256 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.80 \
    --no-enable-prefix-caching \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
    --trust-remote-code
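Once the server is up, a quick smoke test can go through vLLM's OpenAI-compatible `/v1/completions` endpoint. The sketch below uses only the standard library; the base URL assumes the default port 8000 on localhost, the model name assumes the server was started with the repo id as shown above (a local path would be used instead if you served from disk), and the helper names are hypothetical.

```python
import json
import urllib.request

MODEL_NAME = "canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=64):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy, matching the quality benchmarks
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a live server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example call, left commented out because it requires a running server:
# print(complete("def fibonacci(n):"))
```

Short code-style prompts are a reasonable first check, since that is the workload where the draft head is accepted most often.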

Quantization recipe

Property	Value
Dataset	`HuggingFaceH4/ultrachat_200k` (V4 chat template)
Samples	768
Max sequence length	512
Per-rank batch size	4
Calibration hardware	8× NVIDIA H200 (`p5en.48xlarge`)
Walltime	~15.4h (15.09h oneshot + ~16 min save)
Per-subgraph cadence	~20 min/subgraph × 44 subgraphs (43 MoE + 1 MTP no-op)

The calibration recipe is identical to the W4A16-FP8 predecessor's, with one change: the modeling class is patched to remove the mtp.* pattern from _keys_to_ignore_on_load_unexpected before from_pretrained is called, so the MTP block survives the load and is written back to the artifact in BF16.
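The one-line change described above could look roughly like the following. This is a sketch, not the repository's actual patch: the helper name is hypothetical, and a stand-in class replaces the real `DeepseekV4PreTrainedModel` so the snippet runs without transformers installed. It only mirrors the ignore-list edit; anything else the real pipeline needs in order to hold the MTP weights is out of scope here.

```python
import re


def drop_mtp_ignore_patterns(model_cls, probe_key="mtp.0.emb.tok_emb.weight"):
    """Remove ignore patterns that would match an MTP tensor name.

    Returns the patterns that were removed so the caller can log them.
    """
    patterns = list(
        getattr(model_cls, "_keys_to_ignore_on_load_unexpected", None) or []
    )
    removed = [p for p in patterns if re.search(p, probe_key)]
    model_cls._keys_to_ignore_on_load_unexpected = [
        p for p in patterns if p not in removed
    ]
    return removed


# Stand-in for the real modeling class (assumption for this sketch).
class FakeDeepseekV4PreTrainedModel:
    _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]


removed = drop_mtp_ignore_patterns(FakeDeepseekV4PreTrainedModel)
print("removed patterns:", removed)
print("remaining:", FakeDeepseekV4PreTrainedModel._keys_to_ignore_on_load_unexpected)
```

In a real run the same call would be made on the imported modeling class before `from_pretrained`, so that MTP keys are no longer filtered without a trace.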

vLLM build

Common patches (all platforms)

PR	Purpose	Status
`vllm-project/vllm#43248`	`bool()` wrap on `is_static_input_scheme`	open
`vllm-project/vllm#43288`	`.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up	open
`vllm-project/vllm#43290`	`weight_scale_inv`-or-`weight_scale` fallback (attention)	open
`vllm-project/vllm#43319`	MTP-quant-detect from safetensors header + BF16 `wo_a` fallback path	open

RTX PRO 6000 Blackwell (SM 12.0) only

Patch	Purpose
`packed_modules_mapping` on `DeepseekV4ForCausalLM` + `DeepSeekV4MTP`	Required as of `ds4-sm120-experimental@abad5dc71`
BF16 `wo_a` path for MTP block	Static `weight.dtype == bfloat16` check (dynamo-safe)
Compressor/indexer FP8 → BF16 dequant preprocess	One-time, ~1.5 min
`--disable-custom-all-reduce`	No NVLink between RTX PRO 6000 boards
CMakeLists `USE_SABI 3.11` removal	For Python 3.10

H200 deployments need only the four common patches.

Honest limitations

k=1 cap on spec-decode — current vLLM build limits num_speculative_tokens to 1 due to DeepGemm kernel assertion next_n == 1 or next_n == 2 in smxx_fp8_fp4_paged_mqa_logits.hpp:233. vLLM passes next_n = num_speculative_tokens + 1, so practical k is 1. The FLASHINFER_MLA_SPARSE attention backend hits the same kernel-side assertion. With the assertion relaxed, expect bs=1 speedup to rise from 1.49× to ~1.85× (matching sibling NVFP4 artifact's k=2 published number).
toolcall15 -2 pts vs predecessor — model-routing regressions on chain-completion (TC-07 stopped mid-chain to ask a clarifying question) and multi-tool extraction (TC-06 returned both translations as content text instead of routing two translate calls). Quality-wise the model completes the underlying intent; the harness scores tool-call-protocol fidelity, not task completion. Not a parser issue (confirmed by replay through --tool-call-parser deepseek_v4).
GSM8K -1.36 pts vs predecessor's 8-shot strict-match (93.71% vs 95.07%): about two of this artifact's standard errors (±0.67), so small but not within one SE. Likely calibration-set sensitivity rather than recipe drift (recipe is identical, hardware differs).
NVFP4 native kernels on RTX PRO 6000 not auto-selected — even though csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu exists in upstream vLLM, the backend selector doesn't pick it (vllm-project/vllm#31085). Until that lands, the sibling NVFP4 artifact on this hardware would route through Marlin too. This artifact's W4A16 path is the tested choice for RTX PRO 6000.
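The k=1 cap in the first limitation follows directly from the quoted assertion. A tiny sketch of that arithmetic, assuming (as stated above) that vLLM passes `next_n = num_speculative_tokens + 1`; the function names are hypothetical:

```python
# Values permitted by the quoted kernel assertion: next_n == 1 or next_n == 2.
ALLOWED_NEXT_N = (1, 2)


def is_supported_k(num_speculative_tokens, allowed_next_n=ALLOWED_NEXT_N):
    """True if k draft tokens map to a next_n the kernel accepts."""
    return (num_speculative_tokens + 1) in allowed_next_n


def max_supported_k(allowed_next_n=ALLOWED_NEXT_N):
    """Largest k such that next_n = k + 1 still passes the assertion."""
    return max(allowed_next_n) - 1


for k in (1, 2, 3):
    print(f"k={k}: supported={is_supported_k(k)}")
print("max k:", max_supported_k())
```

If a future kernel accepted `next_n == 3`, the same arithmetic would allow k=2.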

Reproduction

Full pipeline at canada-quant/dsv4-flash-w4a16-fp8-mtp. From a fresh 8× H200 box:

# Phase 0 — bootstrap (venv-calib + venv-serve + vendor + apply patches)
bash scripts/bootstrap_p5en_h200.sh

# Phase 1 — download upstream + dequant to BF16-MTP source (~30 min, ~660 GB)
bash scripts/phase1_dequant.sh

# Phase 2 — GPTQ calibration (8 ranks, ~15h wall)
bash scripts/run_phase2.sh

# Phase 3 — postprocess (rename + config patch + FP32 restore + MTP aliases)
bash scripts/postprocess_phase2.sh

# Phase 4 — verify
python scripts/verify_option_y.py /scratch/weights/w4a16-fp8-mtp-gptq

# Phase 5 — serve (see Quick start above for serve command)

Upstream contributions filed during this work

Contribution	Description	Status
transformers: `save_pretrained` silent FP32 → BF16 downcast	417 tensors specified as FP32 in DeepSeek's release spec (HC plumbing, gate bias, attn_sink, indexer/compressor `ape`) are silently written as BF16 by `save_pretrained` when model `torch_dtype` is BF16. Workaround: postprocess restore from BF16 source via `scripts/fixup_artifact.py`. Upstream filing pending	local
vLLM: MTP loader silently skips top-level `head.weight` + `embed.weight`	`DeepSeekV4MTP.load_weights` calls `name.replace("mtp.0.", "")` which no-ops on non-`mtp.0.` keys; `get_spec_layer_idx` returns None → loop skips. `head.weight` and `embed.weight` never reach `shared_head.head` / `embed_tokens` → uninitialized → 0% MTP acceptance with no load-time error. Workaround: postprocess injects `mtp.0.head.weight` and `mtp.0.emb.tok_emb.weight` as duplicates. Upstream filing pending	local
vLLM: DeepGemm `paged_mqa_logits` asserts on `num_speculative_tokens > 1`	`smxx_fp8_fp4_paged_mqa_logits.hpp:233` enforces `next_n == 1 or next_n == 2`. With `next_n = k+1`, practical k cap is 1. Caps spec-decode speedup at 1.49× vs sibling's published 2.03× at k=2	upstream (DeepGemm), filing pending
`vllm-project/vllm#43248`	`bool()` wrap on `is_static_input_scheme`	open
`vllm-project/vllm#43288`	`scale_fmt` defensive `.get()` + BF16 `getattr` wrap	open
`vllm-project/vllm#43290`	`weight_scale_inv`-or-`weight_scale` fallback	open
`vllm-project/vllm#43319`	MTP-quant-detect from safetensors header + BF16 `wo_a` fallback path	open
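The alias-injection workaround for the MTP loader issue can be pictured as a rename-and-duplicate pass over the tensor name map. The sketch below works on a plain dict with placeholder values so it runs without safetensors or torch; the alias pairs follow the names given in the table, while the function itself is hypothetical and not the repository's postprocess script.

```python
# Source name -> MTP-prefixed alias, following the workaround described above.
MTP_ALIASES = {
    "head.weight": "mtp.0.head.weight",
    "embed.weight": "mtp.0.emb.tok_emb.weight",
}


def inject_mtp_aliases(tensors, aliases=MTP_ALIASES):
    """Return a copy of the name->tensor map with MTP aliases added.

    Existing alias entries are left untouched; a missing source tensor
    is reported instead of being skipped silently.
    """
    out = dict(tensors)
    for src, dst in aliases.items():
        if src not in out:
            raise KeyError(f"missing source tensor: {src}")
        out.setdefault(dst, out[src])
    return out


# Placeholder strings stand in for real tensors in this sketch.
state = {"head.weight": "HEAD", "embed.weight": "EMBED"}
patched = inject_mtp_aliases(state)
for name in sorted(patched):
    print(name, "->", patched[name])
```

Raising on a missing source is deliberate: the failure mode being worked around is a skip that produces no error at load time.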

Changes

Date	Change
2026-05-22	Initial release on H200. GSM8K 93.71% strict, MMLU 86.88%, HumanEval 84.76%, MTP acceptance 89% on calibrated workload / 70% on random prompts
2026-05-24	RTX PRO 6000 Blackwell (SM 12.0) added. TP=2 and TP=4 both validated, chat-smoke 4/4 PASS, MTP acceptance 68-72%, MTP-on per-replica throughput 98.83 tok/s @ TP=2 / 107.32 @ TP=4. Per-replica throughput beats H200 at every batch size. `vllm-project/vllm#41511` (Marlin TP > 2 bug) did not fire on this build

Files in the artifact

4 sharded model-*.safetensors files + model.safetensors.index.json (159 GB total)
config.json — vLLM-compatible quantization_config with MTP block excluded
tokenizer.json, tokenizer_config.json, generation_config.json, chat_template.jinja — upstream DSV4-Flash
recipe.yaml — the llm-compressor GPTQ recipe
README.md — this file
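A quick way to confirm that a downloaded copy still contains the MTP block is to inspect `model.safetensors.index.json`, whose `weight_map` maps tensor names to shard files. The sketch below writes a tiny stand-in index to a temporary directory so it runs anywhere; the shard file names and the tensor names in the demo are illustrative, not copied from the artifact.

```python
import json
import tempfile
from pathlib import Path


def find_mtp_tensors(index_path):
    """Return (mtp_tensor_names, shard_files) from a safetensors index."""
    index = json.loads(Path(index_path).read_text())
    weight_map = index["weight_map"]
    mtp_keys = sorted(k for k in weight_map if k.startswith("mtp."))
    shards = sorted({weight_map[k] for k in mtp_keys})
    return mtp_keys, shards


# Demo on a stand-in index (illustrative names only).
demo_index = {
    "metadata": {},
    "weight_map": {
        "embed.weight": "model-00001-of-00004.safetensors",
        "head.weight": "model-00004-of-00004.safetensors",
        "mtp.0.emb.tok_emb.weight": "model-00004-of-00004.safetensors",
        "mtp.0.head.weight": "model-00004-of-00004.safetensors",
    },
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.safetensors.index.json"
    path.write_text(json.dumps(demo_index))
    mtp_keys, shards = find_mtp_tensors(path)

print(f"{len(mtp_keys)} MTP tensors in shards: {shards}")
```

Pointed at the real index file, an empty result would indicate that the MTP block was dropped somewhere along the way.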

Citation

@misc{canada-quant-dsv4-flash-w4a16-fp8-mtp-2026,
  title  = {DeepSeek-V4-Flash W4A16-FP8 with BF16 MTP retained for vLLM speculative decoding},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash. Review at the upstream repo before commercial deployment.

Acknowledgments

DeepSeek for the base model + MTP architecture + inference reference.
jasl (jasl/vllm and jasl/vllm-ds4-sm120-harness) for the vLLM build pins (ds4-sm120-experimental for H200; ds4-sm120-preview-dev for RTX PRO 6000 SM 12.0) and the benchmark harness.
canada-quant/DeepSeek-V4-Flash-W4A16-FP8 (predecessor) for the proven recipe topology this artifact extends with MTP.
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP (sibling) for the alias-injection pattern and MTP acceptance methodology.
vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.
