- canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
W4A16 INT4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained — the first DeepSeek-V4-Flash quantization that ships a working MTP block, giving ~1.5× speculative decoding (spec-decode) speedup at bs=1 with no quality cost. Extends the W4A16-FP8 predecessor by patching the transformers calibration path so the MTP block survives the load.
TL;DR
| Recommended hardware | RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica) — both validated · or 8× H200 TP=2 |
| Quality | GSM8K 93.71% (8-shot strict); HumanEval 84.76% pass@1; MMLU 86.88% |
| Throughput | RTX PRO 6000 98.83 @ TP=2 / 107.32 @ TP=4 at bs=1; 88.35 on H200 TP=2 |
| MTP acceptance | 89% calibrated workload / 70% on random prompts at bs=1 k=1 |
| Spec-decode speedup | 1.49× at bs=1, k=1 (TPOT 6.02 ms vs 8.93 ms, same artifact) |
| Differentiator | First V4-Flash W4A16 quant where MTP survives the calibration load; transformers 5.8.1 silently strips MTP keys by default |
Family / related artifacts
| Repo | Role | Relation to this artifact |
|---|---|---|
canada-quant/DeepSeek-V4-Flash-W4A16-FP8 |
predecessor | Same W4A16 + FP8 recipe; MTP dropped at load (the bug this artifact fixes) |
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP |
sibling | Same MTP-retention pattern; NVFP4 routed experts instead of W4A16 (Blackwell-native) |
canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP |
larger sibling | V4-Pro at NVFP4 + MTP, B300-only deployment |
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 |
upstream reference | Original NVFP4 recipe (no MTP — same silent-drop bug) |
Why this exists
The W4A16-FP8 predecessor and RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 both drop the MTP block because transformers 5.8.1's DeepseekV4PreTrainedModel declares:
_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]
which silently filters every mtp.* tensor at from_pretrained time — without warning, without error. Calibration pipelines that go through from_pretrained produce quantized main weights paired with an absent MTP block; serving falls back to plain decode, losing the ~1.5–2× spec-decode speedup that V4-Flash's architecture provides.
This artifact bypasses the silent drop, runs the full 8-rank GPTQ calibration on a 768-sample corpus against the main routed experts, preserves the MTP block unquantized in BF16, and produces a serving artifact where speculative decoding actually fires.
Architecture & precision
Base model
| Property | Value |
|---|---|
| Total parameters | |
| Decoder layers | 43 |
| Routed experts / layer | 256 (top-K = 6) |
| Hidden size | 4096 |
| Base BF16 size | ~543 GB |
| Quantized size | 159 GB (+3 GB vs predecessor for the BF16 MTP block) |
Component precisions
| Component | Format | Method |
|---|---|---|
| Routed experts (256 × 43 layers × 3 projections) | W4A16 INT4, group_size=128, symmetric | GPTQ via llm-compressor, 768 calibration samples |
Attention path (wq_a, wq_b, wkv, wo_a, wo_b, indexer, compressor) |
FP8_BLOCK 128×128 | Dynamic scales, scale_fmt=ue8m0 |
MTP block (mtp.0.*) |
BF16 | Excluded from quantization, preserved verbatim |
HC plumbing (hc_attn_*, hc_ffn_*, hc_head_*), attn_sink, ffn.gate.bias, indexer/compressor ape |
FP32 | Restored post-save from BF16 source (see Upstream contributions) |
head.weight (LM head) |
FP32 | Upcast from BF16 to match sibling artifact's MTP loader path |
Embeddings (embed.weight, mtp.0.emb.tok_emb.weight) |
BF16 | Source dtype preserved |
Hardware validated
| Platform | SM | HBM/GPU | Interconnect | TP | Role |
|---|---|---|---|---|---|
| 8× NVIDIA H200 SXM5 | 9.0a | 141 GB HBM3e | NVLink | 2 (4× replicas) | Calibration + initial benchmarks (p5en.48xlarge) |
| 4× NVIDIA RTX PRO 6000 Blackwell Server Edition | 12.0, sm_120 | 96 GB HBM | PCIe | TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica) | Workstation Blackwell deployment + $/token sweet spot |
Same artifact, no weight changes between SKUs. Both validated cuda graphs ON.
Benchmarks
All numbers from the same artifact, vLLM HEAD 50d9dd902 + 4 patches cherry-picked (PRs #43248 / #43288 / #43290 / #43319).
Quality
Sampling: greedy, temperature 0. Methodology disclosed per row.
| Benchmark | Setting | This artifact | Predecessor (W4A16-FP8, no MTP) | RedHat (NVFP4-FP8, no MTP) | Delta |
|---|---|---|---|---|---|
| GSM8K | 8-shot, strict-match | 93.71% ± 0.67 | 95.07% (RTX PRO 6000) / 95.45% (Spark) | 91.0% (self-reported) | -1.28 pts vs predecessor (within 1 SE) |
| GSM8K | 8-shot, flexible-extract | 93.63% ± 0.67 | 95.37% (Spark) | — | within SE |
| MMLU | 5-shot | 86.88% ± 0.27 | 87.27% (H200) | — | -0.39 pts (within SE) |
| MMLU-Pro | 5-shot, 12k prompts, custom-extract | 71.28% ± 0.40 | — | — | sibling NVFP4-FP8-MTP scored 81.13% on B300 — expected gap given W4A16 has more quant noise than NVFP4 on knowledge-heavy harder benchmarks |
| HumanEval | 0-shot pass@1, --confirm_run_unsafe_code |
84.76% ± 2.82 | 80.49% (corrected, see predecessor card "Changes") | — | +4.27 pts vs corrected predecessor number |
| AIME 2024 | 30 problems, thinking=high | 30.0% exact-match ± 8.51 | — | — | high-difficulty competition math |
| chat-smoke (quick / quality / coding) | harness | 4/4 · 4/4 · 2/2 | 4/4 · 4/4 · 2/2 | — | match |
| toolcall15 | 1 round, 30 points | 24/30 (80%) | 26/30 (87%) | — | -2 pts — see Honest limitations |
Throughput
vllm bench serve random 256-in / 256-out, MTP-spec num_speculative_tokens=1 (k=1 cap on this build — see Honest limitations), cuda graphs ON.
| Hardware | TP | bs=1 output tok/s | bs=1 TPOT median | bs=4 output tok/s | bs=16 output tok/s | MTP acceptance @ bs=1 |
|---|---|---|---|---|---|---|
| 8× H200 | 2 (per replica) | 88.35 | 6.02 ms | 138.80 | 367.13 | 89% calibrated / 70% random |
| 4× RTX PRO 6000 box | TP=2 (per replica, 2 replicas fit) | 98.83 | 8.55 ms | 219.53 | 482.61 | 71% |
| 4× RTX PRO 6000 box | TP=4 (single replica) | 107.32 | 7.77 ms | 221.52 | 584.04 | 68% |
Per-replica, RTX PRO 6000 wins output throughput at every batch size; H200 still wins per-token TPOT median.
MTP draft-token acceptance per workload
Same artifact, bs=1, k=1.
| Workload | Prompts | Accepted / emitted | Acceptance |
|---|---|---|---|
| Random 256-token prompts (200 samples) | random | 21024 / 30058 | 69.94% |
| Code, raw completion (15 short signature+docstring prompts) | code-raw | 1847 / 1988 | 92.91% |
| Chat-templated prose (15 prompts) | chat-prose | 1946 / 2376 | 81.90% |
| Raw natural language (15 continuation prompts) | nl-raw | 1745 / 2086 | 83.65% |
Spec-decode wins at low concurrency (single-user interactive). At bs≥4 the verifier is already filling its batch lane, so extra verifier passes add overhead without saving wall-clock — matches the sibling artifact's framing of bs=1 as the headline operating point.
Cost per output token (node-level)
Boxes priced for cloud-rented hardware. Single-replica numbers measured; multi-replica totals are linear extrapolation.
| Box | Replicas | bs=1 total tok/s | bs=16 total tok/s | $/h | $/(1000 tok/h) at bs=1 |
|---|---|---|---|---|---|
p5en.48xlarge (8× H200) |
4× TP=2 | ~353 | ~1468 | $98 | $278 |
g7e.24xlarge (4× RTX PRO 6000) |
2× TP=2 | ~198 | ~965 | $19.92 | $101 |
g7e.24xlarge (4× RTX PRO 6000) |
1× TP=4 | 107.32 | 584.04 | $19.92 | $186 |
At bs=1 (interactive), RTX PRO 6000 2×TP=2 is ~2.7× cheaper than H200 4×TP=2. At bs=16 the gap narrows because H200's per-replica throughput scales better with batch — H200 wins absolute throughput when you can fill it; RTX wins on $/token unless you genuinely need >1500 tok/s aggregate output.
Quick start
RTX PRO 6000 Blackwell (recommended)
# 1. Bootstrap vLLM (~25 min for source build)
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp
bash scripts/bootstrap_rtx6000pro.sh
# 2. Extra pins
source ~/venv-serve/bin/activate
pip install --quiet "flashinfer-python==0.6.8.post1" "flashinfer-cubin==0.6.8.post1" \
"numba==0.65.0" "tilelang==0.1.9" "apache-tvm-ffi==0.1.9" "fastsafetensors>=0.2.2"
# 3. Apply patches
python scripts/patch_v4_forcausal_packed_mapping.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_mtp_packed_mapping.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
python scripts/patch_nvidia_attn_scale.py "$(python -c 'import vllm; print(vllm.__path__[0])')"
bash scripts/patch_wo_a_bf16_path.sh "$(python -c 'import vllm; print(vllm.__path__[0])')"
# 4. Download artifact (159 GiB)
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
--local-dir /scratch/weights/w4a16-fp8-mtp-gptq
# 5. One-time dequant (~1.5 min)
python scripts/dequant_compressor.py /scratch/weights/w4a16-fp8-mtp-gptq
# 6. Serve TP=2
CUDA_VISIBLE_DEVICES=0,1 bash scripts/serve_rtx6000pro.sh \
/scratch/weights/w4a16-fp8-mtp-gptq 8000 2
H200
vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP \
--tensor-parallel-size 2 \
--kv-cache-dtype fp8 --block-size 256 \
--max-model-len 4096 \
--gpu-memory-utilization 0.80 \
--no-enable-prefix-caching \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 --enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--trust-remote-code
Quantization recipe
| Property | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (V4 chat template) |
| Samples | 768 |
| Max sequence length | 512 |
| Per-rank batch size | 4 |
| Calibration hardware | 8× NVIDIA H200 (p5en.48xlarge) |
| Walltime | ~15.4h (15.09h oneshot + ~16 min save) |
| Per-subgraph cadence | ~20 min/subgraph × 44 subgraphs (43 MoE + 1 MTP no-op) |
Calibration recipe identical to the W4A16-FP8 predecessor with one change: the modeling class is patched to remove mtp.* from _keys_to_ignore_on_load_unexpected before from_pretrained, so the MTP block survives the load and is written back to the artifact at BF16.
vLLM build
Common patches (all platforms)
| PR | Purpose | Status |
|---|---|---|
vllm-project/vllm#43248 |
bool() wrap on is_static_input_scheme |
open |
vllm-project/vllm#43288 |
.get("scale_fmt", "ue8m0") on missing key + BF16 getattr follow-up |
open |
vllm-project/vllm#43290 |
weight_scale_inv-or-weight_scale fallback (attention) |
open |
vllm-project/vllm#43319 |
MTP-quant-detect from safetensors header + BF16 wo_a fallback path |
open |
RTX PRO 6000 Blackwell (SM 12.0) only
| Patch | Purpose |
|---|---|
packed_modules_mapping on DeepseekV4ForCausalLM + DeepSeekV4MTP |
Required as of ds4-sm120-experimental@abad5dc71 |
BF16 wo_a path for MTP block |
Static weight.dtype == bfloat16 check (dynamo-safe) |
| Compressor/indexer FP8 → BF16 dequant preprocess | One-time, ~1.5 min |
--disable-custom-all-reduce |
No NVLink between RTX PRO 6000 boards |
CMakeLists USE_SABI 3.11 removal |
For Python 3.10 |
H200 deployments need only the four common patches.
Honest limitations
- k=1 cap on spec-decode — current vLLM build limits
num_speculative_tokensto 1 due to DeepGemm kernel assertionnext_n == 1 or next_n == 2insmxx_fp8_fp4_paged_mqa_logits.hpp:233. vLLM passesnext_n = num_speculative_tokens + 1, so practical k is 1. TheFLASHINFER_MLA_SPARSEattention backend hits the same kernel-side assertion. With the assertion relaxed, expect bs=1 speedup to rise from 1.49× to ~1.85× (matching sibling NVFP4 artifact's k=2 published number). - toolcall15 -2 pts vs predecessor — model-routing regressions on chain-completion (TC-07 stopped mid-chain to ask a clarifying question) and multi-tool extraction (TC-06 returned both translations as content text instead of routing two
translatecalls). Quality-wise the model completes the underlying intent; the harness scores tool-call-protocol fidelity, not task completion. Not a parser issue (confirmed by replay through--tool-call-parser deepseek_v4). - GSM8K -1.3 pts vs predecessor's 8-shot strict-match — within one SE, but technically below. Likely calibration-set sensitivity rather than recipe drift (recipe is identical, hardware differs).
- NVFP4 native kernels on RTX PRO 6000 not auto-selected — even though
csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cuexists in upstream vLLM, the backend selector doesn't pick it (vllm-project/vllm#31085). Until that lands, the sibling NVFP4 artifact on this hardware would route through Marlin too. This artifact's W4A16 path is the tested choice for RTX PRO 6000.
Reproduction
Full pipeline at canada-quant/dsv4-flash-w4a16-fp8-mtp. From a fresh 8× H200 box:
# Phase 0 — bootstrap (venv-calib + venv-serve + vendor + apply patches)
bash scripts/bootstrap_p5en_h200.sh
# Phase 1 — download upstream + dequant to BF16-MTP source (~30 min, ~660 GB)
bash scripts/phase1_dequant.sh
# Phase 2 — GPTQ calibration (8 ranks, ~15h wall)
bash scripts/run_phase2.sh
# Phase 3 — postprocess (rename + config patch + FP32 restore + MTP aliases)
bash scripts/postprocess_phase2.sh
# Phase 4 — verify
python scripts/verify_option_y.py /scratch/weights/w4a16-fp8-mtp-gptq
# Phase 5 — serve (see Quick start above for serve command)
Upstream contributions filed during this work
| Contribution | Description | Status |
|---|---|---|
transformers — save_pretrained silent FP32 → BF16 downcast |
417 tensors specified as FP32 in DeepSeek's release spec (HC plumbing, gate bias, attn_sink, indexer/compressor ape) are silently written as BF16 by save_pretrained when model torch_dtype is BF16. Workaround: postprocess restore from BF16 source via scripts/fixup_artifact.py. Upstream filing pending |
local |
vLLM — MTP loader silently skips top-level head.weight + embed.weight |
DeepSeekV4MTP.load_weights calls name.replace("mtp.0.", "") which no-ops on non-mtp.0.* keys; get_spec_layer_idx returns None → loop skips. head.weight and embed.weight never reach shared_head.head / embed_tokens → uninitialized → 0% MTP acceptance with no load-time error. Workaround: postprocess injects mtp.0.head.weight and mtp.0.emb.tok_emb.weight as duplicates. Upstream filing pending |
local |
vLLM — DeepGemm paged_mqa_logits asserts on num_speculative_tokens > 1 |
smxx_fp8_fp4_paged_mqa_logits.hpp:233 enforces next_n == 1 or next_n == 2. With next_n = k+1, practical k cap is 1. Caps spec-decode speedup at 1.49× vs sibling's published 2.03× at k=2 |
upstream (DeepGemm) — filing pending |
vllm-project/vllm#43248 |
bool() wrap on is_static_input_scheme |
open |
vllm-project/vllm#43288 |
scale_fmt defensive .get() + BF16 getattr wrap |
open |
vllm-project/vllm#43290 |
weight_scale_inv-or-weight_scale fallback |
open |
vllm-project/vllm#43319 |
MTP-quant-detect from safetensors header + BF16 wo_a fallback path |
open |
Changes
| Date | Change |
|---|---|
| 2026-05-22 | Initial release on H200. GSM8K 93.71% strict, MMLU 86.88%, HumanEval 84.76%, MTP acceptance 89% on calibrated workload / 70% on random prompts |
| 2026-05-24 | RTX PRO 6000 Blackwell (SM 12.0) added. TP=2 and TP=4 both validated, chat-smoke 4/4 PASS, MTP acceptance 68-72%, MTP-on per-replica throughput 98.83 tok/s @ TP=2 / 107.32 @ TP=4. Per-replica throughput beats H200 at every batch size. vllm-project/vllm#41511 (Marlin TP > 2 bug) did not fire on this build |
Files in the artifact
- 4 sharded
model-*.safetensorsfiles +model.safetensors.index.json(159 GB total) config.json— vLLM-compatible quantization_config with MTP block excludedtokenizer.json,tokenizer_config.json,generation_config.json,chat_template.jinja— upstream DSV4-Flashrecipe.yaml— the llm-compressor GPTQ recipeREADME.md— this file
Citation
@misc{canada-quant-dsv4-flash-w4a16-fp8-mtp-2026,
title = {DeepSeek-V4-Flash W4A16-FP8 with BF16 MTP retained for vLLM speculative decoding},
author = {Canada Quant},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP}
}
License
MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash. Review at the upstream repo before commercial deployment.
Acknowledgments
- DeepSeek for the base model + MTP architecture + inference reference.
- jasl (
jasl/vllmandjasl/vllm-ds4-sm120-harness) for the vLLM build pins (ds4-sm120-experimentalfor H200;ds4-sm120-preview-devfor RTX PRO 6000 SM 12.0) and the benchmark harness. canada-quant/DeepSeek-V4-Flash-W4A16-FP8(predecessor) for the proven recipe topology this artifact extends with MTP.canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP(sibling) for the alias-injection pattern and MTP acceptance methodology.- vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.
- Downloads last month
- 78
Model tree for canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
Base model
deepseek-ai/DeepSeek-V4-Flash