canada-quant/DeepSeek-V4-Flash-W4A16-FP8

Mixed-precision quantization of deepseek-ai/DeepSeek-V4-Flash — W4A16 INT4 on routed experts + FP8 block 128×128 on attention — that loads cleanly on Hopper datacenter GPUs and on consumer-grade Blackwell. Recipe topology mirrors RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8; routed-expert format is W4A16 (Marlin) instead of NVFP4 for compatibility with SM 9.x / SM 12.x kernels.

TL;DR

Recommended hardware 2× DGX Spark or 2× RTX PRO 6000, TP=2
Quality GSM8K 95.07–95.45% strict (8-shot); HumanEval pass@1 78.05–80.49% (strict, --confirm_run_unsafe_code)
Throughput 47–48 output tok/s @ bs=1 on RTX PRO 6000 TP=2 (TPOT 20.8 ms); 14–17 tok/s on DGX Spark TP=2
Differentiator Only quant of V4-Flash that serves on SM 9.x and SM 12.x; baseline for the W4A16-FP8-MTP successor

Family / related artifacts

Repo Role Relation to this artifact
canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP successor Same recipe + BF16 MTP retained for 1.49× spec-decode speedup at bs=1
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP sibling NVFP4 routed experts (Blackwell-native), MTP retained
canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP larger sibling V4-Pro at NVFP4 with MTP, B300-only deployment
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 upstream reference Original mixed-precision topology (NVFP4 experts + FP8 attention) we adapted to W4A16

Why this exists

DeepSeek-V4-Flash launched April 24, 2026 (284 B total / 13 B active, hybrid CSA + HCA attention, hash-routed experts). At release, no merged path through transformers + llm-compressor + vLLM existed for V4 quantization on Hopper or on SM 12.x Blackwell. RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 covered Blackwell datacenter (B100/B200, SM 10.x) via NVFP4 tcgen05 kernels, and Intel/DeepSeek-V4-Flash-W4A16-AutoRound covered W4A16 but explicitly excluded vLLM and SGLang. This artifact fills the gap: W4A16 GPTQ routed experts + FP8 block attention that serves on vLLM at TP=2 on H200 (Hopper SM 9.0a), DGX Spark (Blackwell SM 12.1a), and RTX PRO 6000 (Blackwell SM 12.0) — same weights, three SKUs.

Architecture & precision

Base model

Property Value
Total parameters 284 B (13 B active per token)
Decoder layers 43
Routed experts / layer 256 (top-K = 6)
Hidden size 4096
Base BF16 size ~543 GB
Quantized size ~143 GB
Compression ratio ~3.8×

Component precisions

Component Format Method
Routed experts (256 × 43 layers) W4A16 INT4, group_size=128, symmetric GPTQ via llm-compressor, dampening_frac=0.1
Attention path (q_a/q_b/kv/o_a/o_b, compressor, indexer) FP8_BLOCK 128×128 Dynamic, data-free
Shared experts BF16 Excluded (kylesayrs PR #41276 incompatibility)
Embeddings, lm_head, hc_head BF16 Excluded
MTP block dropped at load Removed by transformers _keys_to_ignore_on_load_unexpected — see W4A16-FP8-MTP successor for the retention recipe

Hardware validated

Platform SM HBM/GPU Interconnect TP Role
8× NVIDIA H200 SXM5 9.0a 141 GB HBM3e NVLink 2 (4× replicas) Calibration + harness baseline
2× NVIDIA DGX Spark (GB10) 12.1a 128 GB unified NVLink-C2C 2 Long-context production (1M-token graphs-ON)
2× NVIDIA RTX PRO 6000 Blackwell Server Edition 12.0, sm_120 96 GB HBM PCIe 2 Workstation Blackwell deployment

All three SKUs serve cuda graphs ON (no --enforce-eager). Same artifact, no weight changes between SKUs — only vLLM build flags and a few env vars differ.

Benchmarks

Quality

Sampling: greedy, temperature 0. lm-eval-harness via OpenAI-compatible backend pointing at the local vLLM. Methodology disclosed per row.

Benchmark Setting 8× H200 (older vLLM build) 2× DGX Spark TP=2 2× RTX PRO 6000 TP=2
GSM8K 8-shot, flexible-extract 92.87% ± 0.71 95.37% ± 0.58 94.99% ± 0.60
GSM8K 8-shot, strict-match 42.61%¹ → see note 95.45% ± 0.57 95.07% ± 0.60
MMLU 5-shot 87.27% ± 0.27 (in flight) (pending)
HumanEval 0-shot pass@1 (instruct, --confirm_run_unsafe_code) 54.27% ± 3.9² → 80.49% ± 3.10³ 80.49% ± 3.10 78.05% ± 3.24
chat-smoke (quick / quality / coding) harness 4/4 · 4/4 · 2/2 4/4 · 4/4 · 2/2 4/4 · 4/4 · 2/2
toolcall15 1 round, 30 points 26/30 (87%) 41/45 (92%)⁴ 27/30 (90%)
NIAH long-context (75K → 500K single) retrieval 4/4 retrieval 5/5 retrieval
NIAH 256K × 2 concurrent retrieval fix landed in jasl@e734ace5 4/4 (377 s)

¹ The H200 GSM8K strict-match of 42.61% was a chat-format extraction artifact, not a quality regression. The flexible-extract number (92.87%) is the comparable figure. Cross-checked on DGX Spark / RTX PRO 6000 with corrected extraction (95.07–95.45%).

² ³ HumanEval pass@1 on H200 was initially reported as 54.27% under regex-based extraction. The harness was later corrected to use --confirm_run_unsafe_code (executes generated code), which raised the same-artifact score to 80.49%. The Spark and RTX PRO 6000 runs use the corrected methodology; the H200 number is the same artifact re-scored. See Changes for the dated correction.

⁴ Spark toolcall15 is scored across 3 thinking modes (45 cases); H200 / RTX PRO 6000 are single-round (30 cases). Scores normalized to %.

Comparison caveat: the H200 numbers come from an older vLLM build (harness HEAD 85aca32, jasl/vllm@428e08e). Spark and RTX PRO 6000 numbers are on today's ds4-sm120-experimental tip. The valid same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 ↔ Blackwell deltas are informational.

Throughput

vllm bench serve random 1024-in / 1024-out, cuda graphs ON, MTP-spec n/a (this artifact ships without MTP).

Hardware TP bs=1 output tok/s bs=1 TPOT median bs=2 output tok/s bs=2 TPOT median
2× DGX Spark 2 14–17
2× DGX Spark 2 (eager fallback) 3–4
2× RTX PRO 6000 2 47.5 20.8 ms 84.0 21.7 ms

Per-stream decode rate on RTX PRO 6000 is rock-stable across concurrency (TPOT mean stays at 21 ms, p99 only 23 ms). Aggregate input+output throughput at bs=2 reaches 420 tok/s.

Quick start

vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8 \
  --served-model-name DSV4-W4A16-FP8 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 16384 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code

Required env vars on SM 12.x sparse-MLA path: set VLLM_TRITON_MLA_SPARSE=1 and VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4. Without _HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel crashes during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered in _dequantize_and_gather_k_kernel (kernel falls back to a default block size that doesn't match V4-Flash's head dim). Full env block at findings/QUICKSTART_DUAL_SPARK.md §4.

Long-context (1M tokens, single stream): drop --max-num-seqs to 1, --gpu-memory-utilization to 0.90, set --max-model-len 1048576 --max-num-batched-tokens 8192 --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'.

Tensor parallelism: TP=2 is the only validated configuration. TP=1 OOMs on a single 141 GB H200; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).

RTX PRO 6000 (SM 12.0) only: set VLLM_USE_FLASHINFER_SAMPLER=0 — vLLM's FlashInfer-based top-p / top-k sampler JIT mis-parses the TORCH_CUDA_ARCH_LIST=12.0a token and incorrectly raises RuntimeError: FlashInfer requires GPUs with sm75 or higher.

Quantization recipe

Property Value
Dataset HuggingFaceH4/ultrachat_200k (V4 chat template)
Samples 768
Max sequence length 512
Per-rank batch size 4
Hardware 8× NVIDIA H200 (p5en.48xlarge)
Walltime ~14 hours

Required calibration environment

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600
export TORCH_NCCL_BLOCKING_WAIT=0
export NCCL_TIMEOUT=3600
export TORCH_CUDA_ARCH_LIST=9.0a
sudo mount -o remount,size=1800G /dev/shm

expandable_segments is calibration-only — must not be set during vLLM serving.

What didn't work (recorded so others don't waste cycles)

Config Result
samples=1024, bs=32, no offload, no expandable_segments OOM at Layer 3 (45–67 GiB activation alloc fail)
samples=1024, bs=8, same as above OOM at Layer 3 (32 GiB alloc fail)
samples=1024, bs=8, offload_hessians=True OOM at Layer 3 (30 GiB alloc fail; fragmentation blocks contiguous block)
samples=1024, bs=4, +offload_hessians, +expandable_segments NCCL collective timeout at Layer 22 (10 min default exceeded by per-rank drift)
samples=768, bs=4, +offload_hessians, +expandable_segments, +60min NCCL timeout Succeeded — 14h end-to-end
sequential_targets=["Linear"] (any sample count) torch.fx.proxy.TraceError on DeepseekV4Indexer.wrapped_1's data-dependent control flow — would need is_leaf_module patch to register Indexer as leaf

Recipe

from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization.quant_scheme import FP8_BLOCK, W4A16, QuantizationScheme

recipe = GPTQModifier(
    config_groups={
        "attention": QuantizationScheme(
            targets=[
                r"re:.*self_attn\.(q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$",
                r"re:.*self_attn\.compressor\.(gate_proj|kv_proj)$",
                r"re:.*self_attn\.compressor\.indexer\.(gate_proj|kv_proj|q_b_proj|weights_proj)$",
            ],
            **FP8_BLOCK,
        ),
        "experts": QuantizationScheme(
            targets=[r"re:.*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"],
            **W4A16,
        ),
    },
    ignore=["lm_head"],
    offload_hessians=True,
    dampening_frac=0.1,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=768,
    sequential_targets=["DeepseekV4DecoderLayer"],
    batch_size=4,
)

vLLM build

This artifact does not load on vanilla vLLM. Stack:

Component Pin Notes
jasl/vllm ds4-sm120-experimental (or ds4-sm120 for conservative) SM12x DSV4 support
kylesayrs deepseek-ct patch content-pinned, vendored at scripts/kylesayrs-deepseek-ct.patch Rebased successor of f910a73a93 (force-pushed out of upstream history; see issue #1)
packed_modules_mapping patch patches/packed_modules_mapping.diff Required as of abad5dc71 (2026-05-05) — kylesayrs patch doesn't add this attribute
Workspace pre-reservation patch landed upstream as jasl/vllm@1d6f5c4 Was vllm-project/vllm#41700 — no longer needs local apply

Single-file bootstrap script for dual DGX Spark: scripts/bootstrap_dsv4_spark.sh — does the whole stack zero-to-serving.

Upstream tracker: original PR #40991 (where Spark validation was posted) closed 2026-05-06; current tracker is PR #41834"[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes", branch codex/ds4-sm120-min-enable.

Honest limitations

  • No MTPtransformers 5.8.1's _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"] silently strips MTP keys during calibration load. Speculative decoding cannot fire with this artifact. The W4A16-FP8-MTP successor retains MTP via a patched calibration path and delivers 1.49× spec-decode speedup at bs=1.
  • TP > 2 blocked by vllm-project/vllm#41511 — W4A16 MoE scale-sharding bug.
  • H200 numbers from older vLLM build — H200 baseline was scored on jasl/vllm@428e08e (harness HEAD 85aca32). Same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 → Blackwell deltas are informational.
  • toolcall15 TC-06 (Multi-Value Extraction) and TC-08 (Conditional Branching) also fail on the native FP4/FP8 baseline — V4-Flash model-architecture limits, not quantization defects.
  • 2026-05-25: artifact has shipping issues on current upstream vLLM. Two problems were surfaced when attempting to load this artifact on jasl/vllm@a02a3778f (the post-PR-#40923 build the sibling W4A16-MTP card now uses): (1) Same FP8_BLOCK compressor/indexer shipping bug as the MTP sibling — current vLLM constructs those modules as plain BF16 (quant_config=None) and the artifact fails with KeyError: 'layers.10.attn.mla_attn.compressor.fused_wkv_wgate.weight_scale'. The MTP sibling fixed this by dequantizing those weights in-artifact to BF16; this artifact has not yet had that fix applied. (2) A separate architecture-drift issue: the artifact lacks the layers.N.ffn.gate.e_score_correction_bias tensor that current upstream vLLM's DSV4 loader requires (KeyError). Either re-calibration that emits this tensor, or a defensive .get() loader patch upstream is needed. The published H200/Spark/RTX PRO 6000 numbers above remain valid for their original jasl/vllm@ds4-sm120-experimental@abad5dc71 build (2026-05-05); they do not currently reproduce on bleeding-edge vLLM. Tracking and re-verification deferred to the next session.

Reproduction

Full toolchain, scripts, patches, mission report: canada-quant/dsv4-flash-w4a16-fp8.

Single-file bootstrap (dual DGX Spark, idempotent, SSH-orchestrated):

curl -fsSLO https://raw.githubusercontent.com/canada-quant/dsv4-flash-w4a16-fp8/main/scripts/bootstrap_dsv4_spark.sh
chmod +x bootstrap_dsv4_spark.sh
./bootstrap_dsv4_spark.sh --head-host spark-a --worker-host spark-b

Upstream contributions filed during this work

PR / Issue Description Status
vllm-project/vllm#41700 Workspace pre-reservation patch landed as jasl/vllm@1d6f5c4
vllm-project/vllm#41511 Marlin MoE TP scale-sharding bug open — blocks TP>2
vllm-project/vllm#40991#41834 SM12x DeepSeek V4 base support open (jasl)
vllm-project/vllm#41276 compressed-tensors V4 attention path open (kylesayrs)

Changes

Date Change
2026-05-06 DGX Spark TP=2 production canonical at 1M-token context graphs-ON validated on ds4-sm120-experimental
2026-05-08 Kylesayrs branch f910a73a93 force-pushed out of upstream history; vendored content-pinned rebased successor d09eeb498 at scripts/kylesayrs-deepseek-ct.patch (issue #1)
2026-05-19 HumanEval methodology correction: H200 pass@1 was scored at 54.27% under regex extraction; re-scored at 80.49% with --confirm_run_unsafe_code. Same artifact, methodology change. Earlier 54.27% number is shown struck through in the quality table
2026-05-23 Workspace pre-reservation patch landed upstream as jasl/vllm@1d6f5c4; closes our #41700. No local apply needed
2026-05-24 RTX PRO 6000 Blackwell (SM 12.0) added to validated hardware — chat-smoke 4/4, toolcall15 27/30 (90%), GSM8K 95.07%, NIAH 256K × 2 concurrent PASS
2026-05-25 Two shipping issues surfaced when re-testing on current upstream vLLM (jasl/vllm@a02a3778f). (1) Same FP8 compressor/indexer load-failure as the W4A16-MTP sibling — fixable via the same in-artifact BF16 dequant; not yet applied to this artifact. (2) Architecture-drift KeyError: 'layers.N.ffn.gate.e_score_correction_bias' — Card A's older safetensors (calibrated 2026-05-06) don't contain a tensor that current vLLM's DSV4 loader expects; needs re-calibration or a defensive loader patch. Published RTX PRO 6000 numbers above remain valid for the May-5 jasl build; current-build re-verification deferred. See session_summary_2026_05_24.md.

Files in the artifact

  • 30 sharded model-*.safetensors files + model.safetensors.index.json (143 GB total)
  • config.json — vLLM-compatible quantization_config (W4A16 + FP8_BLOCK groups)
  • tokenizer.json, tokenizer_config.json, generation_config.json — upstream DSV4-Flash
  • recipe.yaml — the llm-compressor calibration recipe
  • chat_template.jinja — upstream DSV4-Flash (unchanged)
  • README.md — this file

Citation

@misc{canada-quant-dsv4-flash-w4a16-fp8-2026,
  title  = {DeepSeek-V4-Flash W4A16-FP8 for vLLM on Hopper and Blackwell},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.

Acknowledgments

Downloads last month
7,438
Safetensors
Model size
44B params
Tensor type
I64
·
F32
·
I32
·
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for canada-quant/DeepSeek-V4-Flash-W4A16-FP8

Quantized
(55)
this model
Quantizations
1 model