canada-quant/DeepSeek-V4-Flash-W4A16-FP8

Mixed-precision quantization of deepseek-ai/DeepSeek-V4-Flash — W4A16 INT4 on routed experts + FP8 block 128×128 on attention — that loads cleanly on Hopper datacenter GPUs and on consumer-grade Blackwell. Recipe topology mirrors RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8; routed-expert format is W4A16 (Marlin) instead of NVFP4 for compatibility with SM 9.x / SM 12.x kernels.

TL;DR


Recommended hardware	2× DGX Spark or 2× RTX PRO 6000, TP=2
Quality	GSM8K 95.07–95.45% strict (8-shot); HumanEval pass@1 78.05–80.49% (strict, `--confirm_run_unsafe_code`)
Throughput	47–48 output tok/s @ bs=1 on RTX PRO 6000 TP=2 (TPOT 20.8 ms); 14–17 tok/s on DGX Spark TP=2
Differentiator	Only quant of V4-Flash that serves on SM 9.x and SM 12.x; baseline for the W4A16-FP8-MTP successor

Family / related artifacts

Repo	Role	Relation to this artifact
`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`	successor	Same recipe + BF16 MTP retained for 1.49× spec-decode speedup at bs=1
`canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP`	sibling	NVFP4 routed experts (Blackwell-native), MTP retained
`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`	larger sibling	V4-Pro at NVFP4 with MTP, B300-only deployment
`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`	upstream reference	Original mixed-precision topology (NVFP4 experts + FP8 attention) we adapted to W4A16

Why this exists

DeepSeek-V4-Flash launched April 24, 2026 (284 B total / 13 B active, hybrid CSA + HCA attention, hash-routed experts). At release, no merged path through transformers + llm-compressor + vLLM existed for V4 quantization on Hopper or on SM 12.x Blackwell. RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 covered Blackwell datacenter (B100/B200, SM 10.x) via NVFP4 tcgen05 kernels, and Intel/DeepSeek-V4-Flash-W4A16-AutoRound covered W4A16 but explicitly excluded vLLM and SGLang. This artifact fills the gap: W4A16 GPTQ routed experts + FP8 block attention that serves on vLLM at TP=2 on H200 (Hopper SM 9.0a), DGX Spark (Blackwell SM 12.1a), and RTX PRO 6000 (Blackwell SM 12.0) — same weights, three SKUs.

Architecture & precision

Base model

Property	Value
Total parameters	~~284 B (~~13 B active per token)
Decoder layers	43
Routed experts / layer	256 (top-K = 6)
Hidden size	4096
Base BF16 size	~543 GB
Quantized size	~143 GB
Compression ratio	~3.8×

Component precisions

Component	Format	Method
Routed experts (256 × 43 layers)	W4A16 INT4, group_size=128, symmetric	GPTQ via llm-compressor, `dampening_frac=0.1`
Attention path (`q_a/q_b/kv/o_a/o_b`, compressor, indexer)	FP8_BLOCK 128×128	Dynamic, data-free
Shared experts	BF16	Excluded (kylesayrs PR #41276 incompatibility)
Embeddings, `lm_head`, `hc_head`	BF16	Excluded
MTP block	dropped at load	Removed by `transformers` `_keys_to_ignore_on_load_unexpected` — see W4A16-FP8-MTP successor for the retention recipe

Hardware validated

Platform	SM	HBM/GPU	Interconnect	TP	Role
8× NVIDIA H200 SXM5	9.0a	141 GB HBM3e	NVLink	2 (4× replicas)	Calibration + harness baseline
2× NVIDIA DGX Spark (GB10)	12.1a	128 GB unified	NVLink-C2C	2	Long-context production (1M-token graphs-ON)
2× NVIDIA RTX PRO 6000 Blackwell Server Edition	12.0, sm_120	96 GB HBM	PCIe	2	Workstation Blackwell deployment

All three SKUs serve cuda graphs ON (no --enforce-eager). Same artifact, no weight changes between SKUs — only vLLM build flags and a few env vars differ.

Benchmarks

Quality

Sampling: greedy, temperature 0. lm-eval-harness via OpenAI-compatible backend pointing at the local vLLM. Methodology disclosed per row.

Benchmark	Setting	8× H200 (older vLLM build)	2× DGX Spark TP=2	2× RTX PRO 6000 TP=2
GSM8K	8-shot, flexible-extract	92.87% ± 0.71	95.37% ± 0.58	94.99% ± 0.60
GSM8K	8-shot, strict-match	~~42.61%~~¹ → see note	95.45% ± 0.57	95.07% ± 0.60
MMLU	5-shot	87.27% ± 0.27	(in flight)	(pending)
HumanEval	0-shot pass@1 (instruct, `--confirm_run_unsafe_code`)	~~54.27% ± 3.9~~² → 80.49% ± 3.10³	80.49% ± 3.10	78.05% ± 3.24
chat-smoke (quick / quality / coding)	harness	4/4 · 4/4 · 2/2	4/4 · 4/4 · 2/2	4/4 · 4/4 · 2/2
toolcall15	1 round, 30 points	26/30 (87%)	41/45 (92%)⁴	27/30 (90%)
NIAH long-context (75K → 500K single)	retrieval	—	4/4 retrieval	5/5 retrieval
NIAH 256K × 2 concurrent	retrieval	—	fix landed in `jasl@e734ace5`	4/4 (377 s)

¹ The H200 GSM8K strict-match of 42.61% was a chat-format extraction artifact, not a quality regression. The flexible-extract number (92.87%) is the comparable figure. Cross-checked on DGX Spark / RTX PRO 6000 with corrected extraction (95.07–95.45%).

² ³ HumanEval pass@1 on H200 was initially reported as 54.27% under regex-based extraction. The harness was later corrected to use --confirm_run_unsafe_code (executes generated code), which raised the same-artifact score to 80.49%. The Spark and RTX PRO 6000 runs use the corrected methodology; the H200 number is the same artifact re-scored. See Changes for the dated correction.

⁴ Spark toolcall15 is scored across 3 thinking modes (45 cases); H200 / RTX PRO 6000 are single-round (30 cases). Scores normalized to %.

Comparison caveat: the H200 numbers come from an older vLLM build (harness HEAD 85aca32, jasl/vllm@428e08e). Spark and RTX PRO 6000 numbers are on today's ds4-sm120-experimental tip. The valid same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 ↔ Blackwell deltas are informational.

Throughput

vllm bench serve random 1024-in / 1024-out, cuda graphs ON, MTP-spec n/a (this artifact ships without MTP).

Hardware	TP	bs=1 output tok/s	bs=1 TPOT median	bs=2 output tok/s	bs=2 TPOT median
2× DGX Spark	2	14–17	—	—	—
2× DGX Spark	2 (eager fallback)	3–4	—	—	—
2× RTX PRO 6000	2	47.5	20.8 ms	84.0	21.7 ms

Per-stream decode rate on RTX PRO 6000 is rock-stable across concurrency (TPOT mean stays at 21 ms, p99 only 23 ms). Aggregate input+output throughput at bs=2 reaches 420 tok/s.

Quick start

vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8 \
  --served-model-name DSV4-W4A16-FP8 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 16384 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code

Required env vars on SM 12.x sparse-MLA path: set VLLM_TRITON_MLA_SPARSE=1 and VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4. Without _HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel crashes during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered in _dequantize_and_gather_k_kernel (kernel falls back to a default block size that doesn't match V4-Flash's head dim). Full env block at findings/QUICKSTART_DUAL_SPARK.md §4.

Long-context (1M tokens, single stream): drop --max-num-seqs to 1, --gpu-memory-utilization to 0.90, set --max-model-len 1048576 --max-num-batched-tokens 8192 --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'.

Tensor parallelism: TP=2 is the only validated configuration. TP=1 OOMs on a single 141 GB H200; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).

RTX PRO 6000 (SM 12.0) only: set VLLM_USE_FLASHINFER_SAMPLER=0 — vLLM's FlashInfer-based top-p / top-k sampler JIT mis-parses the TORCH_CUDA_ARCH_LIST=12.0a token and incorrectly raises RuntimeError: FlashInfer requires GPUs with sm75 or higher.

Quantization recipe

Property	Value
Dataset	`HuggingFaceH4/ultrachat_200k` (V4 chat template)
Samples	768
Max sequence length	512
Per-rank batch size	4
Hardware	8× NVIDIA H200 (`p5en.48xlarge`)
Walltime	~14 hours

Required calibration environment

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600
export TORCH_NCCL_BLOCKING_WAIT=0
export NCCL_TIMEOUT=3600
export TORCH_CUDA_ARCH_LIST=9.0a
sudo mount -o remount,size=1800G /dev/shm

expandable_segments is calibration-only — must not be set during vLLM serving.

What didn't work (recorded so others don't waste cycles)

Config	Result
`samples=1024, bs=32, no offload, no expandable_segments`	OOM at Layer 3 (45–67 GiB activation alloc fail)
`samples=1024, bs=8`, same as above	OOM at Layer 3 (32 GiB alloc fail)
`samples=1024, bs=8, offload_hessians=True`	OOM at Layer 3 (30 GiB alloc fail; fragmentation blocks contiguous block)
`samples=1024, bs=4, +offload_hessians, +expandable_segments`	NCCL collective timeout at Layer 22 (10 min default exceeded by per-rank drift)
`samples=768, bs=4, +offload_hessians, +expandable_segments, +60min NCCL timeout`	Succeeded — 14h end-to-end
`sequential_targets=["Linear"]` (any sample count)	`torch.fx.proxy.TraceError` on `DeepseekV4Indexer.wrapped_1`'s data-dependent control flow — would need `is_leaf_module` patch to register Indexer as leaf

Recipe

from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization.quant_scheme import FP8_BLOCK, W4A16, QuantizationScheme

recipe = GPTQModifier(
    config_groups={
        "attention": QuantizationScheme(
            targets=[
                r"re:.*self_attn\.(q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$",
                r"re:.*self_attn\.compressor\.(gate_proj|kv_proj)$",
                r"re:.*self_attn\.compressor\.indexer\.(gate_proj|kv_proj|q_b_proj|weights_proj)$",
            ],
            **FP8_BLOCK,
        ),
        "experts": QuantizationScheme(
            targets=[r"re:.*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"],
            **W4A16,
        ),
    },
    ignore=["lm_head"],
    offload_hessians=True,
    dampening_frac=0.1,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=768,
    sequential_targets=["DeepseekV4DecoderLayer"],
    batch_size=4,
)

vLLM build

This artifact does not load on vanilla vLLM. Stack:

Component	Pin	Notes
`jasl/vllm`	`ds4-sm120-experimental` (or `ds4-sm120` for conservative)	SM12x DSV4 support
kylesayrs deepseek-ct patch	content-pinned, vendored at `scripts/kylesayrs-deepseek-ct.patch`	Rebased successor of `f910a73a93` (force-pushed out of upstream history; see issue #1)
`packed_modules_mapping` patch	`patches/packed_modules_mapping.diff`	Required as of `abad5dc71` (2026-05-05) — kylesayrs patch doesn't add this attribute
Workspace pre-reservation patch	landed upstream as `jasl/vllm@1d6f5c4`	Was `vllm-project/vllm#41700` — no longer needs local apply

Single-file bootstrap script for dual DGX Spark: scripts/bootstrap_dsv4_spark.sh — does the whole stack zero-to-serving.

Upstream tracker: original PR #40991 (where Spark validation was posted) closed 2026-05-06; current tracker is PR #41834 — "[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes", branch codex/ds4-sm120-min-enable.

Honest limitations

No MTP — transformers 5.8.1's _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"] silently strips MTP keys during calibration load. Speculative decoding cannot fire with this artifact. The W4A16-FP8-MTP successor retains MTP via a patched calibration path and delivers 1.49× spec-decode speedup at bs=1.
TP > 2 blocked by vllm-project/vllm#41511 — W4A16 MoE scale-sharding bug.
H200 numbers from older vLLM build — H200 baseline was scored on jasl/vllm@428e08e (harness HEAD 85aca32). Same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 → Blackwell deltas are informational.
toolcall15 TC-06 (Multi-Value Extraction) and TC-08 (Conditional Branching) also fail on the native FP4/FP8 baseline — V4-Flash model-architecture limits, not quantization defects.
2026-05-25: artifact has shipping issues on current upstream vLLM. Two problems were surfaced when attempting to load this artifact on jasl/vllm@a02a3778f (the post-PR-#40923 build the sibling W4A16-MTP card now uses): (1) Same FP8_BLOCK compressor/indexer shipping bug as the MTP sibling — current vLLM constructs those modules as plain BF16 (quant_config=None) and the artifact fails with KeyError: 'layers.10.attn.mla_attn.compressor.fused_wkv_wgate.weight_scale'. The MTP sibling fixed this by dequantizing those weights in-artifact to BF16; this artifact has not yet had that fix applied. (2) A separate architecture-drift issue: the artifact lacks the layers.N.ffn.gate.e_score_correction_bias tensor that current upstream vLLM's DSV4 loader requires (KeyError). Either re-calibration that emits this tensor, or a defensive .get() loader patch upstream is needed. The published H200/Spark/RTX PRO 6000 numbers above remain valid for their original jasl/vllm@ds4-sm120-experimental@abad5dc71 build (2026-05-05); they do not currently reproduce on bleeding-edge vLLM. Tracking and re-verification deferred to the next session.

Reproduction

Full toolchain, scripts, patches, mission report: canada-quant/dsv4-flash-w4a16-fp8.

Single-file bootstrap (dual DGX Spark, idempotent, SSH-orchestrated):

curl -fsSLO https://raw.githubusercontent.com/canada-quant/dsv4-flash-w4a16-fp8/main/scripts/bootstrap_dsv4_spark.sh
chmod +x bootstrap_dsv4_spark.sh
./bootstrap_dsv4_spark.sh --head-host spark-a --worker-host spark-b

Upstream contributions filed during this work

PR / Issue	Description	Status
`vllm-project/vllm#41700`	Workspace pre-reservation patch	landed as `jasl/vllm@1d6f5c4`
`vllm-project/vllm#41511`	Marlin MoE TP scale-sharding bug	open — blocks TP>2
`vllm-project/vllm#40991` → `#41834`	SM12x DeepSeek V4 base support	open (jasl)
`vllm-project/vllm#41276`	compressed-tensors V4 attention path	open (kylesayrs)

Changes

Date	Change
2026-05-06	DGX Spark TP=2 production canonical at 1M-token context graphs-ON validated on `ds4-sm120-experimental`
2026-05-08	Kylesayrs branch `f910a73a93` force-pushed out of upstream history; vendored content-pinned rebased successor `d09eeb498` at `scripts/kylesayrs-deepseek-ct.patch` (issue #1)
2026-05-19	HumanEval methodology correction: H200 pass@1 was scored at 54.27% under regex extraction; re-scored at 80.49% with `--confirm_run_unsafe_code`. Same artifact, methodology change. Earlier 54.27% number is shown struck through in the quality table
2026-05-23	Workspace pre-reservation patch landed upstream as `jasl/vllm@1d6f5c4`; closes our `#41700`. No local apply needed
2026-05-24	RTX PRO 6000 Blackwell (SM 12.0) added to validated hardware — chat-smoke 4/4, toolcall15 27/30 (90%), GSM8K 95.07%, NIAH 256K × 2 concurrent PASS
2026-05-25	Two shipping issues surfaced when re-testing on current upstream vLLM (`jasl/vllm@a02a3778f`). (1) Same FP8 compressor/indexer load-failure as the W4A16-MTP sibling — fixable via the same in-artifact BF16 dequant; not yet applied to this artifact. (2) Architecture-drift `KeyError: 'layers.N.ffn.gate.e_score_correction_bias'` — Card A's older safetensors (calibrated 2026-05-06) don't contain a tensor that current vLLM's DSV4 loader expects; needs re-calibration or a defensive loader patch. Published RTX PRO 6000 numbers above remain valid for the May-5 jasl build; current-build re-verification deferred. See `session_summary_2026_05_24.md`.

Files in the artifact

~~30 sharded model-*.safetensors files + model.safetensors.index.json (~~143 GB total)
config.json — vLLM-compatible quantization_config (W4A16 + FP8_BLOCK groups)
tokenizer.json, tokenizer_config.json, generation_config.json — upstream DSV4-Flash
recipe.yaml — the llm-compressor calibration recipe
chat_template.jinja — upstream DSV4-Flash (unchanged)
README.md — this file

Citation

@misc{canada-quant-dsv4-flash-w4a16-fp8-2026,
  title  = {DeepSeek-V4-Flash W4A16-FP8 for vLLM on Hopper and Blackwell},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.

Acknowledgments

@jasl — DeepSeek-V4 vLLM SM12x base support (PR #40991 → #41834); memory-pressure-release fix e734ace5 that resolved the Blackwell 256K×2 stall.
@kylesayrs — compressed-tensors V4 attention path (PR #41276).
@aabbccddwasd — indexer KV cache layout fix.
@bbbearxyz — SM12x Triton fallback kernels.
@wuwenthink — SM12x harness validation.
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 — published reference for V4 mixed-precision attention topology.

Downloads last month: 7,438

Safetensors

Model size

44B params

Tensor type

I64

F32

I32

BF16

F8_E4M3

Model tree for canada-quant/DeepSeek-V4-Flash-W4A16-FP8

Base model

deepseek-ai/DeepSeek-V4-Flash

Quantized

(55)

this model

Quantizations

1 model