- canada-quant/DeepSeek-V4-Flash-W4A16-FP8
canada-quant/DeepSeek-V4-Flash-W4A16-FP8
Mixed-precision quantization of deepseek-ai/DeepSeek-V4-Flash — W4A16 INT4 on routed experts + FP8 block 128×128 on attention — that loads cleanly on Hopper datacenter GPUs and on consumer-grade Blackwell. Recipe topology mirrors RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8; routed-expert format is W4A16 (Marlin) instead of NVFP4 for compatibility with SM 9.x / SM 12.x kernels.
TL;DR
| Recommended hardware | 2× DGX Spark or 2× RTX PRO 6000, TP=2 |
| Quality | GSM8K 95.07–95.45% strict (8-shot); HumanEval pass@1 78.05–80.49% (strict, --confirm_run_unsafe_code) |
| Throughput | 47–48 output tok/s @ bs=1 on RTX PRO 6000 TP=2 (TPOT 20.8 ms); 14–17 tok/s on DGX Spark TP=2 |
| Differentiator | Only quant of V4-Flash that serves on SM 9.x and SM 12.x; baseline for the W4A16-FP8-MTP successor |
Family / related artifacts
| Repo | Role | Relation to this artifact |
|---|---|---|
canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP |
successor | Same recipe + BF16 MTP retained for 1.49× spec-decode speedup at bs=1 |
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP |
sibling | NVFP4 routed experts (Blackwell-native), MTP retained |
canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP |
larger sibling | V4-Pro at NVFP4 with MTP, B300-only deployment |
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 |
upstream reference | Original mixed-precision topology (NVFP4 experts + FP8 attention) we adapted to W4A16 |
Why this exists
DeepSeek-V4-Flash launched April 24, 2026 (284 B total / 13 B active, hybrid CSA + HCA attention, hash-routed experts). At release, no merged path through transformers + llm-compressor + vLLM existed for V4 quantization on Hopper or on SM 12.x Blackwell. RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 covered Blackwell datacenter (B100/B200, SM 10.x) via NVFP4 tcgen05 kernels, and Intel/DeepSeek-V4-Flash-W4A16-AutoRound covered W4A16 but explicitly excluded vLLM and SGLang. This artifact fills the gap: W4A16 GPTQ routed experts + FP8 block attention that serves on vLLM at TP=2 on H200 (Hopper SM 9.0a), DGX Spark (Blackwell SM 12.1a), and RTX PRO 6000 (Blackwell SM 12.0) — same weights, three SKUs.
Architecture & precision
Base model
| Property | Value |
|---|---|
| Total parameters | |
| Decoder layers | 43 |
| Routed experts / layer | 256 (top-K = 6) |
| Hidden size | 4096 |
| Base BF16 size | ~543 GB |
| Quantized size | ~143 GB |
| Compression ratio | ~3.8× |
Component precisions
| Component | Format | Method |
|---|---|---|
| Routed experts (256 × 43 layers) | W4A16 INT4, group_size=128, symmetric | GPTQ via llm-compressor, dampening_frac=0.1 |
Attention path (q_a/q_b/kv/o_a/o_b, compressor, indexer) |
FP8_BLOCK 128×128 | Dynamic, data-free |
| Shared experts | BF16 | Excluded (kylesayrs PR #41276 incompatibility) |
Embeddings, lm_head, hc_head |
BF16 | Excluded |
| MTP block | dropped at load | Removed by transformers _keys_to_ignore_on_load_unexpected — see W4A16-FP8-MTP successor for the retention recipe |
Hardware validated
| Platform | SM | HBM/GPU | Interconnect | TP | Role |
|---|---|---|---|---|---|
| 8× NVIDIA H200 SXM5 | 9.0a | 141 GB HBM3e | NVLink | 2 (4× replicas) | Calibration + harness baseline |
| 2× NVIDIA DGX Spark (GB10) | 12.1a | 128 GB unified | NVLink-C2C | 2 | Long-context production (1M-token graphs-ON) |
| 2× NVIDIA RTX PRO 6000 Blackwell Server Edition | 12.0, sm_120 | 96 GB HBM | PCIe | 2 | Workstation Blackwell deployment |
All three SKUs serve cuda graphs ON (no --enforce-eager). Same artifact, no weight changes between SKUs — only vLLM build flags and a few env vars differ.
Benchmarks
Quality
Sampling: greedy, temperature 0. lm-eval-harness via OpenAI-compatible backend pointing at the local vLLM. Methodology disclosed per row.
| Benchmark | Setting | 8× H200 (older vLLM build) | 2× DGX Spark TP=2 | 2× RTX PRO 6000 TP=2 |
|---|---|---|---|---|
| GSM8K | 8-shot, flexible-extract | 92.87% ± 0.71 | 95.37% ± 0.58 | 94.99% ± 0.60 |
| GSM8K | 8-shot, strict-match | 95.45% ± 0.57 | 95.07% ± 0.60 | |
| MMLU | 5-shot | 87.27% ± 0.27 | (in flight) | (pending) |
| HumanEval | 0-shot pass@1 (instruct, --confirm_run_unsafe_code) |
80.49% ± 3.10 | 78.05% ± 3.24 | |
| chat-smoke (quick / quality / coding) | harness | 4/4 · 4/4 · 2/2 | 4/4 · 4/4 · 2/2 | 4/4 · 4/4 · 2/2 |
| toolcall15 | 1 round, 30 points | 26/30 (87%) | 41/45 (92%)⁴ | 27/30 (90%) |
| NIAH long-context (75K → 500K single) | retrieval | — | 4/4 retrieval | 5/5 retrieval |
| NIAH 256K × 2 concurrent | retrieval | — | fix landed in jasl@e734ace5 |
4/4 (377 s) |
¹ The H200 GSM8K strict-match of 42.61% was a chat-format extraction artifact, not a quality regression. The flexible-extract number (92.87%) is the comparable figure. Cross-checked on DGX Spark / RTX PRO 6000 with corrected extraction (95.07–95.45%).
² ³ HumanEval pass@1 on H200 was initially reported as 54.27% under regex-based extraction. The harness was later corrected to use --confirm_run_unsafe_code (executes generated code), which raised the same-artifact score to 80.49%. The Spark and RTX PRO 6000 runs use the corrected methodology; the H200 number is the same artifact re-scored. See Changes for the dated correction.
⁴ Spark toolcall15 is scored across 3 thinking modes (45 cases); H200 / RTX PRO 6000 are single-round (30 cases). Scores normalized to %.
Comparison caveat: the H200 numbers come from an older vLLM build (harness HEAD
85aca32,jasl/vllm@428e08e). Spark and RTX PRO 6000 numbers are on today'sds4-sm120-experimentaltip. The valid same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 ↔ Blackwell deltas are informational.
Throughput
vllm bench serve random 1024-in / 1024-out, cuda graphs ON, MTP-spec n/a (this artifact ships without MTP).
| Hardware | TP | bs=1 output tok/s | bs=1 TPOT median | bs=2 output tok/s | bs=2 TPOT median |
|---|---|---|---|---|---|
| 2× DGX Spark | 2 | 14–17 | — | — | — |
| 2× DGX Spark | 2 (eager fallback) | 3–4 | — | — | — |
| 2× RTX PRO 6000 | 2 | 47.5 | 20.8 ms | 84.0 | 21.7 ms |
Per-stream decode rate on RTX PRO 6000 is rock-stable across concurrency (TPOT mean stays at 21 ms, p99 only 23 ms). Aggregate input+output throughput at bs=2 reaches 420 tok/s.
Quick start
vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8 \
--served-model-name DSV4-W4A16-FP8 \
--tensor-parallel-size 2 \
--kv-cache-dtype fp8 \
--block-size 256 \
--max-model-len 16384 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.92 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--trust-remote-code
Required env vars on SM 12.x sparse-MLA path: set VLLM_TRITON_MLA_SPARSE=1 and VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4. Without _HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel crashes during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered in _dequantize_and_gather_k_kernel (kernel falls back to a default block size that doesn't match V4-Flash's head dim). Full env block at findings/QUICKSTART_DUAL_SPARK.md §4.
Long-context (1M tokens, single stream): drop --max-num-seqs to 1, --gpu-memory-utilization to 0.90, set --max-model-len 1048576 --max-num-batched-tokens 8192 --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'.
Tensor parallelism: TP=2 is the only validated configuration. TP=1 OOMs on a single 141 GB H200; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).
RTX PRO 6000 (SM 12.0) only: set VLLM_USE_FLASHINFER_SAMPLER=0 — vLLM's FlashInfer-based top-p / top-k sampler JIT mis-parses the TORCH_CUDA_ARCH_LIST=12.0a token and incorrectly raises RuntimeError: FlashInfer requires GPUs with sm75 or higher.
Quantization recipe
| Property | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (V4 chat template) |
| Samples | 768 |
| Max sequence length | 512 |
| Per-rank batch size | 4 |
| Hardware | 8× NVIDIA H200 (p5en.48xlarge) |
| Walltime | ~14 hours |
Required calibration environment
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600
export TORCH_NCCL_BLOCKING_WAIT=0
export NCCL_TIMEOUT=3600
export TORCH_CUDA_ARCH_LIST=9.0a
sudo mount -o remount,size=1800G /dev/shm
expandable_segments is calibration-only — must not be set during vLLM serving.
What didn't work (recorded so others don't waste cycles)
| Config | Result |
|---|---|
samples=1024, bs=32, no offload, no expandable_segments |
OOM at Layer 3 (45–67 GiB activation alloc fail) |
samples=1024, bs=8, same as above |
OOM at Layer 3 (32 GiB alloc fail) |
samples=1024, bs=8, offload_hessians=True |
OOM at Layer 3 (30 GiB alloc fail; fragmentation blocks contiguous block) |
samples=1024, bs=4, +offload_hessians, +expandable_segments |
NCCL collective timeout at Layer 22 (10 min default exceeded by per-rank drift) |
samples=768, bs=4, +offload_hessians, +expandable_segments, +60min NCCL timeout |
Succeeded — 14h end-to-end |
sequential_targets=["Linear"] (any sample count) |
torch.fx.proxy.TraceError on DeepseekV4Indexer.wrapped_1's data-dependent control flow — would need is_leaf_module patch to register Indexer as leaf |
Recipe
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization.quant_scheme import FP8_BLOCK, W4A16, QuantizationScheme
recipe = GPTQModifier(
config_groups={
"attention": QuantizationScheme(
targets=[
r"re:.*self_attn\.(q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$",
r"re:.*self_attn\.compressor\.(gate_proj|kv_proj)$",
r"re:.*self_attn\.compressor\.indexer\.(gate_proj|kv_proj|q_b_proj|weights_proj)$",
],
**FP8_BLOCK,
),
"experts": QuantizationScheme(
targets=[r"re:.*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"],
**W4A16,
),
},
ignore=["lm_head"],
offload_hessians=True,
dampening_frac=0.1,
)
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=512,
num_calibration_samples=768,
sequential_targets=["DeepseekV4DecoderLayer"],
batch_size=4,
)
vLLM build
This artifact does not load on vanilla vLLM. Stack:
| Component | Pin | Notes |
|---|---|---|
jasl/vllm |
ds4-sm120-experimental (or ds4-sm120 for conservative) |
SM12x DSV4 support |
| kylesayrs deepseek-ct patch | content-pinned, vendored at scripts/kylesayrs-deepseek-ct.patch |
Rebased successor of f910a73a93 (force-pushed out of upstream history; see issue #1) |
packed_modules_mapping patch |
patches/packed_modules_mapping.diff |
Required as of abad5dc71 (2026-05-05) — kylesayrs patch doesn't add this attribute |
| Workspace pre-reservation patch | landed upstream as jasl/vllm@1d6f5c4 |
Was vllm-project/vllm#41700 — no longer needs local apply |
Single-file bootstrap script for dual DGX Spark: scripts/bootstrap_dsv4_spark.sh — does the whole stack zero-to-serving.
Upstream tracker: original PR #40991 (where Spark validation was posted) closed 2026-05-06; current tracker is PR #41834 — "[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes", branch codex/ds4-sm120-min-enable.
Honest limitations
- No MTP —
transformers5.8.1's_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]silently strips MTP keys during calibration load. Speculative decoding cannot fire with this artifact. The W4A16-FP8-MTP successor retains MTP via a patched calibration path and delivers 1.49× spec-decode speedup at bs=1. - TP > 2 blocked by
vllm-project/vllm#41511— W4A16 MoE scale-sharding bug. - H200 numbers from older vLLM build — H200 baseline was scored on
jasl/vllm@428e08e(harness HEAD85aca32). Same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 → Blackwell deltas are informational. - toolcall15 TC-06 (Multi-Value Extraction) and TC-08 (Conditional Branching) also fail on the native FP4/FP8 baseline — V4-Flash model-architecture limits, not quantization defects.
- 2026-05-25: artifact has shipping issues on current upstream vLLM. Two problems were surfaced when attempting to load this artifact on
jasl/vllm@a02a3778f(the post-PR-#40923 build the siblingW4A16-MTP cardnow uses): (1) Same FP8_BLOCK compressor/indexer shipping bug as the MTP sibling — current vLLM constructs those modules as plain BF16 (quant_config=None) and the artifact fails withKeyError: 'layers.10.attn.mla_attn.compressor.fused_wkv_wgate.weight_scale'. The MTP sibling fixed this by dequantizing those weights in-artifact to BF16; this artifact has not yet had that fix applied. (2) A separate architecture-drift issue: the artifact lacks thelayers.N.ffn.gate.e_score_correction_biastensor that current upstream vLLM's DSV4 loader requires (KeyError). Either re-calibration that emits this tensor, or a defensive.get()loader patch upstream is needed. The published H200/Spark/RTX PRO 6000 numbers above remain valid for their originaljasl/vllm@ds4-sm120-experimental@abad5dc71build (2026-05-05); they do not currently reproduce on bleeding-edge vLLM. Tracking and re-verification deferred to the next session.
Reproduction
Full toolchain, scripts, patches, mission report: canada-quant/dsv4-flash-w4a16-fp8.
Single-file bootstrap (dual DGX Spark, idempotent, SSH-orchestrated):
curl -fsSLO https://raw.githubusercontent.com/canada-quant/dsv4-flash-w4a16-fp8/main/scripts/bootstrap_dsv4_spark.sh
chmod +x bootstrap_dsv4_spark.sh
./bootstrap_dsv4_spark.sh --head-host spark-a --worker-host spark-b
Upstream contributions filed during this work
| PR / Issue | Description | Status |
|---|---|---|
vllm-project/vllm#41700 |
Workspace pre-reservation patch | landed as jasl/vllm@1d6f5c4 |
vllm-project/vllm#41511 |
Marlin MoE TP scale-sharding bug | open — blocks TP>2 |
vllm-project/vllm#40991 → #41834 |
SM12x DeepSeek V4 base support | open (jasl) |
vllm-project/vllm#41276 |
compressed-tensors V4 attention path | open (kylesayrs) |
Changes
| Date | Change |
|---|---|
| 2026-05-06 | DGX Spark TP=2 production canonical at 1M-token context graphs-ON validated on ds4-sm120-experimental |
| 2026-05-08 | Kylesayrs branch f910a73a93 force-pushed out of upstream history; vendored content-pinned rebased successor d09eeb498 at scripts/kylesayrs-deepseek-ct.patch (issue #1) |
| 2026-05-19 | HumanEval methodology correction: H200 pass@1 was scored at 54.27% under regex extraction; re-scored at 80.49% with --confirm_run_unsafe_code. Same artifact, methodology change. Earlier 54.27% number is shown struck through in the quality table |
| 2026-05-23 | Workspace pre-reservation patch landed upstream as jasl/vllm@1d6f5c4; closes our #41700. No local apply needed |
| 2026-05-24 | RTX PRO 6000 Blackwell (SM 12.0) added to validated hardware — chat-smoke 4/4, toolcall15 27/30 (90%), GSM8K 95.07%, NIAH 256K × 2 concurrent PASS |
| 2026-05-25 | Two shipping issues surfaced when re-testing on current upstream vLLM (jasl/vllm@a02a3778f). (1) Same FP8 compressor/indexer load-failure as the W4A16-MTP sibling — fixable via the same in-artifact BF16 dequant; not yet applied to this artifact. (2) Architecture-drift KeyError: 'layers.N.ffn.gate.e_score_correction_bias' — Card A's older safetensors (calibrated 2026-05-06) don't contain a tensor that current vLLM's DSV4 loader expects; needs re-calibration or a defensive loader patch. Published RTX PRO 6000 numbers above remain valid for the May-5 jasl build; current-build re-verification deferred. See session_summary_2026_05_24.md. |
Files in the artifact
30 sharded143 GB total)model-*.safetensorsfiles +model.safetensors.index.json(config.json— vLLM-compatible quantization_config (W4A16 + FP8_BLOCK groups)tokenizer.json,tokenizer_config.json,generation_config.json— upstream DSV4-Flashrecipe.yaml— the llm-compressor calibration recipechat_template.jinja— upstream DSV4-Flash (unchanged)README.md— this file
Citation
@misc{canada-quant-dsv4-flash-w4a16-fp8-2026,
title = {DeepSeek-V4-Flash W4A16-FP8 for vLLM on Hopper and Blackwell},
author = {Canada Quant},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8}
}
License
MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.
Acknowledgments
- @jasl — DeepSeek-V4 vLLM SM12x base support (PR
#40991→#41834); memory-pressure-release fixe734ace5that resolved the Blackwell 256K×2 stall. - @kylesayrs — compressed-tensors V4 attention path (PR
#41276). - @aabbccddwasd — indexer KV cache layout fix.
- @bbbearxyz — SM12x Triton fallback kernels.
- @wuwenthink — SM12x harness validation.
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8— published reference for V4 mixed-precision attention topology.
- Downloads last month
- 7,438