DeepSeek-V4-Flash — W4A8 (INT4 weights + FP8 dynamic-token activations)

A W4A8 quantization of DeepSeek-V4-Flash: INT4 group-quantized MoE expert weights with FP8 (e4m3) dynamic per-token activations, plus FP8 block-quantized attention/dense layers. Produced as a zero-cost config transformation of canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP — the INT4 weight bytes are identical; only the activation quantization scheme in config.json changed (experts input_activations: null → FP8 dynamic-token).

⚠️ Honest headline first: on H200 (Hopper / SM90) this was the fastest single config in our sweep, with the best TP2 prefill TTFT (1658 ms @24k) and the highest per-GPU prefill throughput (7410 tok/s/GPU) of every cell tested. It ties its W4A16 parent (~2%, within run-to-run noise; the "W4A8 should be ~2× faster than W4A16" hypothesis was refuted), but it beats the FP4-marlin config by ~9–13% on the same 2×H200 footprint (int4→Marlin > nvfp4→Marlin). One caveat: it is vLLM-only (sglang can't load this checkpoint format), so it isn't a drop-in for an sglang deployment. See Investigation & findings.

📦 This is a config / recipe repository: the weight shards are NOT included. Because the W4A8 transformation reuses the base's INT4 weights byte-for-byte, duplicating ~159 GB here would be pure waste. This repo ships the W4A8 config.json, tokenizer, weight index, and this card. To get a runnable checkpoint, pull the weights from the base and drop in this config.json (see Getting the weights).

What this is

| Field | Value |
|---|---|
| Base architecture | DeepSeek-V4-Flash (284B total / ~13B active MoE, 43 layers, 256 routed experts top-6 + 1 shared, MLA, hybrid sparse attention + Lightning indexer) |
| Derived from | canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (identical INT4 expert weights) |
| MoE experts | INT4 group-quantized weights + FP8 e4m3 dynamic per-token activations (W4A8) |
| Attention / dense | FP8 block-quantized weights (unchanged from base) |
| Format | mixed-precision (compressed-tensors) |
| Footprint | ~159 GB materialized, fits TP2 on 2×H200 (identical to the W4A16 base). Weights not stored here; see Getting the weights. |
| Target hardware | NVIDIA Hopper (H100/H200, SM90) |
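As an illustration of the two schemes named above, here is a minimal pure-Python sketch of symmetric INT4 group quantization and of the dynamic per-token scale used for FP8 activations. It is not the code that produced this checkpoint: the function names and the group size are hypothetical, and real rounding to e4m3 is omitted (only the scale is computed, using 448 as the largest finite e4m3 value).

```python
from typing import List, Tuple

FP8_E4M3_MAX = 448.0          # largest finite magnitude representable in e4m3
INT4_MIN, INT4_MAX = -8, 7    # signed 4-bit integer range


def quantize_int4_groups(weights: List[float], group_size: int) -> Tuple[List[int], List[float]]:
    """Symmetric INT4 quantization with one scale per group of weights.

    group_size is a free parameter here; the value used by the real
    checkpoint is not stated in this card.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group to INT4_MAX; fall back to 1.0 for an all-zero group.
        scale = max(abs(w) for w in group) / INT4_MAX or 1.0
        scales.append(scale)
        codes.extend(max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in group)
    return codes, scales


def per_token_fp8_scale(token_activations: List[float]) -> float:
    """Dynamic per-token scale: derived at run time from this token's own values."""
    return max(abs(x) for x in token_activations) / FP8_E4M3_MAX or 1.0
```

"Dynamic" means no calibration data is needed for the activation side: each token's scale comes from that token alone, which is why the W4A8 conversion described below can be a config-only change.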

How it was made

DeepSeek-V4-Flash's MoE experts are stored as INT4. A W4A16 checkpoint runs those INT4 weights through a Marlin dequant→BF16 GEMM; a W4A8 checkpoint instead pairs the same INT4 weights with FP8 activations, so vLLM dispatches them to the native CutlassExpertsW4A8Fp8 kernel on SM90 (_is_fp8_w4a8_sm90).

Because the weights are unchanged, the conversion is a pure config.json edit — no re-quantization, no calibration:

// experts config group, input_activations: null  ->
"input_activations": {
  "num_bits": 8, "type": "float", "strategy": "token",
  "dynamic": true, "symmetric": true
}

The _w4a8_conversion key in config.json records this provenance.
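The edit can also be applied programmatically. The sketch below works on an in-memory dict and assumes the usual compressed-tensors layout, with config groups stored under quantization_config → config_groups. That key path, the function name, and the group name passed in are assumptions to check against the actual config.json; the sketch does not write the _w4a8_conversion key.

```python
import copy

# The activation block from the snippet above, expressed as a Python dict.
FP8_DYNAMIC_TOKEN = {
    "num_bits": 8,
    "type": "float",
    "strategy": "token",
    "dynamic": True,
    "symmetric": True,
}


def to_w4a8(config: dict, experts_group: str) -> dict:
    """Return a copy of `config` with FP8 dynamic-token activations on one group.

    `experts_group` is the name of the config group that targets the MoE
    experts (hypothetical here; look it up in the real file).
    """
    new_config = copy.deepcopy(config)
    # Assumed key path for a compressed-tensors checkpoint.
    group = new_config["quantization_config"]["config_groups"][experts_group]
    if group.get("input_activations") is not None:
        raise ValueError(f"{experts_group} already has activation quantization")
    group["input_activations"] = dict(FP8_DYNAMIC_TOKEN)
    return new_config
```

The input dict is left untouched, so the result can be diffed against the original before being written back with json.dump.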

Getting the weights

The INT4 weight shards are identical to the base. Materialize a full checkpoint by downloading the base weights and overwriting config.json with this repo's W4A8 config:

# 1. base weights (INT4 shards, tokenizer) — the actual ~159 GB
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP --local-dir dsv4-w4a8

# 2. this repo's W4A8 config + card (the only real diff)
hf download endnai/DeepSeek-V4-Flash-W4A8-FP8 config.json README.md --local-dir dsv4-w4a8

# dsv4-w4a8/ is now a complete W4A8 checkpoint (INT4 weights + FP8-activation config)

The .safetensors bytes are unchanged; only config.json's expert input_activations differ (see How it was made).
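If you keep a separate copy of the base download, the byte-identity claim can be checked by hashing the shards in both directories. This is a minimal sketch using only the standard library; the function names are hypothetical, and the directory arguments are whatever paths you downloaded to.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_bytes: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_shards(dir_a: str, dir_b: str) -> List[str]:
    """Names of .safetensors files that are missing on one side or differ in content."""
    shards_a = {p.name: p for p in Path(dir_a).glob("*.safetensors")}
    shards_b = {p.name: p for p in Path(dir_b).glob("*.safetensors")}
    names = sorted(set(shards_a) | set(shards_b))
    return [
        name for name in names
        if name not in shards_a
        or name not in shards_b
        or sha256_of(shards_a[name]) != sha256_of(shards_b[name])
    ]
```

An empty list means every shard is present in both directories with identical bytes.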

Serving (vLLM)

Requires a recent vLLM nightly and, at the time of writing, the small patches listed below to load the DeepSeek-V4-Flash compressed-tensors checkpoint. These are model-loading fixes, not W4A8-specific: the same patches are needed for the W4A16 base on nightly.

  1. packed_modules_mapping for the model and MTP module (fused_wqa_wkv, fused_wkv_wgate, gate_up_proj).
  2. hash_moe added to the transformers ALLOWED_LAYER_TYPES global allowlist.
  3. o_proj weight-scale name alias (weight_scale_inv → weight_scale).

Launch (2×H200, TP2) from the materialized directory (see Getting the weights):

vllm serve ./dsv4-w4a8 \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --trust-remote-code

--disable-custom-all-reduce avoids a TP2 init hang under confidential-compute (custom all-reduce needs CUDA-IPC/symmetric memory, which is unavailable inside TDX CVMs).

Correctness: outputs match the W4A16 base on a temp=0 quality probe (GSM8K, 3/3 identical).
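A probe of this kind can be sketched against two running servers through vLLM's OpenAI-compatible /v1/completions endpoint. This is not the script used for the result above: the helper names, the prompt set, and the choice of exact string equality as the match criterion are all assumptions.

```python
import json
import urllib.request
from typing import Callable, Iterable


def complete(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> str:
    """Greedy (temperature 0) completion from an OpenAI-compatible server."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


def count_identical(
    prompts: Iterable[str],
    generate_a: Callable[[str], str],
    generate_b: Callable[[str], str],
) -> int:
    """Number of prompts for which both generators return exactly the same text."""
    return sum(generate_a(p) == generate_b(p) for p in prompts)
```

In use, generate_a and generate_b would each wrap complete with one server's URL and model name, for example with functools.partial or a lambda.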

Investigation & findings

This checkpoint was built to test a hypothesis: the DeepSeek-V4-Flash prefill bottleneck is the INT4→BF16 Marlin MoE GEMM, so a W4A8 path (native FP8 activation GEMM) should be ~1.5–2× faster. The hypothesis was refuted. Full sweep on 2–8×H200 (TP2 unless noted), single-request prefill ladder (c=1), long-context (ISL up to 24k):

Headline: W4A8 leads the TP2 matrix, but ties W4A16

| Config | Engine | TP | Prefill TTFT @24k | Prefill tok/s/GPU @24k |
|---|---|---|---|---|
| W4A8 (this model) | vLLM | 2 | 1658 ms | 7410 |
| W4A16 (base) | vLLM | 2 | 1691 ms | 7267 |
| FP4 (marlin) | vLLM | 2 | 1824 ms | 7090 |
| FP4 (marlin) | sglang | 2 | 1894 ms | 6832 |
| FP8 (native) | sglang | 4 | 892 ms | 6888 |

W4A8 is the fastest TP2 config and the highest per-GPU throughput of every cell measured. Three things to read carefully:

  • vs W4A16 (its parent): a tie — 1658 vs 1691 ms is ~2%, within run-to-run noise. The specific hypothesis this checkpoint was built to test — "FP8-activation MoE GEMM should be ~1.5–2× faster than W4A16" — was refuted. At prefill batch-M the MoE is weight-bandwidth-bound, so activation precision doesn't move it and Marlin-W4A16 already matches Cutlass-W4A8.
  • vs FP4-marlin: a real ~9–13% win. int4→Marlin beats nvfp4→Marlin, so W4A8 (and W4A16) beat the FP4-marlin config. FP4-marlin is what production currently runs, so W4A8/W4A16 are meaningfully faster than the deployed config on the same 2-GPU footprint.
  • The FP8-TP4 cell's low absolute TTFT (892 ms) is tensor-parallel scaling (2× the GPUs); per-GPU, W4A8-TP2 still wins (7410 > 6888).
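The percentages quoted above can be reproduced from the TTFT column of the table. The sketch below assumes the gaps are expressed as TTFT reduction relative to the slower config, which is one reading that matches the quoted ~2% and ~9–13% figures; the dictionary keys and function name are just labels for this example.

```python
# Prefill TTFT at 24k, in milliseconds, copied from the TP2 rows of the table.
TTFT_MS = {
    "W4A8 vLLM": 1658,
    "W4A16 vLLM": 1691,
    "FP4-marlin vLLM": 1824,
    "FP4-marlin sglang": 1894,
}


def ttft_reduction_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Percentage by which the candidate's TTFT is lower than the baseline's."""
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms


if __name__ == "__main__":
    w4a8 = TTFT_MS["W4A8 vLLM"]
    for name, baseline in TTFT_MS.items():
        if name != "W4A8 vLLM":
            print(f"W4A8 vs {name}: {ttft_reduction_pct(baseline, w4a8):.1f}% lower TTFT")
```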

Per-GPU throughput spans a narrow ~6.8–7.4k tok/s/GPU band across all cells — the architecture sets a ceiling — but within that band W4A8 sits at the top.

TP4 for this checkpoint is not yet benched — see To-do. Given W4A8-TP2 already leads on both TTFT and per-GPU, W4A8-TP4 is the most likely config to beat the FP8-TP4 892 ms absolute latency.

Why the activation-precision lever doesn't help

At prefill batch sizes, the DeepSeek-V4-Flash MoE (top-6 of 256 small experts) is weight-bandwidth-bound, not compute-bound on the expert GEMM. INT4 weights are already the bandwidth-optimal format, and Marlin's INT4→BF16 path already matches the Cutlass W4A8 kernel in practice. Switching activations from BF16/FP8-implicit to FP8 changes the activation precision but not the dominant cost. The compute-bound portion of prefill is dominated by format-shared work — FP8-block MLA attention and the sparse / Lightning-indexer passes over long context — which is identical across all three checkpoints.

The prefill ceiling is architectural on Hopper

  • Prefill scales linearly above 8k tokens (+547 ms per +8k) with GPUs at ~100% util and ~690 W (near TDP) → tensor-core-bound, not launch- or attention-quadratic-bound.
  • The two kernel improvements that would help — native NVFP4 MoE GEMM and the FP4 Lightning-indexer cache — are Blackwell-only (SM100). On Hopper, sglang/vLLM fall back to Marlin.
  • A W4A8 SM90 grouped-GEMM tuned for the DeepSeek-V4 MoE path is unimplemented upstream (relevant issues closed as inactive). Even so, the tie with W4A16 above suggests it would offer little at prefill batch-M.
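The linear-scaling observation can be turned into a rough TTFT estimate. The sketch below pairs the +547 ms per +8k slope with the W4A8 TP2 point from the table (1658 ms at 24k). That pairing is an assumption, since the card does not say which config the slope was measured on, and anything beyond 24k is extrapolation outside the measured range.

```python
def estimate_ttft_ms(
    isl_k: float,
    anchor_isl_k: float = 24.0,      # measured point: 24k-token prompt ...
    anchor_ttft_ms: float = 1658.0,  # ... at 1658 ms (W4A8, TP2)
    slope_ms_per_8k: float = 547.0,  # reported slope above 8k tokens
) -> float:
    """Linear prefill-TTFT estimate, with prompt length given in thousands of tokens."""
    if isl_k < 8.0:
        raise ValueError("the linear regime is only reported above 8k tokens")
    return anchor_ttft_ms + slope_ms_per_8k * (isl_k - anchor_isl_k) / 8.0
```

Under these assumptions a 16k prompt comes out one slope step below the 24k anchor and a 32k prompt one step above it.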

What does move the needle (deployment)

  • Prefix caching is the dominant lever: in production, DeepSeek-V4-Flash sees a ~55% radix prefix-cache hit rate on real agent/RAG traffic (measured over 24h), i.e. more than half of all prefill is skipped. This is already captured by sglang RadixAttention in production.
  • Larger chunked-prefill (8192 → 16384) gives ~7% faster long-context prefill TTFT on sglang, at the cost of KV-concurrency — a free win when the server isn't KV-bound.
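To make the two levers concrete, the sketch below computes how many prompt tokens still need prefill at a given cache hit rate and how many chunked-prefill passes that takes. It treats the hit rate as a token-level fraction and applies the fleet-wide 55% figure to a single hypothetical 24576-token prompt purely for illustration; real per-request hit rates vary.

```python
import math


def uncached_prefill_tokens(prompt_tokens: int, hit_rate: float) -> int:
    """Tokens that still need prefill when `hit_rate` of the prompt is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return round(prompt_tokens * (1.0 - hit_rate))


def prefill_chunks(tokens: int, chunk_size: int) -> int:
    """Number of chunked-prefill passes needed to cover `tokens`."""
    return math.ceil(tokens / chunk_size)


if __name__ == "__main__":
    prompt = 24576  # hypothetical prompt length, not a figure from this card
    remaining = uncached_prefill_tokens(prompt, 0.55)
    print("tokens to prefill after cache:", remaining)
    for chunk in (8192, 16384):
        print(f"chunk={chunk}: {prefill_chunks(prompt, chunk)} passes uncached")
```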

Bottom line

W4A8 is the best DeepSeek-V4-Flash config we measured on Hopper at TP2, with the lowest prefill TTFT and the highest per-GPU throughput. It ties its W4A16 sibling (so the 2× hypothesis failed), but it beats the FP4-marlin config that ships in production by ~9–13% on the same footprint. The practical catch is that this checkpoint format loads on vLLM only, so capturing that win over an sglang FP4 deployment means an engine switch, not a config swap. The dominant serving lever remains prefix caching (~55% radix hit rate in production); larger absolute-latency wins beyond this need Blackwell (native NVFP4 + FP4 indexer).

To-do

  • Bench TP4 for this checkpoint. W4A8-TP2 already leads the matrix on TTFT and per-GPU; W4A8-TP4 is the strongest candidate to beat the FP8-TP4 892 ms absolute TTFT while keeping INT4 weight footprint. (Not yet run.)

Reproducibility

  • Weights: byte-identical to canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP.
  • Transformation: the single config.json input_activations edit shown above (see the _w4a8_conversion provenance key).
  • To rebuild: take the W4A16 base, apply the config edit, serve with the vLLM nightly + patches above.

Acknowledgements

Built and benchmarked by Evrard Nil with Claude (2026-06). Base quantization by canada-quant; original model by DeepSeek-AI.
