DeepSeek-V4-Flash — W4A8 (INT4 weights + FP8 dynamic-token activations)

A W4A8 quantization of DeepSeek-V4-Flash: INT4 group-quantized MoE expert weights with FP8 (e4m3) dynamic per-token activations, plus FP8 block-quantized attention/dense layers. Produced as a zero-cost config transformation of canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP — the INT4 weight bytes are identical; only the activation quantization scheme in config.json changed (experts input_activations: null → FP8 dynamic-token).

⚠️ Honest headline first: on H200 (Hopper / SM90) this was the fastest TP2 configuration in our sweep, with the best TP2 prefill TTFT (1658 ms @24k) and the highest per-GPU prefill throughput (7410 tok/s/GPU) of every cell tested. It ties its W4A16 parent: the ~2% gap is within run-to-run noise, so the "W4A8 should be ~1.5–2× faster than W4A16" hypothesis was refuted. It does beat the FP4-marlin config by ~9–13% on the same 2×H200 footprint (int4→Marlin > nvfp4→Marlin). One caveat: it is vLLM-only (sglang can't load this checkpoint format), so it isn't a drop-in for an sglang deployment. See Investigation & findings.

📦 This is a config / recipe repository: the weight shards are NOT included. Because the W4A8 transformation reuses the base's INT4 weights byte-for-byte, duplicating ~159 GB here would be pure waste. This repo ships the W4A8 config.json, tokenizer, weight index, and this card. To get a runnable checkpoint, pull the weights from the base and drop in this config.json (two download commands; see Getting the weights).

What this is


| Property | Value |
|---|---|
| Base architecture | DeepSeek-V4-Flash (284B total / ~13B active MoE, 43 layers, 256 routed experts top-6 + 1 shared, MLA, hybrid sparse attention + Lightning indexer) |
| Derived from | `canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP` (identical INT4 expert weights) |
| MoE experts | INT4 group-quantized weights + FP8 e4m3 dynamic per-token activations (W4A8) |
| Attention / dense | FP8 block-quantized weights (unchanged from base) |
| `format` | `mixed-precision` (compressed-tensors) |
| Footprint | ~159 GB materialized, fits TP2 on 2×H200 (identical to the W4A16 base). Weights not stored here; see Getting the weights. |
| Target hardware | NVIDIA Hopper (H100/H200, SM90) |
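
To make the "FP8 e4m3 dynamic per-token activations" entry concrete, here is a minimal pure-Python sketch of symmetric per-token quantization: each token (row) gets its own scale so that its largest magnitude maps to 448, the largest finite e4m3 value. The function names are illustrative and are not taken from vLLM or compressed-tensors; the real kernels do this on the GPU in a fused form.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 e4m3 (e4m3fn) format


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest representable e4m3 value (saturating)."""
    if value == 0.0:
        return 0.0
    magnitude = min(abs(value), E4M3_MAX)
    _, exp = math.frexp(magnitude)      # magnitude = m * 2**exp, 0.5 <= m < 1
    leading = max(exp - 1, -6)          # clamp to the smallest normal exponent
    step = 2.0 ** (leading - 3)         # 3 mantissa bits -> 8 steps per binade
    rounded = min(round(magnitude / step) * step, E4M3_MAX)
    return math.copysign(rounded, value)


def quantize_per_token(activations):
    """Symmetric, dynamic, per-token quantization: one scale per row."""
    quantized, scales = [], []
    for token in activations:
        max_abs = max((abs(v) for v in token), default=0.0)
        # An all-zero row keeps scale 1.0 to avoid dividing by zero.
        scale = max_abs / E4M3_MAX if max_abs > 0.0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in token])
        scales.append(scale)
    return quantized, scales
```

"Dynamic" here means the scales are computed from the live activations at inference time, which is why the conversion needs no calibration data.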

How it was made

DeepSeek-V4-Flash's MoE experts are stored as INT4. A W4A16 checkpoint runs those INT4 weights through a Marlin dequant→BF16 GEMM; a W4A8 checkpoint instead pairs the same INT4 weights with FP8 activations, so vLLM dispatches them to the native CutlassExpertsW4A8Fp8 kernel on SM90 (_is_fp8_w4a8_sm90).

Because the weights are unchanged, the conversion is a pure config.json edit — no re-quantization, no calibration:

// experts config group, input_activations: null  ->
"input_activations": {
  "num_bits": 8, "type": "float", "strategy": "token",
  "dynamic": true, "symmetric": true
}

The _w4a8_conversion key in config.json records this provenance.
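
The edit can also be sketched programmatically. The snippet below assumes the usual compressed-tensors layout, `quantization_config.config_groups.<group>`, and takes the name of the experts group as a parameter, because the actual group name in this checkpoint's config.json is not stated in this card. `to_w4a8` is a hypothetical helper, not part of any library, and it does not write the `_w4a8_conversion` provenance key.

```python
import copy

# FP8 dynamic per-token activation spec, copied from the edit shown above.
FP8_DYNAMIC_TOKEN = {
    "num_bits": 8, "type": "float", "strategy": "token",
    "dynamic": True, "symmetric": True,
}


def to_w4a8(config: dict, expert_group: str) -> dict:
    """Return a copy of `config` with FP8 activations on one config group.

    Assumes config["quantization_config"]["config_groups"][<group>]; the
    group name is an assumption and must be read from the real config.json.
    """
    new_config = copy.deepcopy(config)
    group = new_config["quantization_config"]["config_groups"][expert_group]
    if group.get("input_activations") is not None:
        raise ValueError(f"{expert_group} already has activation quantization")
    group["input_activations"] = dict(FP8_DYNAMIC_TOKEN)
    return new_config
```

The function returns a modified copy and leaves the weight entries of the group untouched, mirroring the point that only the activation scheme changes.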

Getting the weights

The INT4 weight shards are identical to the base. Materialize a full checkpoint by downloading the base weights and overwriting config.json with this repo's W4A8 config:

# 1. base weights (INT4 shards, tokenizer) — the actual ~159 GB
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP --local-dir dsv4-w4a8

# 2. this repo's W4A8 config + card (the only real diff)
hf download endnai/DeepSeek-V4-Flash-W4A8-FP8 config.json README.md --local-dir dsv4-w4a8

# dsv4-w4a8/ is now a complete W4A8 checkpoint (INT4 weights + FP8-activation config)

The .safetensors bytes are unchanged; only the expert input_activations entry in config.json differs (see How it was made).
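
One way to check that claim after materializing the directory is to diff the two config files key by key. The helper below is a small sketch that works on any pair of nested dicts; `diff_paths` is a hypothetical name, and the exact paths it reports depend on the real config layout, which this card does not spell out.

```python
def diff_paths(base, other, prefix=""):
    """List dotted key paths whose values differ between two nested dicts."""
    if isinstance(base, dict) and isinstance(other, dict):
        paths = []
        for key in sorted(set(base) | set(other)):
            child = f"{prefix}.{key}" if prefix else str(key)
            if key not in base or key not in other:
                paths.append(child)      # key exists on one side only
            else:
                paths.extend(diff_paths(base[key], other[key], child))
        return paths
    # Leaves (or a dict compared with a non-dict): report the path if unequal.
    return [] if base == other else [prefix]
```

Loading the base repo's config.json and this repo's config.json with `json.load` and passing both to `diff_paths` should, if the description above holds, report only the experts group's `input_activations` path plus the `_w4a8_conversion` provenance key.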

Serving (vLLM)

Requires a recent vLLM nightly and, at the time of writing, four small patches to load the DeepSeek-V4-Flash compressed-tensors checkpoint (these are model-loading fixes, not W4A8-specific — the same patches are needed for the W4A16 base on nightly):

- `packed_modules_mapping` for the model and MTP module (`fused_wqa_wkv`, `fused_wkv_wgate`, `gate_up_proj`).
- `hash_moe` added to the transformers `ALLOWED_LAYER_TYPES` global allowlist.
- `o_proj` weight-scale name alias (`weight_scale_inv` → `weight_scale`).

Launch (2×H200, TP2) from the materialized directory (see Getting the weights):

vllm serve ./dsv4-w4a8 \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --trust-remote-code

--disable-custom-all-reduce avoids a TP2 init hang under confidential-compute (custom all-reduce needs CUDA-IPC/symmetric memory, which is unavailable inside TDX CVMs).
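
Once the server is up, it can be queried through vLLM's OpenAI-compatible completions endpoint. The sketch below uses only the standard library. It assumes the default `vllm serve` port (8000) and that the served model name defaults to the path passed on the command line (`./dsv4-w4a8`), since no `--served-model-name` is set in the launch command; adjust both if your deployment differs.

```python
import json
import urllib.request


def build_completion_request(prompt, model="./dsv4-w4a8",
                             base_url="http://localhost:8000", max_tokens=64):
    """Build (but do not send) a POST to the OpenAI-compatible completions API."""
    body = {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": 0.0}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` against a running server returns the generated continuation; `temperature=0.0` matches the greedy setting used for the correctness probe below.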

Correctness: verified to match the W4A16 base on a temp=0 quality probe (3/3 GSM8K answers identical).

Investigation & findings

This checkpoint was built to test a hypothesis: the DeepSeek-V4-Flash prefill bottleneck is the INT4→BF16 Marlin MoE GEMM, so a W4A8 path (native FP8 activation GEMM) should be ~1.5–2× faster. The hypothesis was refuted. Full sweep on 2–8×H200 (TP2 unless noted), single-request prefill ladder (c=1), long-context (ISL up to 24k):

Headline: W4A8 leads the TP2 matrix, but ties W4A16

| Config | Engine | TP | Prefill TTFT @24k | Prefill tok/s/GPU @24k |
|---|---|---|---|---|
| W4A8 (this model) | vLLM | 2 | 1658 ms ⭐ | 7410 ⭐ |
| W4A16 (base) | vLLM | 2 | 1691 ms | 7267 |
| FP4 (marlin) | vLLM | 2 | 1824 ms | 7090 |
| FP4 (marlin) | sglang | 2 | 1894 ms | 6832 |
| FP8 (native) | sglang | 4 | 892 ms | 6888 |

W4A8 is the fastest TP2 config and has the highest per-GPU throughput of every cell measured. Three things to read carefully:

- vs W4A16 (its parent): a tie. 1658 vs 1691 ms is ~2%, within run-to-run noise. The specific hypothesis this checkpoint was built to test ("FP8-activation MoE GEMM should be ~1.5–2× faster than W4A16") was refuted. At prefill batch sizes the MoE is weight-bandwidth-bound, so activation precision doesn't move it, and Marlin-W4A16 already matches Cutlass-W4A8.
- vs FP4-marlin: a real ~9–13% win. int4→Marlin beats nvfp4→Marlin, so W4A8 (and W4A16) beat the FP4-marlin config. FP4-marlin is what production currently runs, so W4A8/W4A16 are meaningfully faster than the deployed config on the same 2-GPU footprint.
- vs FP8-TP4: the FP8-TP4 cell's low absolute TTFT (892 ms) comes from tensor-parallel scaling (2× the GPUs); per-GPU, W4A8-TP2 still wins (7410 > 6888).
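
The percentages quoted above follow directly from the TTFT column of the table. A short sketch of the arithmetic, with the dictionary keys chosen here purely as labels:

```python
# Prefill TTFT at 24k input tokens in milliseconds, copied from the table above.
TTFT_MS = {
    "W4A8 vLLM TP2": 1658,
    "W4A16 vLLM TP2": 1691,
    "FP4 vLLM TP2": 1824,
    "FP4 sglang TP2": 1894,
}


def ttft_reduction_pct(candidate_ms: float, baseline_ms: float) -> float:
    """Percent of the baseline's TTFT that the candidate saves."""
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms


w4a8 = TTFT_MS["W4A8 vLLM TP2"]
reductions = {name: round(ttft_reduction_pct(w4a8, ms), 1)
              for name, ms in TTFT_MS.items() if ms != w4a8}
# reductions == {"W4A16 vLLM TP2": 2.0, "FP4 vLLM TP2": 9.1, "FP4 sglang TP2": 12.5}
```

The 2.0% figure is the W4A16 tie, and the 9.1% and 12.5% figures bracket the "~9–13%" win over FP4-marlin on vLLM and sglang respectively.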

Per-GPU throughput spans a narrow ~6.8–7.4k tok/s/GPU band across all cells — the architecture sets a ceiling — but within that band W4A8 sits at the top.

TP4 for this checkpoint has not been benchmarked yet; see To-do. Given that W4A8-TP2 already leads on both TTFT and per-GPU throughput, W4A8-TP4 is the most likely config to beat the FP8-TP4 absolute latency of 892 ms.

Why the activation-precision lever doesn't help

At prefill batch sizes, the DeepSeek-V4-Flash MoE (top-6 of 256 small experts) is weight-bandwidth-bound, not compute-bound on the expert GEMM. INT4 weights are already the bandwidth-optimal format, and Marlin's INT4→BF16 path already matches the Cutlass W4A8 kernel in practice. Switching activations from BF16/FP8-implicit to FP8 changes the activation precision but not the dominant cost. The compute-bound portion of prefill is dominated by format-shared work — FP8-block MLA attention and the sparse / Lightning-indexer passes over long context — which is identical across all three checkpoints.

The prefill ceiling is architectural on Hopper

- Prefill scales linearly above ~8k tokens (~+547 ms per +8k) with GPUs at ~100% util and ~690 W (near TDP) → tensor-core-bound, not launch- or attention-quadratic-bound.
- The two kernel improvements that would help, native NVFP4 MoE GEMM and the FP4 Lightning-indexer cache, are Blackwell-only (SM100). On Hopper, sglang/vLLM fall back to Marlin.
- A W4A8 SM90 grouped-GEMM tuned for the DeepSeek-V4 MoE path is unimplemented upstream (relevant issues closed as inactive). Even so, the W4A8/W4A16 tie above suggests it would offer little at prefill batch sizes.
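
The linear regime can be turned into a rough estimator. This is only a sketch: the card does not say which config the ~547 ms per 8k slope was measured on, so pairing it with any particular anchor point (for example the W4A8 24k measurement of 1658 ms) is an assumption, and the outputs are extrapolations, not measurements.

```python
SLOPE_MS_PER_8K = 547.0   # "+547 ms per +8k" from the text
LINEAR_FROM_K = 8.0       # the linear regime starts around 8k tokens


def extrapolate_ttft_ms(context_k, anchor_k, anchor_ms):
    """Linear TTFT estimate from one measured (context, TTFT) anchor point.

    Contexts are in thousands of tokens; both points must lie in the
    linear regime for the estimate to make sense.
    """
    if min(context_k, anchor_k) < LINEAR_FROM_K:
        raise ValueError("linear model only applies above ~8k tokens")
    return anchor_ms + SLOPE_MS_PER_8K * (context_k - anchor_k) / 8.0
```

For example, `extrapolate_ttft_ms(32, 24, 1658)` gives 2205.0 ms under these assumptions, which is simply one more 8k step added to the 24k anchor.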

What does move the needle (deployment)

- Prefix caching is the dominant lever: in production, DeepSeek-V4-Flash realizes a ~55% radix prefix-cache hit rate on real agent/RAG traffic (measured over 24h), i.e. more than half of all prefill is skipped. This is already captured by sglang RadixAttention in production.
- Larger chunked-prefill (8192 → 16384) gives ~7% faster long-context prefill TTFT on sglang, at the cost of KV-concurrency. It is a free win when the server isn't KV-bound.
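
To see why the cache hit rate outweighs a few percent of kernel speed, consider a back-of-the-envelope model. It assumes the hit rate is a token-weighted fraction and that prefill cost is proportional to the number of uncached tokens; both are simplifications, and the function names are illustrative.

```python
def effective_prefill_tokens(prompt_tokens: int, hit_rate: float) -> float:
    """Tokens that still need prefill after prefix-cache hits are skipped."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be within [0, 1]")
    return prompt_tokens * (1.0 - hit_rate)


def prefill_speedup(hit_rate: float) -> float:
    """Ideal prefill-work speedup if cost is proportional to uncached tokens."""
    return float("inf") if hit_rate >= 1.0 else 1.0 / (1.0 - hit_rate)
```

Under this model a 55% hit rate leaves 45% of the prompt to prefill, an ideal reduction in prefill work of roughly 2.2×, compared with the ~7% from larger chunked-prefill or the ~9–13% from the quantization format.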

Bottom line

W4A8 is the best-measured DeepSeek-V4-Flash config on Hopper at TP2, with the top prefill TTFT and the top per-GPU throughput. It ties its W4A16 sibling (so the ~1.5–2× hypothesis failed), but it beats the FP4-marlin config that ships in production by ~9–13% on the same footprint. The practical catch is that this checkpoint format loads on vLLM only, so capturing that win over an sglang FP4 deployment means an engine switch, not a config swap. The dominant serving lever remains prefix caching (~55% radix hit rate in production); larger absolute-latency wins beyond this need Blackwell (native NVFP4 + FP4 indexer).

To-do

Benchmark TP4 for this checkpoint. W4A8-TP2 already leads the matrix on TTFT and per-GPU throughput; W4A8-TP4 is the strongest candidate to beat the FP8-TP4 absolute TTFT of 892 ms while keeping the INT4 weight footprint. (Not yet run.)

Reproducibility

Weights: byte-identical to canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP.
Transformation: the single config.json input_activations edit shown above (see the _w4a8_conversion provenance key).
To rebuild: take the W4A16 base, apply the config edit, serve with the vLLM nightly + patches above.

Acknowledgements

Built and benchmarked by Evrard Nil with Claude (2026-06). Base quantization by canada-quant; original model by DeepSeek-AI.
