DeepSeek-V4-Flash — W4A8 (INT4 weights + FP8 dynamic-token activations)
A W4A8 quantization of DeepSeek-V4-Flash: INT4 group-quantized MoE expert weights with FP8 (e4m3) dynamic per-token activations, plus FP8 block-quantized attention/dense layers. Produced as a zero-cost config transformation of canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP — the INT4 weight bytes are identical; only the activation quantization scheme in config.json changed (experts input_activations: null → FP8 dynamic-token).
⚠️ Honest headline first: on H200 (Hopper / SM90) this was the fastest TP2 config in our sweep, with the best TP2 prefill TTFT (1658 ms @24k) and the highest per-GPU prefill throughput (7410 tok/s/GPU) of every cell tested. It ties its W4A16 parent (~2% apart, within run-to-run noise; the "W4A8 should be ~2× faster than W4A16" hypothesis was refuted), but it beats the FP4-marlin config by ~9–13% on the same 2×H200 footprint (int4→Marlin > nvfp4→Marlin). One caveat: it is vLLM-only (sglang can't load this checkpoint format), so it isn't a drop-in replacement for an sglang deployment. See Investigation & findings.
📦 This is a config / recipe repository; the weight shards are NOT included. Because the W4A8 transformation reuses the base's INT4 weights byte-for-byte, duplicating ~159 GB here would be pure waste. This repo ships the W4A8 config.json, tokenizer, weight index, and this card. To get a runnable checkpoint, pull the weights from the base and drop in this config.json (see Getting the weights).
What this is
| Property | Value |
|---|---|
| Base architecture | DeepSeek-V4-Flash (284B total / ~13B active MoE, 43 layers, 256 routed experts top-6 + 1 shared, MLA, hybrid sparse attention + Lightning indexer) |
| Derived from | canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (identical INT4 expert weights) |
| MoE experts | INT4 group-quantized weights + FP8 e4m3 dynamic per-token activations (W4A8) |
| Attention / dense | FP8 block-quantized weights (unchanged from base) |
| format | mixed-precision (compressed-tensors) |
| Footprint | ~159 GB materialized, fits TP2 on 2×H200 (identical to the W4A16 base). Weights are not stored here; see Getting the weights. |
| Target hardware | NVIDIA Hopper (H100/H200, SM90) |
How it was made
DeepSeek-V4-Flash's MoE experts are stored as INT4. A W4A16 checkpoint runs those INT4 weights through a Marlin dequant→BF16 GEMM; a W4A8 checkpoint instead pairs the same INT4 weights with FP8 activations, so vLLM dispatches them to the native CutlassExpertsW4A8Fp8 kernel on SM90 (_is_fp8_w4a8_sm90).
Because the weights are unchanged, the conversion is a pure config.json edit — no re-quantization, no calibration:
// experts config group, input_activations: null ->
"input_activations": {
"num_bits": 8, "type": "float", "strategy": "token",
"dynamic": true, "symmetric": true
}
The _w4a8_conversion key in config.json records this provenance.
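The edit above can be sketched in Python as a pure dictionary transformation. This is a minimal sketch, not the script that produced this repo: it assumes the compressed-tensors layout `quantization_config` → `config_groups` → `<group>`, and the group name passed in (for example `"group_0"`) is hypothetical, so check which group holds the expert weights in your own config.json.

```python
import copy

# FP8 (e4m3) dynamic per-token activation scheme, copied from the card.
FP8_DYNAMIC_TOKEN = {
    "num_bits": 8,
    "type": "float",
    "strategy": "token",
    "dynamic": True,
    "symmetric": True,
}


def to_w4a8(config: dict, group_name: str) -> dict:
    """Return a copy of `config` with FP8 activations set on one config group.

    Assumes config["quantization_config"]["config_groups"][group_name] exists
    (an assumption about the file layout). The group's weight settings are
    left untouched, which is why no re-quantization is involved.
    """
    new_config = copy.deepcopy(config)
    group = new_config["quantization_config"]["config_groups"][group_name]
    group["input_activations"] = dict(FP8_DYNAMIC_TOKEN)
    return new_config
```

To use it, load config.json with `json.load`, call `to_w4a8(config, "<your experts group>")`, and write the result back with `json.dump`. The input dictionary is not modified, so the original can be kept for comparison.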
Getting the weights
The INT4 weight shards are identical to the base. Materialize a full checkpoint by downloading the base weights and overwriting config.json with this repo's W4A8 config:
# 1. base weights (INT4 shards, tokenizer) — the actual ~159 GB
hf download canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP --local-dir dsv4-w4a8
# 2. this repo's W4A8 config + card (the only real diff)
hf download endnai/DeepSeek-V4-Flash-W4A8-FP8 config.json README.md --local-dir dsv4-w4a8
# dsv4-w4a8/ is now a complete W4A8 checkpoint (INT4 weights + FP8-activation config)
The .safetensors bytes are unchanged; only config.json's expert input_activations differ (see How it was made).
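One way to check the "bytes are unchanged" claim on your own disk is to hash the shards in two directories and compare. The sketch below is illustrative only: the function names are hypothetical, and it assumes the shards sit at the top level of each directory with a `.safetensors` suffix.

```python
import hashlib
from pathlib import Path


def shard_digests(checkpoint_dir: str) -> dict[str, str]:
    """Map each top-level *.safetensors file name to its SHA-256 hex digest."""
    digests = {}
    for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
        hasher = hashlib.sha256()
        with shard.open("rb") as handle:
            # Read in 1 MiB chunks so multi-GB shards never sit fully in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
        digests[shard.name] = hasher.hexdigest()
    return digests


def same_weights(dir_a: str, dir_b: str) -> bool:
    """True if both directories hold the same shard names with equal digests."""
    return shard_digests(dir_a) == shard_digests(dir_b)
```

For example, `same_weights("dsv4-w4a8", "path/to/base-download")` should return `True` if the only difference between the two directories is config.json. Note that two directories with no shards at all also compare equal, so confirm the digest maps are non-empty.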
Serving (vLLM)
Requires a recent vLLM nightly and, at the time of writing, four small patches to load the DeepSeek-V4-Flash compressed-tensors checkpoint (these are model-loading fixes, not W4A8-specific — the same patches are needed for the W4A16 base on nightly):
- packed_modules_mapping for the model and MTP module (fused_wqa_wkv, fused_wkv_wgate, gate_up_proj).
- hash_moe added to the transformers ALLOWED_LAYER_TYPES global allowlist.
- o_proj weight-scale name alias (weight_scale_inv → weight_scale).
Launch (2×H200, TP2) from the materialized directory (see Getting the weights):
vllm serve ./dsv4-w4a8 \
--tensor-parallel-size 2 \
--disable-custom-all-reduce \
--trust-remote-code
--disable-custom-all-reduce avoids a TP2 init hang under confidential-compute (custom all-reduce needs CUDA-IPC/symmetric memory, which is unavailable inside TDX CVMs).
Correctness: outputs match the W4A16 base on a small temperature-0 quality probe (3 of 3 GSM8K answers identical).
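A temperature-0 probe like the one above can be sent to the running server with the standard library alone. This is a hedged sketch rather than the probe script actually used: it assumes vLLM's OpenAI-compatible `/v1/completions` endpoint on the default `http://localhost:8000`, and that the served model name equals the path passed to `vllm serve` (`./dsv4-w4a8`), which to my knowledge is the default when no served model name is set.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a greedy (temperature 0) completion request for an OpenAI-style API."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding, so two checkpoints can be compared
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("http://localhost:8000", "./dsv4-w4a8", prompt)` against this checkpoint and against a server running the W4A16 base, with the same prompts, gives two strings that can be compared directly.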
Investigation & findings
This checkpoint was built to test a hypothesis: the DeepSeek-V4-Flash prefill bottleneck is the INT4→BF16 Marlin MoE GEMM, so a W4A8 path (native FP8 activation GEMM) should be ~1.5–2× faster. The hypothesis was refuted. Full sweep on 2–8×H200 (TP2 unless noted), single-request prefill ladder (c=1), long-context (ISL up to 24k):
Headline: W4A8 leads the TP2 matrix, but ties W4A16
| Config | Engine | TP | Prefill TTFT @24k | Prefill tok/s/GPU @24k |
|---|---|---|---|---|
| W4A8 (this model) | vLLM | 2 | 1658 ms ⭐ | 7410 ⭐ |
| W4A16 (base) | vLLM | 2 | 1691 ms | 7267 |
| FP4 (marlin) | vLLM | 2 | 1824 ms | 7090 |
| FP4 (marlin) | sglang | 2 | 1894 ms | 6832 |
| FP8 (native) | sglang | 4 | 892 ms | 6888 |
W4A8 is the fastest TP2 config and has the highest per-GPU throughput of every cell measured. Three things to read carefully:
- vs W4A16 (its parent): a tie — 1658 vs 1691 ms is ~2%, within run-to-run noise. The specific hypothesis this checkpoint was built to test — "FP8-activation MoE GEMM should be ~1.5–2× faster than W4A16" — was refuted. At prefill batch-M the MoE is weight-bandwidth-bound, so activation precision doesn't move it and Marlin-W4A16 already matches Cutlass-W4A8.
- vs FP4-marlin: a real ~9–13% win — int4→Marlin beats nvfp4→Marlin, so W4A8 (and W4A16) beat the FP4 base. FP4-marlin is what production currently runs, so W4A8/W4A16 are meaningfully faster than the deployed config on the same 2-GPU footprint.
- The FP8-TP4 cell's low absolute TTFT (892 ms) is tensor-parallel scaling (2× the GPUs); per-GPU, W4A8-TP2 still wins (7410 > 6888).
Per-GPU throughput spans a narrow ~6.8–7.4k tok/s/GPU band across all cells — the architecture sets a ceiling — but within that band W4A8 sits at the top.
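The percentages quoted above follow directly from the TTFT column. The short sketch below recomputes them; the figures are copied from the table, and expressing the gap as a reduction relative to the slower config is my reading of how "~2%" and "~9–13%" were derived.

```python
# Prefill TTFT at 24k context in milliseconds, copied from the table above.
W4A8_MS = 1658
BASELINES_MS = {
    "W4A16 (vLLM, TP2)": 1691,
    "FP4-marlin (vLLM, TP2)": 1824,
    "FP4-marlin (sglang, TP2)": 1894,
}


def ttft_reduction_pct(candidate_ms: float, baseline_ms: float) -> float:
    """Percentage by which the candidate's TTFT is lower than the baseline's."""
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms


for name, baseline_ms in BASELINES_MS.items():
    print(f"W4A8 vs {name}: {ttft_reduction_pct(W4A8_MS, baseline_ms):.1f}% lower TTFT")
# W4A8 vs W4A16 (vLLM, TP2): 2.0% lower TTFT
# W4A8 vs FP4-marlin (vLLM, TP2): 9.1% lower TTFT
# W4A8 vs FP4-marlin (sglang, TP2): 12.5% lower TTFT
```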
TP4 for this checkpoint is not yet benched — see To-do. Given W4A8-TP2 already leads on both TTFT and per-GPU, W4A8-TP4 is the most likely config to beat the FP8-TP4 892 ms absolute latency.
Why the activation-precision lever doesn't help
At prefill batch sizes, the DeepSeek-V4-Flash MoE (top-6 of 256 small experts) is weight-bandwidth-bound, not compute-bound on the expert GEMM. INT4 weights are already the bandwidth-optimal format, and Marlin's INT4→BF16 path already matches the Cutlass W4A8 kernel in practice. Switching activations from BF16/FP8-implicit to FP8 changes the activation precision but not the dominant cost. The compute-bound portion of prefill is dominated by format-shared work — FP8-block MLA attention and the sparse / Lightning-indexer passes over long context — which is identical across all three checkpoints.
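To make "prefill batch-M" concrete, the sketch below estimates how many token rows each routed expert's GEMM sees for a given prefill batch. It assumes perfectly uniform routing, which real traffic will not match exactly, and the batch sizes used are illustrative rather than measured settings.

```python
def mean_tokens_per_expert(batch_tokens: int, top_k: int = 6,
                           num_experts: int = 256) -> float:
    """Average rows (M) per routed-expert GEMM, assuming uniform routing.

    Each token is sent to `top_k` of `num_experts` routed experts, so the
    batch is spread over the experts with a top_k / num_experts share each.
    The shared expert is not counted here; it sees every token.
    """
    return batch_tokens * top_k / num_experts


print(mean_tokens_per_expert(8192))   # 192.0
print(mean_tokens_per_expert(24576))  # 576.0
```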
The prefill ceiling is architectural on Hopper
- Prefill scales linearly above 8k tokens (+547 ms per +8k) with GPUs at ~100% util and ~690 W (near TDP) → tensor-core-bound, not launch- or attention-quadratic-bound.
- The two kernel improvements that would help (native NVFP4 MoE GEMM and the FP4 Lightning-indexer cache) are Blackwell-only (SM100). On Hopper, sglang/vLLM fall back to Marlin.
- A W4A8 SM90 grouped-GEMM tuned for the DeepSeek-V4 MoE path is unimplemented upstream (relevant issues closed inactive). Even so, the wash above suggests it would offer little at prefill batch-M.
What does move the needle (deployment)
- Prefix caching is the dominant lever: in production, DeepSeek-V4-Flash realizes ~55% radix prefix-cache hit on real agent/RAG traffic (measured over 24h), i.e. more than half of all prefill is skipped. This is already captured by sglang RadixAttention in production.
- Larger chunked-prefill (8192 → 16384) gives ~7% faster long-context prefill TTFT on sglang, at the cost of KV-concurrency — a free win when the server isn't KV-bound.
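A rough way to see why these two levers matter is to count how much prefill work remains after caching and how many chunks it is split into. The sketch below is a simplification under stated assumptions: it treats the ~55% hit rate as a token-level fraction applied to a single request (real hit rates vary per request), and it takes 24k to mean 24576 tokens.

```python
import math


def computed_prefill_tokens(prompt_tokens: int, cache_hit_rate: float) -> float:
    """Tokens that still need prefill after the prefix cache serves its share."""
    return prompt_tokens * (1.0 - cache_hit_rate)


def prefill_chunks(tokens: int, chunk_size: int) -> int:
    """Number of chunked-prefill passes needed to cover `tokens`."""
    return math.ceil(tokens / chunk_size)


prompt = 24576  # assumption: "24k" means 24 * 1024 tokens
print(round(computed_prefill_tokens(prompt, 0.55)))  # 11059
print(prefill_chunks(prompt, 8192))                  # 3
print(prefill_chunks(prompt, 16384))                 # 2
```

Fewer chunks means fewer passes over the model per request, which is consistent with the faster long-context TTFT reported for the larger chunk size; the exact gain still has to be measured.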
Bottom line
W4A8 is the best-measured DeepSeek-V4-Flash config on Hopper at TP2 — top prefill TTFT and top per-GPU throughput. It ties its W4A16 sibling (so the 2× hypothesis failed), but it beats the FP4-marlin config that ships in production by ~9–13% on the same footprint. The practical catch is that this checkpoint format loads on vLLM only, so capturing that win over an sglang FP4 deployment means an engine switch, not a config swap. The dominant serving lever remains prefix caching (55% radix hit in prod); larger absolute-latency wins beyond this need Blackwell (native NVFP4 + FP4 indexer).
To-do
- Bench TP4 for this checkpoint. W4A8-TP2 already leads the matrix on TTFT and per-GPU; W4A8-TP4 is the strongest candidate to beat the FP8-TP4 892 ms absolute TTFT while keeping INT4 weight footprint. (Not yet run.)
Reproducibility
- Weights: byte-identical to canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP.
- Transformation: the single config.json input_activations edit shown above (see the _w4a8_conversion provenance key).
- To rebuild: take the W4A16 base, apply the config edit, serve with the vLLM nightly + patches above.
Acknowledgements
Built and benchmarked by Evrard Nil with Claude (2026-06). Base quantization by canada-quant; original model by DeepSeek-AI.