DeepSeek-V4-Flash-DSpark

Stock DeepSeek-V4-Flash with DSpark speculative decoding. Faster than MTP on 2-4x RTX Pro 6000. By Fraser Price.

The official DeepSeek-V4-Flash (FP8 weights + FP4 experts, unmodified) with DeepSeek's DSpark draft module bolted on. DSpark (from DeepSpec) is block-wise speculative decoding: it drafts several tokens per step through a small Markov head, so single-stream decode is ~1.15-1.25x faster than the model's stock MTP at identical quality (speculative decoding is lossless). No released engine ships DSpark yet, so it runs on a purpose-built vLLM image (fraserpricee/vllm:dspark-cu132-20260627, source github.com/fraserprice/dspark-vllm).

Related models

Repo	What
deepseek-ai/DeepSeek-V4-Flash	the official base model
DeepSeek-V4-Flash-DSpark	this repo (stock + DSpark)
DeepSeek-V4-Flash-Abliterated-DSpark	abliterated (uncensored) + DSpark

Quick start

Needs Docker with the NVIDIA Container Toolkit and 2-4 GPUs (see Hardware).

curl -fsSL https://huggingface.co/fraserprice/DeepSeek-V4-Flash-DSpark/resolve/main/run.sh -o run.sh && bash run.sh

That one command grabs the launcher, downloads the weights (~167 GB) into your Hugging Face cache, pulls the inference image, and serves an OpenAI-compatible API on http://localhost:8000/v1. No other files or arguments needed. On 4 GPUs:

GPUS=0,1,2,3 TP=4 bash run.sh

Everything is overridable by env var:

Variable	Default	Meaning
`GPUS`	`0,1`	GPU indices (comma-separated)
`TP`	`2`	Tensor-parallel size; set to the number of GPUs
`DSPARK_TOKENS`	`4`	DSpark draft tokens per step
`PORT`	`8000`	API port
`MAX_MODEL_LEN`	`262144`	Context length
`GPU_MEM_UTIL`	`0.96`	Fraction of VRAM to use
`HF_REPO`	`fraserprice/DeepSeek-V4-Flash-DSpark`	repo to download
`MODEL_DIR`	(HF cache)	serve a local dir instead of downloading
`SERVED_NAME`	`DeepSeek-V4-Flash-DSpark`	API model name
`IMAGE`	(pinned DSpark image)	inference container

Hardware

Tuned for NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120). Needs roughly 180 GB total VRAM for weights + DSpark module + KV cache at long context, split across GPUs by tensor parallelism.

Setup	Works?
2 x RTX Pro 6000 (96 GB), `TP=2`	✅ reference config
4 x RTX Pro 6000, `TP=4`	✅ more KV headroom / throughput
< ~180 GB total VRAM	❌ won't fit

The serving image targets Blackwell (sm_120) / CUDA 13.2 and adds the dspark speculative method to vLLM. See github.com/fraserprice/dspark-vllm for the overlay source and build.

Performance

DSpark vs the stock single-layer MTP, same image and settings: RTX Pro 6000 Blackwell (PCIe 5.0), FP8 weights + FP8 KV cache, gpu_memory_utilization=0.96, 128 generated tokens/request, DSpark num_speculative_tokens=4 vs MTP =2. PP is prefill throughput (tok/s), identical for both since speculative decoding only affects decode; TG is per-request decode throughput (tok/s); Total is aggregate across concurrent requests.

These numbers were measured on the abliterated DSpark twin and apply identically here: same architecture, precision, and DSpark module. Abliteration is a rank-3 projection on the base weights that does not change compute.

GPUs	prompt	conc	PP	MTP TG	DSpark TG	Decode ↑	MTP total	DSpark total
2	1,000	1	8,292	190.9	225.1	1.18×	1,449	1,660
2	1,000	3	8,626	138.4	159.5	1.15×	2,854	3,122
2	10,000	1	7,021	191.5	228.4	1.19×	4,853	6,648
2	10,000	3	10,240	90.3	97.6	1.08×	8,022	8,162
2	100,000	1	8,852	185.4	214.5	1.16×	8,355	8,371
2	100,000	3	8,794	60.4	67.5	1.12×	8,610	8,687
4	1,000	1	9,655	233.3	285.9	1.23×	1,757	2,093
4	1,000	3	10,482	181.9	216.3	1.19×	3,632	4,107
4	10,000	1	7,898	234.5	289.8	1.24×	5,604	6,046
4	10,000	3	12,309	115.0	130.1	1.13×	9,798	10,678
4	100,000	1	9,627	232.8	279.4	1.20×	9,156	10,350
4	100,000	3	9,399	74.7	91.5	1.22×	9,231	9,788

~1.15-1.25x faster single-stream decode than the shipped MTP, at the same (lossless) quality.

About this build

A local-inference DSpark implementation, tuned for 2-4 GPUs at low concurrency:

No confidence-scheduled verification. Full DSpark uses the draft's prefix survival probabilities to dynamically pick how many tokens to verify per step, mostly a throughput win at high batch concurrency. This build verifies a fixed DSPARK_TOKENS block with standard probabilistic rejection. For local low-concurrency serving that costs little and still beats MTP; for high-volume production, use upstream DeepSpec.
Lossless. Accepted tokens are exactly the target model's, so output is identical to stock DeepSeek-V4-Flash. DSpark only changes speed, not the distribution.

Credits

deepseek-ai/DeepSeek-V4-Flash: the base model (MIT).
DeepSeek-AI / DeepSpec and DeepSeek-V4-Flash-DSpark: the DSpark technique and draft weights.
voipmonitor/vllm: the Blackwell / CUDA-13.2 vLLM base image.

License

MIT, inheriting from the base model.

Built by Fraser Price, @fraserpricee. Found this useful? A follow is appreciated.

Downloads last month: 448

Safetensors

Model size

165B params

Tensor type

BF16

I64

F32

F8_E8M0

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for fraserprice/DeepSeek-V4-Flash-DSpark

Base model

deepseek-ai/DeepSeek-V4-Flash

Quantized

(84)

this model