DeepSeek-V4-Flash-DSpark
Stock DeepSeek-V4-Flash with DSpark speculative decoding. Faster than MTP on 2-4x RTX Pro 6000. By Fraser Price.
The official DeepSeek-V4-Flash (FP8 weights + FP4 experts, unmodified) with DeepSeek's DSpark draft module bolted on. DSpark (from DeepSpec) is block-wise speculative decoding: it drafts several tokens per step through a small Markov head, so single-stream decode is ~1.15-1.25x faster than the model's stock MTP at identical quality (speculative decoding is lossless). No released engine ships DSpark yet, so it runs on a purpose-built vLLM image (fraserpricee/vllm:dspark-cu132-20260627, source github.com/fraserprice/dspark-vllm).
Related models
| Repo | What |
|---|---|
| deepseek-ai/DeepSeek-V4-Flash | the official base model |
| DeepSeek-V4-Flash-DSpark | this repo (stock + DSpark) |
| DeepSeek-V4-Flash-Abliterated-DSpark | abliterated (uncensored) + DSpark |
Quick start
Needs Docker with the NVIDIA Container Toolkit and 2-4 GPUs (see Hardware).
curl -fsSL https://huggingface.co/fraserprice/DeepSeek-V4-Flash-DSpark/resolve/main/run.sh -o run.sh && bash run.sh
That one command grabs the launcher, downloads the weights (~167 GB) into your Hugging Face cache, pulls the inference image, and serves an OpenAI-compatible API on http://localhost:8000/v1. No other files or arguments needed. On 4 GPUs:
GPUS=0,1,2,3 TP=4 bash run.sh
Everything is overridable by env var:
| Variable | Default | Meaning |
|---|---|---|
GPUS |
0,1 |
GPU indices (comma-separated) |
TP |
2 |
Tensor-parallel size; set to the number of GPUs |
DSPARK_TOKENS |
4 |
DSpark draft tokens per step |
PORT |
8000 |
API port |
MAX_MODEL_LEN |
262144 |
Context length |
GPU_MEM_UTIL |
0.96 |
Fraction of VRAM to use |
HF_REPO |
fraserprice/DeepSeek-V4-Flash-DSpark |
repo to download |
MODEL_DIR |
(HF cache) | serve a local dir instead of downloading |
SERVED_NAME |
DeepSeek-V4-Flash-DSpark |
API model name |
IMAGE |
(pinned DSpark image) | inference container |
Hardware
Tuned for NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120). Needs roughly 180 GB total VRAM for weights + DSpark module + KV cache at long context, split across GPUs by tensor parallelism.
| Setup | Works? |
|---|---|
2 x RTX Pro 6000 (96 GB), TP=2 |
✅ reference config |
4 x RTX Pro 6000, TP=4 |
✅ more KV headroom / throughput |
| < ~180 GB total VRAM | ❌ won't fit |
The serving image targets Blackwell (sm_120) / CUDA 13.2 and adds the dspark speculative method to vLLM. See github.com/fraserprice/dspark-vllm for the overlay source and build.
Performance
DSpark vs the stock single-layer MTP, same image and settings: RTX Pro 6000 Blackwell (PCIe 5.0), FP8 weights + FP8 KV cache, gpu_memory_utilization=0.96, 128 generated tokens/request, DSpark num_speculative_tokens=4 vs MTP =2. PP is prefill throughput (tok/s), identical for both since speculative decoding only affects decode; TG is per-request decode throughput (tok/s); Total is aggregate across concurrent requests.
These numbers were measured on the abliterated DSpark twin and apply identically here: same architecture, precision, and DSpark module. Abliteration is a rank-3 projection on the base weights that does not change compute.
| GPUs | prompt | conc | PP | MTP TG | DSpark TG | Decode ↑ | MTP total | DSpark total |
|---|---|---|---|---|---|---|---|---|
| 2 | 1,000 | 1 | 8,292 | 190.9 | 225.1 | 1.18× | 1,449 | 1,660 |
| 2 | 1,000 | 3 | 8,626 | 138.4 | 159.5 | 1.15× | 2,854 | 3,122 |
| 2 | 10,000 | 1 | 7,021 | 191.5 | 228.4 | 1.19× | 4,853 | 6,648 |
| 2 | 10,000 | 3 | 10,240 | 90.3 | 97.6 | 1.08× | 8,022 | 8,162 |
| 2 | 100,000 | 1 | 8,852 | 185.4 | 214.5 | 1.16× | 8,355 | 8,371 |
| 2 | 100,000 | 3 | 8,794 | 60.4 | 67.5 | 1.12× | 8,610 | 8,687 |
| 4 | 1,000 | 1 | 9,655 | 233.3 | 285.9 | 1.23× | 1,757 | 2,093 |
| 4 | 1,000 | 3 | 10,482 | 181.9 | 216.3 | 1.19× | 3,632 | 4,107 |
| 4 | 10,000 | 1 | 7,898 | 234.5 | 289.8 | 1.24× | 5,604 | 6,046 |
| 4 | 10,000 | 3 | 12,309 | 115.0 | 130.1 | 1.13× | 9,798 | 10,678 |
| 4 | 100,000 | 1 | 9,627 | 232.8 | 279.4 | 1.20× | 9,156 | 10,350 |
| 4 | 100,000 | 3 | 9,399 | 74.7 | 91.5 | 1.22× | 9,231 | 9,788 |
~1.15-1.25x faster single-stream decode than the shipped MTP, at the same (lossless) quality.
About this build
A local-inference DSpark implementation, tuned for 2-4 GPUs at low concurrency:
- No confidence-scheduled verification. Full DSpark uses the draft's prefix survival probabilities to dynamically pick how many tokens to verify per step, mostly a throughput win at high batch concurrency. This build verifies a fixed
DSPARK_TOKENSblock with standard probabilistic rejection. For local low-concurrency serving that costs little and still beats MTP; for high-volume production, use upstream DeepSpec. - Lossless. Accepted tokens are exactly the target model's, so output is identical to stock DeepSeek-V4-Flash. DSpark only changes speed, not the distribution.
Credits
- deepseek-ai/DeepSeek-V4-Flash: the base model (MIT).
- DeepSeek-AI / DeepSpec and DeepSeek-V4-Flash-DSpark: the DSpark technique and draft weights.
voipmonitor/vllm: the Blackwell / CUDA-13.2 vLLM base image.
License
MIT, inheriting from the base model.
Built by Fraser Price, @fraserpricee. Found this useful? A follow is appreciated.
- Downloads last month
- 448
Model tree for fraserprice/DeepSeek-V4-Flash-DSpark
Base model
deepseek-ai/DeepSeek-V4-Flash