DeepSeek-V4-Flash-DSpark

Stock DeepSeek-V4-Flash with DSpark speculative decoding. Faster than MTP on 2-4x RTX Pro 6000. By Fraser Price.

The official DeepSeek-V4-Flash (FP8 weights + FP4 experts, unmodified) with DeepSeek's DSpark draft module bolted on. DSpark (from DeepSpec) is block-wise speculative decoding: it drafts several tokens per step through a small Markov head, so single-stream decode is ~1.15-1.25x faster than the model's stock MTP at identical quality (speculative decoding is lossless). No released engine ships DSpark yet, so it runs on a purpose-built vLLM image (fraserpricee/vllm:dspark-cu132-20260627, source github.com/fraserprice/dspark-vllm).

Related models

Repo What
deepseek-ai/DeepSeek-V4-Flash the official base model
DeepSeek-V4-Flash-DSpark this repo (stock + DSpark)
DeepSeek-V4-Flash-Abliterated-DSpark abliterated (uncensored) + DSpark

Quick start

Needs Docker with the NVIDIA Container Toolkit and 2-4 GPUs (see Hardware).

curl -fsSL https://huggingface.co/fraserprice/DeepSeek-V4-Flash-DSpark/resolve/main/run.sh -o run.sh && bash run.sh

That one command grabs the launcher, downloads the weights (~167 GB) into your Hugging Face cache, pulls the inference image, and serves an OpenAI-compatible API on http://localhost:8000/v1. No other files or arguments needed. On 4 GPUs:

GPUS=0,1,2,3 TP=4 bash run.sh

Everything is overridable by env var:

Variable Default Meaning
GPUS 0,1 GPU indices (comma-separated)
TP 2 Tensor-parallel size; set to the number of GPUs
DSPARK_TOKENS 4 DSpark draft tokens per step
PORT 8000 API port
MAX_MODEL_LEN 262144 Context length
GPU_MEM_UTIL 0.96 Fraction of VRAM to use
HF_REPO fraserprice/DeepSeek-V4-Flash-DSpark repo to download
MODEL_DIR (HF cache) serve a local dir instead of downloading
SERVED_NAME DeepSeek-V4-Flash-DSpark API model name
IMAGE (pinned DSpark image) inference container

Hardware

Tuned for NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120). Needs roughly 180 GB total VRAM for weights + DSpark module + KV cache at long context, split across GPUs by tensor parallelism.

Setup Works?
2 x RTX Pro 6000 (96 GB), TP=2 ✅ reference config
4 x RTX Pro 6000, TP=4 ✅ more KV headroom / throughput
< ~180 GB total VRAM ❌ won't fit

The serving image targets Blackwell (sm_120) / CUDA 13.2 and adds the dspark speculative method to vLLM. See github.com/fraserprice/dspark-vllm for the overlay source and build.

Performance

DSpark vs the stock single-layer MTP, same image and settings: RTX Pro 6000 Blackwell (PCIe 5.0), FP8 weights + FP8 KV cache, gpu_memory_utilization=0.96, 128 generated tokens/request, DSpark num_speculative_tokens=4 vs MTP =2. PP is prefill throughput (tok/s), identical for both since speculative decoding only affects decode; TG is per-request decode throughput (tok/s); Total is aggregate across concurrent requests.

These numbers were measured on the abliterated DSpark twin and apply identically here: same architecture, precision, and DSpark module. Abliteration is a rank-3 projection on the base weights that does not change compute.

GPUs prompt conc PP MTP TG DSpark TG Decode ↑ MTP total DSpark total
2 1,000 1 8,292 190.9 225.1 1.18× 1,449 1,660
2 1,000 3 8,626 138.4 159.5 1.15× 2,854 3,122
2 10,000 1 7,021 191.5 228.4 1.19× 4,853 6,648
2 10,000 3 10,240 90.3 97.6 1.08× 8,022 8,162
2 100,000 1 8,852 185.4 214.5 1.16× 8,355 8,371
2 100,000 3 8,794 60.4 67.5 1.12× 8,610 8,687
4 1,000 1 9,655 233.3 285.9 1.23× 1,757 2,093
4 1,000 3 10,482 181.9 216.3 1.19× 3,632 4,107
4 10,000 1 7,898 234.5 289.8 1.24× 5,604 6,046
4 10,000 3 12,309 115.0 130.1 1.13× 9,798 10,678
4 100,000 1 9,627 232.8 279.4 1.20× 9,156 10,350
4 100,000 3 9,399 74.7 91.5 1.22× 9,231 9,788

~1.15-1.25x faster single-stream decode than the shipped MTP, at the same (lossless) quality.

About this build

A local-inference DSpark implementation, tuned for 2-4 GPUs at low concurrency:

  • No confidence-scheduled verification. Full DSpark uses the draft's prefix survival probabilities to dynamically pick how many tokens to verify per step, mostly a throughput win at high batch concurrency. This build verifies a fixed DSPARK_TOKENS block with standard probabilistic rejection. For local low-concurrency serving that costs little and still beats MTP; for high-volume production, use upstream DeepSpec.
  • Lossless. Accepted tokens are exactly the target model's, so output is identical to stock DeepSeek-V4-Flash. DSpark only changes speed, not the distribution.

Credits

License

MIT, inheriting from the base model.


Built by Fraser Price, @fraserpricee. Found this useful? A follow is appreciated.

Downloads last month
448
Safetensors
Model size
165B params
Tensor type
BF16
·
I64
·
F32
·
F8_E8M0
·
F8_E4M3
·
I8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for fraserprice/DeepSeek-V4-Flash-DSpark

Quantized
(84)
this model