RedHatAI/DeepSeek-V4-Flash-speculator.dflash

This is a DFlash speculator model for deepseek-ai/DeepSeek-V4-Flash.

The draft is a 5-layer all-sliding-window DFlash model (every layer uses sliding-window attention with a window of 2048) trained with the Muon optimizer. It consumes multi-stream hidden states from the base model (hc_mult=4) and predicts up to 7 speculative tokens.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k (~500K samples total). Hidden states were generated online by a live vLLM server serving DeepSeek-V4-Flash (data-parallel + expert-parallel), rather than pre-generated offline.

Configuration

Using the Speculators library.

Key hyperparameters:


Draft architecture	DFlash (llama-style transformer), 5 layers
Attention	all sliding-window, `sliding_window=2048`
Optimizer	Muon (`muon_lr=0.02`, `lr=6e-4`)
Aux hidden-state layer IDs	`3 13 23 32 42`
`hc_mult`	4 (target hidden width `4 x 4096 = 16384`)
Draft vocab size	32000
Block size	8
Max anchors	3072
Sequence length	8192
Speculative tokens	7

Model Specifications


Base Model	deepseek-ai/DeepSeek-V4-Flash
Chat Template	deepseek-ai/DeepSeek-V4-Flash (use `/chat/completions` endpoint)
Draft Layers	5 (all sliding-window, window 2048)
Optimizer	Muon
Format	Safetensors
Validation Hardware	Nvidia H200

Deployment

Serving this speculator requires a build of vLLM with DeepSeek-V4-Flash DFlash speculative-decoding support (multi-stream hidden-state extraction and sliding-window DFlash drafting). Deploy with a speculative-decoding config pointing at this repo:

vllm serve deepseek-ai/DeepSeek-V4-Flash \
    --speculative-config '{
        "model": "RedHatAI/DeepSeek-V4-Flash-speculator.dflash",
        "num_speculative_tokens": 7,
        "method": "dflash"
    }'

Preliminary Evaluations

Held-out validation set, per-position token acceptance (greedy):

Metric	Value
Position 1	78.8%
Position 2	58.8%
Position 3	45.4%
Position 4	35.9%
Position 5	29.0%
Position 6	23.6%
Position 7	19.3%
Full-sequence acceptance	41.6%
Validation loss	1.093
Approx. mean accepted length	~3.9 tokens

References

Paper: DFlash: Block Diffusion for Flash Speculative Decoding

Downloads last month: 11

Safetensors

Model size

2B params

Tensor type

I64

BF16

BOOL

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for RedHatAI/DeepSeek-V4-Flash-speculator.dflash

Base model

deepseek-ai/DeepSeek-V4-Flash

Finetuned

(16)

this model

Paper for RedHatAI/DeepSeek-V4-Flash-speculator.dflash

DFlash: Block Diffusion for Flash Speculative Decoding

Paper • 2602.06036 • Published Feb 5 • 89