RedHatAI/DeepSeek-V4-Flash-speculator.dflash

This is a DFlash speculator model for deepseek-ai/DeepSeek-V4-Flash.

The draft is a 5-layer all-sliding-window DFlash model (every layer uses sliding-window attention with a window of 2048) trained with the Muon optimizer. It consumes multi-stream hidden states from the base model (hc_mult=4) and predicts up to 7 speculative tokens.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k (~500K samples total). Hidden states were generated online by a live vLLM server serving DeepSeek-V4-Flash (data-parallel + expert-parallel), rather than pre-generated offline.

Configuration

Using the Speculators library.

Key hyperparameters:

Draft architecture DFlash (llama-style transformer), 5 layers
Attention all sliding-window, sliding_window=2048
Optimizer Muon (muon_lr=0.02, lr=6e-4)
Aux hidden-state layer IDs 3 13 23 32 42
hc_mult 4 (target hidden width 4 x 4096 = 16384)
Draft vocab size 32000
Block size 8
Max anchors 3072
Sequence length 8192
Speculative tokens 7

Model Specifications

Base Model deepseek-ai/DeepSeek-V4-Flash
Chat Template deepseek-ai/DeepSeek-V4-Flash (use /chat/completions endpoint)
Draft Layers 5 (all sliding-window, window 2048)
Optimizer Muon
Format Safetensors
Validation Hardware Nvidia H200

Deployment

Serving this speculator requires a build of vLLM with DeepSeek-V4-Flash DFlash speculative-decoding support (multi-stream hidden-state extraction and sliding-window DFlash drafting). Deploy with a speculative-decoding config pointing at this repo:

vllm serve deepseek-ai/DeepSeek-V4-Flash \
    --speculative-config '{
        "model": "RedHatAI/DeepSeek-V4-Flash-speculator.dflash",
        "num_speculative_tokens": 7,
        "method": "dflash"
    }'

Preliminary Evaluations

Held-out validation set, per-position token acceptance (greedy):

Metric Value
Position 1 78.8%
Position 2 58.8%
Position 3 45.4%
Position 4 35.9%
Position 5 29.0%
Position 6 23.6%
Position 7 19.3%
Full-sequence acceptance 41.6%
Validation loss 1.093
Approx. mean accepted length ~3.9 tokens

References

Paper: DFlash: Block Diffusion for Flash Speculative Decoding

Downloads last month
11
Safetensors
Model size
2B params
Tensor type
I64
·
BF16
·
BOOL
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for RedHatAI/DeepSeek-V4-Flash-speculator.dflash

Finetuned
(16)
this model

Paper for RedHatAI/DeepSeek-V4-Flash-speculator.dflash