RedHatAI/DeepSeek-V4-Flash-speculator.dflash
This is a DFlash speculator model for deepseek-ai/DeepSeek-V4-Flash.
The draft is a 5-layer all-sliding-window DFlash model (every layer uses sliding-window
attention with a window of 2048) trained with the Muon optimizer. It consumes multi-stream
hidden states from the base model (hc_mult=4) and predicts up to 7 speculative tokens.
Training Details
This model was trained using the Speculators
library on a subset of
Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered
and the train_sft split of
HuggingFaceH4/ultrachat_200k
(~500K samples total). Hidden states were generated online by a live vLLM server serving
DeepSeek-V4-Flash (data-parallel + expert-parallel), rather than pre-generated offline.
Configuration
Using the Speculators library.
Key hyperparameters:
| Draft architecture | DFlash (llama-style transformer), 5 layers |
| Attention | all sliding-window, sliding_window=2048 |
| Optimizer | Muon (muon_lr=0.02, lr=6e-4) |
| Aux hidden-state layer IDs | 3 13 23 32 42 |
hc_mult |
4 (target hidden width 4 x 4096 = 16384) |
| Draft vocab size | 32000 |
| Block size | 8 |
| Max anchors | 3072 |
| Sequence length | 8192 |
| Speculative tokens | 7 |
Model Specifications
| Base Model | deepseek-ai/DeepSeek-V4-Flash |
| Chat Template | deepseek-ai/DeepSeek-V4-Flash (use /chat/completions endpoint) |
| Draft Layers | 5 (all sliding-window, window 2048) |
| Optimizer | Muon |
| Format | Safetensors |
| Validation Hardware | Nvidia H200 |
Deployment
Serving this speculator requires a build of vLLM with DeepSeek-V4-Flash DFlash speculative-decoding support (multi-stream hidden-state extraction and sliding-window DFlash drafting). Deploy with a speculative-decoding config pointing at this repo:
vllm serve deepseek-ai/DeepSeek-V4-Flash \
--speculative-config '{
"model": "RedHatAI/DeepSeek-V4-Flash-speculator.dflash",
"num_speculative_tokens": 7,
"method": "dflash"
}'
Preliminary Evaluations
Held-out validation set, per-position token acceptance (greedy):
| Metric | Value |
|---|---|
| Position 1 | 78.8% |
| Position 2 | 58.8% |
| Position 3 | 45.4% |
| Position 4 | 35.9% |
| Position 5 | 29.0% |
| Position 6 | 23.6% |
| Position 7 | 19.3% |
| Full-sequence acceptance | 41.6% |
| Validation loss | 1.093 |
| Approx. mean accepted length | ~3.9 tokens |
References
Paper: DFlash: Block Diffusion for Flash Speculative Decoding
- Downloads last month
- 11
Model tree for RedHatAI/DeepSeek-V4-Flash-speculator.dflash
Base model
deepseek-ai/DeepSeek-V4-Flash