RedHatAI/Qwen3-30B-A3B-Instruct-2507-speculator.dflash

This is a DFlash speculator model for Qwen/Qwen3-30B-A3B-Instruct-2507.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen/Qwen3-235B-A22B-Instruct-2507. and stored at Dataset-Qwen3-235B-Instruct

Commands

Using the Speculators library and the helper scripts provided in the repo.

Prepare data

# In virtual environment with speculators installed
python scripts/prepare_data.py \
  --model Qwen/Qwen3-30B-A3B-Instruct-2507
  --data ./regenerated_data.jsonl \
  --assistant-pattern "<\|im_start\|>assistant\s*([\s\S]*?)<\|im_end\|>" \
  --output ./output \
  --seq-length 16384

Launch vLLM

# In (separate) virutal environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
  Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --target-layer-ids 1 12 23 34 45 \
  --max-model-len  32768 \
  --max-num-batched-tokens 32768\
  --tensor-parallel-size 2 \
  --no-enable-chunked-prefill

Launch training

Must be run once vLLM has finished launching and is running in the background.

# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
  --standalone \
  --nproc_per_node 2 \
  scripts/train.py \
  --verifier-name-or-path Qwen/Qwen3-30B-A3B-Instruct-2507 \     
  --data-path ./output \    
  --on-missing generate \    
  --on-generate delete \    
  --scheduler-type cosine \    
  --draft-vocab-size 32000 \    
  --max-anchors 1024 \    
  --target-layer-ids 1 12 23 34 45 \
  --speculator-type dflash \    
  --num-layers 5 \    
  --logger trackio  \    
  --lr 0.0006 \    
  --epochs 5 \    
  --sliding-window 2048 \    
  --sliding-window-indices 0 1 2 3 4 \    
  --draft-hidden-act silu

Model Specifications


Base Model	Qwen/Qwen3-30B-A3B-Instruct-2507
Chat Template	Qwen/Qwen3-30B-A3B-Instruct-2507 (use `/chat/completions` endpoint)
Format	Safetensors
License	Apache 2.0
Validation Hardware	Nvidia A100

Deployment

# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git     
                                                                                                                                                                                                                                                                                                          
# Deploy with speculative decoding                                                                                                                                                                                                                                                                        
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \                                                                                                                                                                                                                                                                                
    --tensor-parallel-size 2 \                                                                                                                                                                                                                                                                            
    --max-num-batched-tokens 32768 \
    --attention-backend FLASH_ATTN \ 
    --speculative-config '{                                                                                                                                                                                                                                                                               
        "model": "RedHatAI/Qwen3-30B-A3B-Instruct-2507-speculator.dflash",                                                                                                                                                                                                                                                   
        "num_speculative_tokens": 15,                                                                                                                                                                                                                                                                      
        "method": "dflash"                                                                                                                                                                                                                                                                                
    }'

Preliminary Evaluations

Per-position token acceptance rates across datasets:
(with reasoning enabled)

Dataset	Pos 1	Pos 2	Pos 3	Pos 4	Pos 5	Pos 6	Pos 7	Pos 8	Pos 9	Pos 10	Pos 11	Pos 12	Pos 13	Pos 14	Pos 15	Avg Acceptance Rate
HumanEval	80.78%	61.07%	45.17%	33.18%	24.47%	17.83%	12.49%	7.89%	4.39%	1.99%	0.79%	0.24%	0.07%	0.00%	0.00%	19.34%
math_reasoning	84.71%	67.42%	52.57%	40.94%	31.17%	23.41%	17.22%	11.15%	6.45%	2.95%	1.14%	0.46%	0.22%	0.07%	0.00%	22.66%
qa	59.36%	30.69%	15.48%	7.62%	3.46%	1.79%	0.69%	0.30%	0.10%	0.05%	0.00%	0.00%	0.00%	0.00%	0.00%	7.99%
question	66.95%	39.10%	22.94%	14.08%	8.91%	6.07%	3.84%	2.22%	1.23%	0.49%	0.18%	0.05%	0.02%	0.02%	0.00%	11.06%
rag	63.58%	35.87%	18.79%	9.92%	4.99%	2.28%	1.07%	0.38%	0.12%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	9.11%
summarization	60.85%	30.37%	14.09%	6.58%	2.84%	1.32%	0.51%	0.12%	0.01%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	7.79%
tool_call	65.47%	39.84%	24.14%	14.97%	9.09%	5.54%	3.36%	1.87%	0.94%	0.39%	0.12%	0.06%	0.00%	0.00%	0.00%	11.03%
translation	68.00%	34.60%	14.40%	5.40%	1.80%	0.70%	0.20%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	8.30%
writing	65.98%	38.84%	22.62%	14.32%	9.14%	5.90%	3.93%	2.45%	1.22%	0.61%	0.22%	0.07%	0.00%	0.00%	0.00%	11.00%

References

Paper: DFlash: Block Diffusion for Flash Speculative Decoding

Downloads last month: 12

Safetensors

Model size

0.7B params

Tensor type

I64

BF16

BOOL

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for inference-optimization/Qwen3-30B-A3B-Instruct-2507-speculator.dflash

DFlash: Block Diffusion for Flash Speculative Decoding

Paper • 2602.06036 • Published Feb 5 • 87