RedHatAI/Qwen3-30B-A3B-Instruct-2507-speculator.dflash

This is a DFlash speculator model for Qwen/Qwen3-30B-A3B-Instruct-2507.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen/Qwen3-235B-A22B-Instruct-2507. and stored at Dataset-Qwen3-235B-Instruct

Commands

Using the Speculators library and the helper scripts provided in the repo.

Prepare data

# In virtual environment with speculators installed
python scripts/prepare_data.py \
  --model Qwen/Qwen3-30B-A3B-Instruct-2507
  --data ./regenerated_data.jsonl \
  --assistant-pattern "<\|im_start\|>assistant\s*([\s\S]*?)<\|im_end\|>" \
  --output ./output \
  --seq-length 16384

Launch vLLM

# In (separate) virutal environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
  Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --target-layer-ids 1 12 23 34 45 \
  --max-model-len  32768 \
  --max-num-batched-tokens 32768\
  --tensor-parallel-size 2 \
  --no-enable-chunked-prefill

Launch training

Must be run once vLLM has finished launching and is running in the background.

# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
  --standalone \
  --nproc_per_node 2 \
  scripts/train.py \
  --verifier-name-or-path Qwen/Qwen3-30B-A3B-Instruct-2507 \     
  --data-path ./output \    
  --on-missing generate \    
  --on-generate delete \    
  --scheduler-type cosine \    
  --draft-vocab-size 32000 \    
  --max-anchors 1024 \    
  --target-layer-ids 1 12 23 34 45 \
  --speculator-type dflash \    
  --num-layers 5 \    
  --logger trackio  \    
  --lr 0.0006 \    
  --epochs 5 \    
  --sliding-window 2048 \    
  --sliding-window-indices 0 1 2 3 4 \    
  --draft-hidden-act silu 

Model Specifications

Base Model Qwen/Qwen3-30B-A3B-Instruct-2507
Chat Template Qwen/Qwen3-30B-A3B-Instruct-2507 (use /chat/completions endpoint)
Format Safetensors
License Apache 2.0
Validation Hardware Nvidia A100

Deployment

# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git     
                                                                                                                                                                                                                                                                                                          
# Deploy with speculative decoding                                                                                                                                                                                                                                                                        
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \                                                                                                                                                                                                                                                                                
    --tensor-parallel-size 2 \                                                                                                                                                                                                                                                                            
    --max-num-batched-tokens 32768 \
    --attention-backend FLASH_ATTN \ 
    --speculative-config '{                                                                                                                                                                                                                                                                               
        "model": "RedHatAI/Qwen3-30B-A3B-Instruct-2507-speculator.dflash",                                                                                                                                                                                                                                                   
        "num_speculative_tokens": 15,                                                                                                                                                                                                                                                                      
        "method": "dflash"                                                                                                                                                                                                                                                                                
    }'

Preliminary Evaluations

Per-position token acceptance rates across datasets:
(with reasoning enabled)

Dataset Pos 1 Pos 2 Pos 3 Pos 4 Pos 5 Pos 6 Pos 7 Pos 8 Pos 9 Pos 10 Pos 11 Pos 12 Pos 13 Pos 14 Pos 15 Avg Acceptance Rate
HumanEval 80.78% 61.07% 45.17% 33.18% 24.47% 17.83% 12.49% 7.89% 4.39% 1.99% 0.79% 0.24% 0.07% 0.00% 0.00% 19.34%
math_reasoning 84.71% 67.42% 52.57% 40.94% 31.17% 23.41% 17.22% 11.15% 6.45% 2.95% 1.14% 0.46% 0.22% 0.07% 0.00% 22.66%
qa 59.36% 30.69% 15.48% 7.62% 3.46% 1.79% 0.69% 0.30% 0.10% 0.05% 0.00% 0.00% 0.00% 0.00% 0.00% 7.99%
question 66.95% 39.10% 22.94% 14.08% 8.91% 6.07% 3.84% 2.22% 1.23% 0.49% 0.18% 0.05% 0.02% 0.02% 0.00% 11.06%
rag 63.58% 35.87% 18.79% 9.92% 4.99% 2.28% 1.07% 0.38% 0.12% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 9.11%
summarization 60.85% 30.37% 14.09% 6.58% 2.84% 1.32% 0.51% 0.12% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 7.79%
tool_call 65.47% 39.84% 24.14% 14.97% 9.09% 5.54% 3.36% 1.87% 0.94% 0.39% 0.12% 0.06% 0.00% 0.00% 0.00% 11.03%
translation 68.00% 34.60% 14.40% 5.40% 1.80% 0.70% 0.20% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 8.30%
writing 65.98% 38.84% 22.62% 14.32% 9.14% 5.90% 3.93% 2.45% 1.22% 0.61% 0.22% 0.07% 0.00% 0.00% 0.00% 11.00%

References

Paper: DFlash: Block Diffusion for Flash Speculative Decoding

Downloads last month
12
Safetensors
Model size
0.7B params
Tensor type
I64
·
BF16
·
BOOL
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for inference-optimization/Qwen3-30B-A3B-Instruct-2507-speculator.dflash