RedHatAI/Qwen3-30B-A3B-Instruct-2507-speculator.dflash
This is a DFlash speculator model for Qwen/Qwen3-30B-A3B-Instruct-2507.
Training Details
This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen/Qwen3-235B-A22B-Instruct-2507. and stored at Dataset-Qwen3-235B-Instruct
Commands
Using the Speculators library and the helper scripts provided in the repo.
Prepare data
# In virtual environment with speculators installed
python scripts/prepare_data.py \
--model Qwen/Qwen3-30B-A3B-Instruct-2507
--data ./regenerated_data.jsonl \
--assistant-pattern "<\|im_start\|>assistant\s*([\s\S]*?)<\|im_end\|>" \
--output ./output \
--seq-length 16384
Launch vLLM
# In (separate) virutal environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
Qwen/Qwen3-30B-A3B-Instruct-2507 \
--target-layer-ids 1 12 23 34 45 \
--max-model-len 32768 \
--max-num-batched-tokens 32768\
--tensor-parallel-size 2 \
--no-enable-chunked-prefill
Launch training
Must be run once vLLM has finished launching and is running in the background.
# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
--standalone \
--nproc_per_node 2 \
scripts/train.py \
--verifier-name-or-path Qwen/Qwen3-30B-A3B-Instruct-2507 \
--data-path ./output \
--on-missing generate \
--on-generate delete \
--scheduler-type cosine \
--draft-vocab-size 32000 \
--max-anchors 1024 \
--target-layer-ids 1 12 23 34 45 \
--speculator-type dflash \
--num-layers 5 \
--logger trackio \
--lr 0.0006 \
--epochs 5 \
--sliding-window 2048 \
--sliding-window-indices 0 1 2 3 4 \
--draft-hidden-act silu
Model Specifications
| Base Model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Chat Template | Qwen/Qwen3-30B-A3B-Instruct-2507 (use /chat/completions endpoint) |
| Format | Safetensors |
| License | Apache 2.0 |
| Validation Hardware | Nvidia A100 |
Deployment
# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git
# Deploy with speculative decoding
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
--tensor-parallel-size 2 \
--max-num-batched-tokens 32768 \
--attention-backend FLASH_ATTN \
--speculative-config '{
"model": "RedHatAI/Qwen3-30B-A3B-Instruct-2507-speculator.dflash",
"num_speculative_tokens": 15,
"method": "dflash"
}'
Preliminary Evaluations
Per-position token acceptance rates across datasets:
(with reasoning enabled)
| Dataset | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Pos 8 | Pos 9 | Pos 10 | Pos 11 | Pos 12 | Pos 13 | Pos 14 | Pos 15 | Avg Acceptance Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HumanEval | 80.78% | 61.07% | 45.17% | 33.18% | 24.47% | 17.83% | 12.49% | 7.89% | 4.39% | 1.99% | 0.79% | 0.24% | 0.07% | 0.00% | 0.00% | 19.34% |
| math_reasoning | 84.71% | 67.42% | 52.57% | 40.94% | 31.17% | 23.41% | 17.22% | 11.15% | 6.45% | 2.95% | 1.14% | 0.46% | 0.22% | 0.07% | 0.00% | 22.66% |
| qa | 59.36% | 30.69% | 15.48% | 7.62% | 3.46% | 1.79% | 0.69% | 0.30% | 0.10% | 0.05% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 7.99% |
| question | 66.95% | 39.10% | 22.94% | 14.08% | 8.91% | 6.07% | 3.84% | 2.22% | 1.23% | 0.49% | 0.18% | 0.05% | 0.02% | 0.02% | 0.00% | 11.06% |
| rag | 63.58% | 35.87% | 18.79% | 9.92% | 4.99% | 2.28% | 1.07% | 0.38% | 0.12% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 9.11% |
| summarization | 60.85% | 30.37% | 14.09% | 6.58% | 2.84% | 1.32% | 0.51% | 0.12% | 0.01% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 7.79% |
| tool_call | 65.47% | 39.84% | 24.14% | 14.97% | 9.09% | 5.54% | 3.36% | 1.87% | 0.94% | 0.39% | 0.12% | 0.06% | 0.00% | 0.00% | 0.00% | 11.03% |
| translation | 68.00% | 34.60% | 14.40% | 5.40% | 1.80% | 0.70% | 0.20% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 8.30% |
| writing | 65.98% | 38.84% | 22.62% | 14.32% | 9.14% | 5.90% | 3.93% | 2.45% | 1.22% | 0.61% | 0.22% | 0.07% | 0.00% | 0.00% | 0.00% | 11.00% |
References
Paper: DFlash: Block Diffusion for Flash Speculative Decoding
- Downloads last month
- 12