RedHatAI/Qwen3-8B-speculator.peagle

This is a DFlash speculator model for Qwen/Qwen3-8B.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen3-8B (with reasoning).

Commands

Using the Speculators library and the helper scripts provided in the repo.

Prepare data

# In virtual environment with speculators installed
python scripts/prepare_data.py \
  --model Qwen/Qwen3-8B
  --data ./regenerated_data.jsonl \
  --output ./output \
  --seq-length 8192

Launch vLLM

# In (separate) virutal environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
  Qwen/Qwen3-8B \
  --target-layer-ids 2 10 18 26 34 \
  -- --port 8000 \
  --gpu-memory-utilization 0.9 \
  --disable-uvicorn-access-log \
  --tensor-parallel-size 1 \
  --data-parallel-size 2

Launch training

Must be run once vLLM has finished launching and is running in the background.

# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
    --standalone --nproc_per_node 2 \
    scripts/train.py \
    --verifier-name-or-path qwen/qwen3-8b \
    --data-path output \
    --vllm-endpoint http://localhost:8000/v1 \
    --hidden-states-path output/hidden_states \
    --save-path output/checkpoints \
    --epochs 5 \
    --lr 6e-4 \
    --total-seq-len 8192 \
    --speculator-type peagle \
    --num-layers 4 \
    --num-depths 7 \
    --down-sample-ratio 0.6 \
    --down-sample-ratio-min 0.2 \
    --no-norm-before-residual \
    --scheduler-type cosine \
    --on-missing generate \
    --on-generate delete

Model Specifications


Base Model	Qwen/Qwen3-8B
Chat Template	Qwen/Qwen3-8B (use `/chat/completions` endpoint)
Format	Safetensors
License	Apache 2.0
Validation Hardware	Nvidia H100

Deployment

# Install vLLM                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                          
# Deploy with speculative decoding                                                                                                                                                                                                                                                                        
vllm serve RedHatAI/Qwen3-8B-speculator.peagle

Preliminary Evaluations

Per-position token acceptance rates across datasets:
(with reasoning enabled)

Dataset	Pos 1	Pos 2	Pos 3	Pos 4	Pos 5	Pos 6	Pos 7	Avg Length
HumanEval	81.3%	59.0%	41.1%	27.9%	18.8%	12.8%	8.9%	3.500
math_reasoning	83.3%	63.5%	47.0%	34.3%	24.4%	17.2%	11.8%	3.820
qa	70.5%	44.7%	27.6%	17.1%	10.8%	7.1%	4.8%	2.830
question	74.6%	49.6%	31.6%	20.2%	13.1%	8.5%	5.6%	3.030
rag	73.6%	48.4%	29.8%	18.4%	11.3%	6.9%	4.1%	2.930
summarization	68.0%	39.0%	21.0%	10.8%	5.4%	2.6%	1.2%	2.480
tool_call	73.7%	47.6%	28.7%	17.1%	10.3%	6.2%	3.7%	2.870
translation	73.8%	47.7%	28.7%	17.3%	10.4%	6.5%	4.1%	2.890
writing	75.0%	50.0%	32.1%	20.6%	13.3%	8.7%	5.7%	3.050