RedHatAI/Qwen3-8B-speculator.peagle

This is a DFlash speculator model for Qwen/Qwen3-8B.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen3-8B (with reasoning).

Commands

Using the Speculators library and the helper scripts provided in the repo.

Prepare data

# In virtual environment with speculators installed
python scripts/prepare_data.py \
  --model Qwen/Qwen3-8B
  --data ./regenerated_data.jsonl \
  --output ./output \
  --seq-length 8192

Launch vLLM

# In (separate) virutal environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
  Qwen/Qwen3-8B \
  --target-layer-ids 2 10 18 26 34 \
  -- --port 8000 \
  --gpu-memory-utilization 0.9 \
  --disable-uvicorn-access-log \
  --tensor-parallel-size 1 \
  --data-parallel-size 2

Launch training

Must be run once vLLM has finished launching and is running in the background.

# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
    --standalone --nproc_per_node 2 \
    scripts/train.py \
    --verifier-name-or-path qwen/qwen3-8b \
    --data-path output \
    --vllm-endpoint http://localhost:8000/v1 \
    --hidden-states-path output/hidden_states \
    --save-path output/checkpoints \
    --epochs 5 \
    --lr 6e-4 \
    --total-seq-len 8192 \
    --speculator-type peagle \
    --num-layers 4 \
    --num-depths 7 \
    --down-sample-ratio 0.6 \
    --down-sample-ratio-min 0.2 \
    --no-norm-before-residual \
    --scheduler-type cosine \
    --on-missing generate \
    --on-generate delete

Model Specifications

Base Model Qwen/Qwen3-8B
Chat Template Qwen/Qwen3-8B (use /chat/completions endpoint)
Format Safetensors
License Apache 2.0
Validation Hardware Nvidia H100

Deployment

# Install vLLM                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                          
# Deploy with speculative decoding                                                                                                                                                                                                                                                                        
vllm serve RedHatAI/Qwen3-8B-speculator.peagle

Preliminary Evaluations

Per-position token acceptance rates across datasets:
(with reasoning enabled)

Dataset Pos 1 Pos 2 Pos 3 Pos 4 Pos 5 Pos 6 Pos 7 Avg Length
HumanEval 81.3% 59.0% 41.1% 27.9% 18.8% 12.8% 8.9% 3.500
math_reasoning 83.3% 63.5% 47.0% 34.3% 24.4% 17.2% 11.8% 3.820
qa 70.5% 44.7% 27.6% 17.1% 10.8% 7.1% 4.8% 2.830
question 74.6% 49.6% 31.6% 20.2% 13.1% 8.5% 5.6% 3.030
rag 73.6% 48.4% 29.8% 18.4% 11.3% 6.9% 4.1% 2.930
summarization 68.0% 39.0% 21.0% 10.8% 5.4% 2.6% 1.2% 2.480
tool_call 73.7% 47.6% 28.7% 17.1% 10.3% 6.2% 3.7% 2.870
translation 73.8% 47.7% 28.7% 17.3% 10.4% 6.5% 4.1% 2.890
writing 75.0% 50.0% 32.1% 20.6% 13.3% 8.7% 5.7% 3.050

References

Paper: P-EAGLE: Parallel-Drafting EAGLE with Scalable Training

Downloads last month
11
Safetensors
Model size
2B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for RedHatAI/Qwen3-8B-speculator.peagle

Finetuned
Qwen/Qwen3-8B
Finetuned
(1725)
this model

Paper for RedHatAI/Qwen3-8B-speculator.peagle