RedHatAI/Qwen3-8B-speculator.peagle
This is a DFlash speculator model for Qwen/Qwen3-8B.
Training Details
This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen3-8B (with reasoning).
Commands
Using the Speculators library and the helper scripts provided in the repo.
Prepare data
# In virtual environment with speculators installed
python scripts/prepare_data.py \
--model Qwen/Qwen3-8B
--data ./regenerated_data.jsonl \
--output ./output \
--seq-length 8192
Launch vLLM
# In (separate) virutal environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
Qwen/Qwen3-8B \
--target-layer-ids 2 10 18 26 34 \
-- --port 8000 \
--gpu-memory-utilization 0.9 \
--disable-uvicorn-access-log \
--tensor-parallel-size 1 \
--data-parallel-size 2
Launch training
Must be run once vLLM has finished launching and is running in the background.
# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
--standalone --nproc_per_node 2 \
scripts/train.py \
--verifier-name-or-path qwen/qwen3-8b \
--data-path output \
--vllm-endpoint http://localhost:8000/v1 \
--hidden-states-path output/hidden_states \
--save-path output/checkpoints \
--epochs 5 \
--lr 6e-4 \
--total-seq-len 8192 \
--speculator-type peagle \
--num-layers 4 \
--num-depths 7 \
--down-sample-ratio 0.6 \
--down-sample-ratio-min 0.2 \
--no-norm-before-residual \
--scheduler-type cosine \
--on-missing generate \
--on-generate delete
Model Specifications
| Base Model | Qwen/Qwen3-8B |
| Chat Template | Qwen/Qwen3-8B (use /chat/completions endpoint) |
| Format | Safetensors |
| License | Apache 2.0 |
| Validation Hardware | Nvidia H100 |
Deployment
# Install vLLM
# Deploy with speculative decoding
vllm serve RedHatAI/Qwen3-8B-speculator.peagle
Preliminary Evaluations
Per-position token acceptance rates across datasets:
(with reasoning enabled)
| Dataset | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg Length |
|---|---|---|---|---|---|---|---|---|
| HumanEval | 81.3% | 59.0% | 41.1% | 27.9% | 18.8% | 12.8% | 8.9% | 3.500 |
| math_reasoning | 83.3% | 63.5% | 47.0% | 34.3% | 24.4% | 17.2% | 11.8% | 3.820 |
| qa | 70.5% | 44.7% | 27.6% | 17.1% | 10.8% | 7.1% | 4.8% | 2.830 |
| question | 74.6% | 49.6% | 31.6% | 20.2% | 13.1% | 8.5% | 5.6% | 3.030 |
| rag | 73.6% | 48.4% | 29.8% | 18.4% | 11.3% | 6.9% | 4.1% | 2.930 |
| summarization | 68.0% | 39.0% | 21.0% | 10.8% | 5.4% | 2.6% | 1.2% | 2.480 |
| tool_call | 73.7% | 47.6% | 28.7% | 17.1% | 10.3% | 6.2% | 3.7% | 2.870 |
| translation | 73.8% | 47.7% | 28.7% | 17.3% | 10.4% | 6.5% | 4.1% | 2.890 |
| writing | 75.0% | 50.0% | 32.1% | 20.6% | 13.3% | 8.7% | 5.7% | 3.050 |
References
Paper: P-EAGLE: Parallel-Drafting EAGLE with Scalable Training
- Downloads last month
- 11