Qwen3-8B-DSpark
A DSpark speculative-decoding draft model for Qwen/Qwen3-8B, trained with TorchSpec.
DSpark combines a DFlash block-diffusion drafter with EAGLE-style Markov and confidence heads.
This repository contains only the draft model. It is used together with the unmodified Qwen/Qwen3-8B target model at inference time.
Benchmark
Run on vLLM built from: https://github.com/Dogacel/vllm/tree/dspark-local
Benchmark: SPEED-Bench coding, 1×H100 80GB, temperature=1.0, 2048-token outputs, draft_sample_method=probabilistic, num_speculative_tokens=7.
Throughput vs. no speculation
Same engine, same prompts; only the draft is added. Speculation helps most at low concurrency (latency-bound) and still helps when batched (compute-bound).
| Regime | Concurrency | No spec | DSpark | Speedup |
|---|---|---|---|---|
| Single stream | 1 | 147.4 tok/s | 259.9 tok/s | 1.76× |
| Batched | 32 | 2994.8 tok/s | 3875.6 tok/s | 1.29× |

Mean time per output token (TPOT) at each concurrency:

| Metric | No spec (ms/tok) | DSpark (ms/tok) |
|---|---|---|
| Mean TPOT, concurrency 1 | 6.78 | 3.83 |
| Mean TPOT, concurrency 32 | 8.83 | 6.00 |
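
The speedup column is the ratio of DSpark throughput to baseline throughput, and the same kind of ratio can be taken over the TPOT figures. A minimal sketch that recomputes both from the numbers reported above (the dictionary and function names are illustrative, not part of any tool used here):

```python
# Figures copied from the tables above: concurrency -> (no spec, DSpark).
throughput_tok_s = {1: (147.4, 259.9), 32: (2994.8, 3875.6)}
mean_tpot_ms = {1: (6.78, 3.83), 32: (8.83, 6.00)}


def throughput_speedup(concurrency):
    """Higher throughput is better, so divide DSpark by the baseline."""
    base, spec = throughput_tok_s[concurrency]
    return spec / base


def tpot_speedup(concurrency):
    """Lower TPOT is better, so divide the baseline by DSpark."""
    base, spec = mean_tpot_ms[concurrency]
    return base / spec


for c in (1, 32):
    print(f"concurrency={c}: "
          f"throughput {throughput_speedup(c):.2f}x, "
          f"TPOT {tpot_speedup(c):.2f}x")
```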
Acceptance (coding)
| Metric | Value |
|---|---|
| Acceptance length | 2.33 |
| Acceptance rate | 18.9% |
| Per-position accept (P0–P6) | 60.5 / 34.1 / 18.6 / 9.9 / 5.2 / 2.7 / 1.4 % |
Reference: DeepSeek's fully trained dspark_qwen3_8b_block7, evaluated with the same harness, reaches an acceptance length of 3.41.
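
The three acceptance metrics are related. As a rough check, and assuming the common convention that acceptance length counts the accepted draft tokens plus the one token the target model always emits per verification step, the reported values can be approximately recovered from the per-position figures. The function names below are illustrative, and the exact definitions used by the benchmark harness may differ:

```python
# Per-position acceptance percentages (P0..P6) from the table above.
per_position_pct = [60.5, 34.1, 18.6, 9.9, 5.2, 2.7, 1.4]


def acceptance_rate(pcts):
    """Mean acceptance over all drafted positions, in percent."""
    return sum(pcts) / len(pcts)


def acceptance_length(pcts):
    """Expected tokens emitted per verification step.

    ASSUMPTION: one token from the target model is always emitted, and
    each position contributes its acceptance probability on top of that.
    """
    return 1.0 + sum(p / 100.0 for p in pcts)


print(f"acceptance rate   ~ {acceptance_rate(per_position_pct):.1f}%")
print(f"acceptance length ~ {acceptance_length(per_position_pct):.2f}")
```

Because the per-position values are rounded to one decimal place, the recomputed acceptance length can differ from the reported 2.33 in the last digit.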
Serve
# With DSpark speculation
vllm serve Qwen/Qwen3-8B \
--speculative_config '{"method":"dspark","model":"Dogacel/Qwen3-8B-DSpark","num_speculative_tokens":7,"attention_backend":"FLASH_ATTN","draft_sample_method":"probabilistic"}' \
--gpu-memory-utilization 0.8
# No-speculation baseline
vllm serve Qwen/Qwen3-8B --gpu-memory-utilization 0.8
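
The speculative config is a JSON string embedded in a shell command, which is easy to misquote when editing by hand. A small sketch that builds the same command from a Python dict, using only the standard library. The helper name `build_serve_command` is hypothetical, and whether values of `num_speculative_tokens` other than 7 work with this checkpoint is not stated here:

```python
import json
import shlex


def build_serve_command(num_speculative_tokens=7):
    """Return the `vllm serve` command line shown above as one shell-safe string."""
    spec = {
        "method": "dspark",
        "model": "Dogacel/Qwen3-8B-DSpark",
        "num_speculative_tokens": num_speculative_tokens,
        "attention_backend": "FLASH_ATTN",
        "draft_sample_method": "probabilistic",
    }
    args = [
        "vllm", "serve", "Qwen/Qwen3-8B",
        # Compact separators keep the JSON on one line without spaces.
        "--speculative_config", json.dumps(spec, separators=(",", ":")),
        "--gpu-memory-utilization", "0.8",
    ]
    # shlex.join quotes the JSON argument so the shell passes it through intact.
    return shlex.join(args)


print(build_serve_command())
```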
Run the benchmark
# Download SPEED-Bench into the current directory first:
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 -
# Batched (concurrency 32, all coding prompts)
vllm bench serve --model Qwen/Qwen3-8B \
--dataset-name speed_bench --dataset-path . --speed-bench-category coding \
--num-prompts -1 --disable-shuffle --max-concurrency 32 \
--temperature 1.0 --speed-bench-output-len 2048 \
--backend openai-chat --endpoint /v1/chat/completions --skip-chat-template
# Single-stream (concurrency 1; fewer prompts to keep it short)
vllm bench serve --model Qwen/Qwen3-8B \
--dataset-name speed_bench --dataset-path . --speed-bench-category coding \
--num-prompts 16 --disable-shuffle --max-concurrency 1 \
--temperature 1.0 --speed-bench-output-len 2048 \
--backend openai-chat --endpoint /v1/chat/completions --skip-chat-template
Training
Training of this model is not yet complete. The checkpoint is still useful for verifying DSpark implementations in inference engines. For more details, see the implementation PR: https://github.com/lightseekorg/TorchSpec/pull/129
- Framework: TorchSpec
- Config: configs/sglang_qwen3_8b_dspark.yaml
- Inference backend: SGLang (streams target hidden states to the trainer)
- Dataset: 50K-conversation blend (perfectblend_50k.jsonl), Qwen chat template
Key hyperparameters:
| Hyperparameter | Value |
|---|---|
| Optimizer LR (cosine) | 6e-4 → 6e-5 |
| Warmup ratio | 0.04 |
| Max sequence length | 2048 |
| Epochs | 3 |
| TTT length | 7 |
| FSDP strategy | FULL_SHARD, bf16 reduce |
| DFlash block size | 7 |
| DSpark anchors / target layers | 512 / 5 |
| Loss decay gamma | 4.0 |
| CE / L1 / confidence loss weights | 0.1 / 0.9 / 1.0 |
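
To make the learning-rate row concrete, here is a minimal sketch of a cosine schedule from 6e-4 down to 6e-5 with a 0.04 warmup ratio. It assumes linear warmup followed by cosine decay to the floor; TorchSpec's actual scheduler may differ in detail, and `lr_at` and the `total_steps` argument are hypothetical names used only for illustration:

```python
import math

PEAK_LR = 6e-4       # from the table
MIN_LR = 6e-5        # from the table
WARMUP_RATIO = 0.04  # from the table


def lr_at(step, total_steps):
    """Learning rate at a 0-indexed step (ASSUMED linear warmup + cosine decay)."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# total_steps is a placeholder here, not the real run length.
for step in (0, 400, 5_000, 10_000):
    print(step, f"{lr_at(step, total_steps=10_000):.2e}")
```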
Reproduce
Training was run on 4×B200.
First, follow the instructions at https://github.com/lightseekorg/TorchSpec to set up TorchSpec.
python scripts/tools/prepare_perfectblend.py --output data/perfectblend_50k.jsonl --sample-size 50000
python -m torchspec.train_entry --config configs/sglang_qwen3_8b_dspark.yaml \
model.target_model_path=Qwen/Qwen3-8B \
dataset.train_data_path=data/perfectblend_50k.jsonl
The released weights were converted from the FSDP checkpoint to HuggingFace format with:
python tools/convert_to_hf.py \
--input-dir ./outputs/qwen3-8b-dspark-perfectblend/checkpoints/iter_0011803/ \
--config torchspec/config/dspark_draft_config.json
License
Released under Apache-2.0, matching the Qwen/Qwen3-8B base model.