Qwen3-8B-DSpark

A DSpark speculative-decoding draft model for Qwen/Qwen3-8B, trained with TorchSpec.

DSpark combines a DFlash block-diffusion drafter with EAGLE-style Markov and confidence heads.

This repository contains only the draft model. It is used together with the unmodified Qwen/Qwen3-8B target model at inference time.

Benchmark

Run on vLLM built from: https://github.com/Dogacel/vllm/tree/dspark-local

Benchmark: SPEED-Bench coding, 1×H100 80GB, temperature=1.0, 2048-token outputs, draft_sample_method=probabilistic, num_speculative_tokens=7.

Throughput vs. no speculation

Same engine, same prompts; only the draft is added. Speculation helps most at low concurrency (latency-bound) and still helps when batched (compute-bound).

Regime Concurrency No spec DSpark Speedup
Single stream 1 147.4 tok/s 259.9 tok/s 1.76×
Batched 32 2994.8 tok/s 3875.6 tok/s 1.29×
Regime No spec (ms/tok) DSpark (ms/tok)
Mean TPOT @ c1 6.78 3.83
Mean TPOT @ c32 8.83 6.00

Acceptance (coding)

Metric Value
Acceptance length 2.33
Acceptance rate 18.9%
Per-position accept (P0–P6) 60.5 / 34.1 / 18.6 / 9.9 / 5.2 / 2.7 / 1.4 %

Reference — DeepSeek's fully-trained dspark_qwen3_8b_block7, same harness: acceptance length 3.41.

Serve

# With DSpark speculation
vllm serve Qwen/Qwen3-8B \
  --speculative_config '{"method":"dspark","model":"Dogacel/Qwen3-8B-DSpark","num_speculative_tokens":7,"attention_backend":"FLASH_ATTN","draft_sample_method":"probabilistic"}' \
  --gpu-memory-utilization 0.8

# No-speculation baseline
vllm serve Qwen/Qwen3-8B --gpu-memory-utilization 0.8

Benchmark

# Download SPEED-Bench into the current directory first:
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 -

# Batched (concurrency 32, all coding prompts)
vllm bench serve --model Qwen/Qwen3-8B \
  --dataset-name speed_bench --dataset-path . --speed-bench-category coding \
  --num-prompts -1 --disable-shuffle --max-concurrency 32 \
  --temperature 1.0 --speed-bench-output-len 2048 \
  --backend openai-chat --endpoint /v1/chat/completions --skip-chat-template

# Single-stream (concurrency 1; fewer prompts to keep it short)
vllm bench serve --model Qwen/Qwen3-8B \
  --dataset-name speed_bench --dataset-path . --speed-bench-category coding \
  --num-prompts 16 --disable-shuffle --max-concurrency 1 \
  --temperature 1.0 --speed-bench-output-len 2048 \
  --backend openai-chat --endpoint /v1/chat/completions --skip-chat-template

Training

Training of this model is not fully complete, however this can serve as a checkpoint for verifying implementation of DSpark for various inference engines. For more details refer to the implementation PR: https://github.com/lightseekorg/TorchSpec/pull/129

Key hyperparameters:

Hyperparameter Value
Optimizer LR (cosine) 6e-4 → 6e-5
Warmup ratio 0.04
Max sequence length 2048
Epochs 3
TTT length 7
FSDP strategy FULL_SHARD, bf16 reduce
DFlash block size 7
DSpark anchors / target layers 512 / 5
Loss decay gamma 4.0
CE / L1 / confidence loss weights 0.1 / 0.9 / 1.0

Reproduce

Training run on 4xB200.

Firstly follow instructions in https://github.com/lightseekorg/TorchSpec to setup TorchSpec.

python scripts/tools/prepare_perfectblend.py  --output data/perfectblend_50k.jsonl --sample-size 50000
python -m torchspec.train_entry --config configs/sglang_qwen3_8b_dspark.yaml \
  model.target_model_path=Qwen/Qwen3-8B \
  dataset.train_data_path=data/perfectblend_50k.jsonl

The released weights were converted from the FSDP checkpoint to HuggingFace format with:

python tools/convert_to_hf.py \
  --input-dir ./outputs/qwen3-8b-dspark-perfectblend/checkpoints/iter_0011803/ \
  --config torchspec/config/dspark_draft_config.json

License

Released under Apache-2.0, matching the Qwen/Qwen3-8B base model.

Downloads last month
26
Safetensors
Model size
2B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Dogacel/Qwen3-8B-DSpark

Finetuned
Qwen/Qwen3-8B
Finetuned
(1798)
this model