Qwen3-8B-DSpark

A DSpark speculative-decoding draft model for Qwen/Qwen3-8B, trained with TorchSpec.

DSpark combines a DFlash block-diffusion drafter with EAGLE-style Markov and confidence heads.

This repository contains only the draft model. It is used together with the unmodified Qwen/Qwen3-8B target model at inference time.

Benchmark

Run on vLLM built from: https://github.com/Dogacel/vllm/tree/dspark-local

Benchmark: SPEED-Bench coding, 1×H100 80GB, temperature=1.0, 2048-token outputs, draft_sample_method=probabilistic, num_speculative_tokens=7.

Throughput vs. no speculation

Same engine, same prompts; only the draft is added. Speculation helps most at low concurrency (latency-bound) and still helps when batched (compute-bound).

Regime	Concurrency	No spec	DSpark	Speedup
Single stream	1	147.4 tok/s	259.9 tok/s	1.76×
Batched	32	2994.8 tok/s	3875.6 tok/s	1.29×

Regime	No spec (ms/tok)	DSpark (ms/tok)
Mean TPOT @ c1	6.78	3.83
Mean TPOT @ c32	8.83	6.00

Acceptance (coding)

Metric	Value
Acceptance length	2.33
Acceptance rate	18.9%
Per-position accept (P0–P6)	60.5 / 34.1 / 18.6 / 9.9 / 5.2 / 2.7 / 1.4 %

Reference — DeepSeek's fully-trained dspark_qwen3_8b_block7, same harness: acceptance length 3.41.

Serve

# With DSpark speculation
vllm serve Qwen/Qwen3-8B \
  --speculative_config '{"method":"dspark","model":"Dogacel/Qwen3-8B-DSpark","num_speculative_tokens":7,"attention_backend":"FLASH_ATTN","draft_sample_method":"probabilistic"}' \
  --gpu-memory-utilization 0.8

# No-speculation baseline
vllm serve Qwen/Qwen3-8B --gpu-memory-utilization 0.8

Benchmark

# Download SPEED-Bench into the current directory first:
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 -

# Batched (concurrency 32, all coding prompts)
vllm bench serve --model Qwen/Qwen3-8B \
  --dataset-name speed_bench --dataset-path . --speed-bench-category coding \
  --num-prompts -1 --disable-shuffle --max-concurrency 32 \
  --temperature 1.0 --speed-bench-output-len 2048 \
  --backend openai-chat --endpoint /v1/chat/completions --skip-chat-template

# Single-stream (concurrency 1; fewer prompts to keep it short)
vllm bench serve --model Qwen/Qwen3-8B \
  --dataset-name speed_bench --dataset-path . --speed-bench-category coding \
  --num-prompts 16 --disable-shuffle --max-concurrency 1 \
  --temperature 1.0 --speed-bench-output-len 2048 \
  --backend openai-chat --endpoint /v1/chat/completions --skip-chat-template

Training

Training of this model is not fully complete, however this can serve as a checkpoint for verifying implementation of DSpark for various inference engines. For more details refer to the implementation PR: https://github.com/lightseekorg/TorchSpec/pull/129

Framework: TorchSpec
Config: configs/sglang_qwen3_8b_dspark.yaml
Inference backend: SGLang (streams target hidden states to the trainer)
Dataset: 50K-conversation blend (perfectblend_50k.jsonl), Qwen chat template

Key hyperparameters:

Hyperparameter	Value
Optimizer LR (cosine)	6e-4 → 6e-5
Warmup ratio	0.04
Max sequence length	2048
Epochs	3
TTT length	7
FSDP strategy	`FULL_SHARD`, bf16 reduce
DFlash block size	7
DSpark anchors / target layers	512 / 5
Loss decay gamma	4.0
CE / L1 / confidence loss weights	0.1 / 0.9 / 1.0

Reproduce

Training run on 4xB200.

Firstly follow instructions in https://github.com/lightseekorg/TorchSpec to setup TorchSpec.

python scripts/tools/prepare_perfectblend.py  --output data/perfectblend_50k.jsonl --sample-size 50000
python -m torchspec.train_entry --config configs/sglang_qwen3_8b_dspark.yaml \
  model.target_model_path=Qwen/Qwen3-8B \
  dataset.train_data_path=data/perfectblend_50k.jsonl

The released weights were converted from the FSDP checkpoint to HuggingFace format with:

python tools/convert_to_hf.py \
  --input-dir ./outputs/qwen3-8b-dspark-perfectblend/checkpoints/iter_0011803/ \
  --config torchspec/config/dspark_draft_config.json

License

Released under Apache-2.0, matching the Qwen/Qwen3-8B base model.

Downloads last month: 26

Safetensors

Model size

2B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Dogacel/Qwen3-8B-DSpark

Base model

Qwen/Qwen3-8B-Base

Finetuned

Qwen/Qwen3-8B

Finetuned

(1798)

this model