DSpark-DFlash Draft Head for Qwen3.6-27B-AEON

A speculative-decoding draft head for the Qwen3.6-27B-AEON family, built by reproducing and adapting DeepSeek's DSpark recipe on top of z-lab's public DFlash block-diffusion drafter. It combines two drafting paths:

DFlash block-diffusion backbone — the public z-lab/Qwen3.6-27B-DFlash head (MIT), fine-tuned on on-policy AEON traces.
VanillaMarkov sequential head — a rank-256 head with a bigram bias term, added on the DSpark-style semi-autoregressive drafting path.

The head is distilled against the self-generated (on-policy) logits of a Qwen3.6-27B-AEON target, so it is specialized to that target family rather than to stock Qwen/Qwen3.6-27B.

Results

All numbers below are reported with their measurement conditions. "Measured" means observed on the described harness; nothing here is extrapolated.

Offline paired acceptance (draft quality)

Paired evaluation on on-policy AEON-generated text: both drafters are scored on the same target continuations. n = 4052 anchors across 176 sequences, cluster-bootstrap 95% CI, eval K = 8.

Versus the stock z-lab DFlash head:

Metric	This head vs z-lab DFlash
Accept, T = 1.0 sampling	+32.4% relative [CI +27%, +38%]
Accept, greedy	+17.7% relative

Per-domain absolute accepted-length gain (all CIs exclude 0):

Domain	Δ accepted length
toolcall	+2.83
chat	+0.80
code	+0.57
math	+0.50
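The paired cluster-bootstrap interval used for the numbers above can be illustrated with a short sketch. This is not the published eval harness: the function name, the input layout (one list of paired per-anchor accepted lengths per sequence), and the choice of the sequence as the resampling cluster are assumptions made for illustration. Whole sequences are resampled with replacement so that anchors from the same sequence stay together, and the interval is read off the percentiles of the resampled relative gain.

```python
import random

def cluster_bootstrap_rel_gain(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the relative gain sum(new) / sum(base) - 1.

    clusters: list of sequences; each sequence is a list of
    (base_accept_len, new_accept_len) pairs, one pair per anchor.
    Assumes the summed baseline accepted length is never zero.
    """
    rng = random.Random(seed)

    def rel_gain(sample):
        base = sum(b for seq in sample for b, _ in seq)
        new = sum(n for seq in sample for _, n in seq)
        return new / base - 1.0

    point = rel_gain(clusters)
    stats = []
    for _ in range(n_boot):
        # Resample whole sequences (clusters), not individual anchors.
        resample = [rng.choice(clusters) for _ in clusters]
        stats.append(rel_gain(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi

# Toy data (made up for the example, not from the evaluation set).
toy = [
    [(2, 3), (1, 2), (3, 3)],
    [(4, 5), (2, 2)],
    [(1, 2), (2, 3), (2, 2), (3, 4)],
]
point, lo, hi = cluster_bootstrap_rel_gain(toy)
print(f"relative gain {point:+.1%}, CI [{lo:+.1%}, {hi:+.1%}]")
```

Resampling at the sequence level matters because anchors within one sequence are correlated; resampling anchors independently would typically produce intervals that are too narrow.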

End-to-end serving throughput (measured)

vLLM 0.23.0, ABBA × 3 rounds, K = 8, T = 1.0, NVFP4 target, single RTX PRO 6000.

	This head	z-lab DFlash	Δ
Aggregate throughput	194.8 tok/s	175.5 tok/s	+19.3 tok/s (+11.0%) [CI +13.6, +26.5 tok/s]
Accept rate	0.420	0.342	—

Per-domain throughput gain (all CIs exclude 0):

Domain	Δ throughput (relative)
code	+15.0%
toolcall	+14.1%
chat	+8.1%
math	+7.1%
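A rough sketch of how an ABBA comparison can be scheduled, assuming "ABBA" here means the usual counterbalanced run order (A, B, B, A per round) so that slow drift such as thermal state affects both drafters about equally. `run_benchmark` is a hypothetical callable standing in for one serving run; it is not part of the published benchmark. The last lines only re-derive the aggregate gain from the two throughput figures reported above.

```python
def abba_schedule(rounds):
    """Arm order for `rounds` counterbalanced rounds: A B B A per round."""
    return ["A", "B", "B", "A"] * rounds

def run_abba(run_benchmark, rounds=3):
    """Run both arms in ABBA order and return mean throughput per arm.

    run_benchmark(arm) is a hypothetical callable that returns tok/s
    for a single run of arm "A" or "B".
    """
    results = {"A": [], "B": []}
    for arm in abba_schedule(rounds):
        results[arm].append(run_benchmark(arm))
    return {arm: sum(vals) / len(vals) for arm, vals in results.items()}

def relative_gain(new, base):
    """Relative improvement of `new` over `base`."""
    return (new - base) / base

# Arithmetic check on the aggregate throughput figures reported above.
abs_gain = 194.8 - 175.5
rel = relative_gain(194.8, 175.5)
print(f"{abs_gain:+.1f} tok/s, {rel:+.1%}")  # +19.3 tok/s, +11.0%
```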

Training

Setting	Value
Loss	L1 distribution loss (0.9) + teacher-argmax CE (0.1)
`loss_decay_gamma`	6.0
`block_size`	11
`max_context`	1024
Anchors / sequence	32
Learning rate	6e-4, cosine schedule
Steps	6000 (converged at ~4500 in practice)
Head dtype	bf16
Target data	target self-generated, on-policy
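To make the loss row concrete, here is a minimal pure-Python sketch of a 0.9 / 0.1 mix of an L1 distribution loss and a cross-entropy against the teacher's argmax token. Several details are assumptions, not taken from the training code: the L1 term is computed as the summed absolute difference between draft and teacher probabilities, and `loss_decay_gamma` is applied as an exponential weight `exp(-k / gamma)` over the position `k` inside a block. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def position_loss(draft_logits, teacher_logits, w_l1=0.9, w_ce=0.1):
    """0.9 * L1(p_draft, p_teacher) + 0.1 * CE(p_draft, argmax teacher)."""
    p = softmax(draft_logits)
    q = softmax(teacher_logits)
    l1 = sum(abs(pi - qi) for pi, qi in zip(p, q))
    top = max(range(len(q)), key=q.__getitem__)  # teacher argmax token
    ce = -math.log(max(p[top], 1e-12))           # clamp to avoid log(0)
    return w_l1 * l1 + w_ce * ce

def block_loss(draft_block, teacher_block, gamma=6.0):
    """Weighted mean over block positions.

    Assumption: later positions are down-weighted by exp(-k / gamma).
    """
    weights = [math.exp(-k / gamma) for k in range(len(draft_block))]
    total = sum(
        w * position_loss(d, t)
        for w, d, t in zip(weights, draft_block, teacher_block)
    )
    return total / sum(weights)
```

Under this assumed weighting, with `gamma = 6.0` the weight at position 6 is `exp(-1)`, roughly 0.37 of the weight at position 0, so early draft positions dominate the objective.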

Corpus — coding/toolcall-heavy mix, 15,936 sequences total:

40% toolcall (AEON self-play)
25% real agent sessions (tool-use traces)
35% general (of which 57% is code)

Training / eval / bench code: the full recipe (training pipeline, paired-bootstrap eval harness, ABBA serve benchmark, demo) is published at github.com/hikarioyama/dspark-aeon-27b.

Usage

Target model

Designed for the Qwen3.6-27B-AEON family (vocab 248320). Verified against:

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored (BF16)
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4 (NVFP4, used for the serving numbers above)

More broadly, this head is compatible with Qwen3.6-27B-AEON merges (vocab 248320). It is distilled to AEON logits and is not intended as a drop-in drafter for stock Qwen/Qwen3.6-27B.

Install the vLLM patches (required)

This head uses a Markov semi-autoregressive drafting path that stock vLLM does not implement, so the two bundled patch files are required. They are written against vLLM 0.23.0 — do not apply them to other versions.

Overwrite-copy the two files from vllm_patches/ in this repo into your vLLM install:

# from the root of this repo, into your vLLM 0.23.0 site-packages
cp vllm_patches/qwen3_dflash.py       "$VLLM/vllm/model_executor/models/"
cp vllm_patches/llm_base_proposer.py  "$VLLM/vllm/v1/spec_decode/"

($VLLM = the directory containing your installed vllm package.)
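If you are unsure what `$VLLM` should be, a small standard-library helper can locate it and confirm the installed version before you overwrite anything. This is a convenience sketch, not part of this repo: the function names are hypothetical, and ignoring a local version suffix (the part after `+`) is an assumption about how your build is labelled.

```python
import importlib.metadata
import importlib.util
from pathlib import Path

def containing_dir(package):
    """Directory that contains an installed top-level package.

    For package="vllm" this plays the role of $VLLM in the cp commands.
    find_spec locates the package without importing it.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package} is not installed in this environment")
    # spec.origin is .../<package>/__init__.py, so go up two levels.
    return Path(spec.origin).parent.parent

def check_version(package, expected):
    """Raise if the installed distribution version is not `expected`.

    Assumption: a local suffix such as "+cu128" is ignored in the comparison.
    """
    found = importlib.metadata.version(package)
    if found.split("+")[0] != expected:
        raise RuntimeError(f"{package} {found} found, patches target {expected}")
    return found
```

In the environment that runs vLLM, call `check_version("vllm", "0.23.0")` first, then `print(containing_dir("vllm"))` and use the printed path as `$VLLM`.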

Serve

vllm serve <target> \
  --speculative-config '{"method":"dflash","model":"<this repo>","num_speculative_tokens":8,"draft_sample_method":"probabilistic"}' \
  --mamba-cache-dtype float32 \
  --attention-backend flash_attn

Replace <target> with an AEON target (e.g. AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4) and <this repo> with this draft-head repo id.
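Hand-quoting the JSON passed to `--speculative-config` is easy to get wrong, so one option is to build the command programmatically. The sketch below uses only the flags and config keys shown above; the helper name `serve_command` and the range check are illustrative additions. The check relies on the statement in this card that the block size of 11 supports K ≤ 10.

```python
import json
import shlex

BLOCK_SIZE = 11  # block_size from the Training table

def serve_command(target, draft_repo, k=8):
    """Return a shell-quoted `vllm serve` command line as a string."""
    if not 1 <= k <= BLOCK_SIZE - 1:
        raise ValueError(f"num_speculative_tokens must be in 1..{BLOCK_SIZE - 1}, got {k}")
    spec = {
        "method": "dflash",
        "model": draft_repo,
        "num_speculative_tokens": k,
        "draft_sample_method": "probabilistic",
    }
    argv = [
        "vllm", "serve", target,
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
        "--mamba-cache-dtype", "float32",
        "--attention-backend", "flash_attn",
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(argv)

# Placeholders are kept as-is; substitute your own target and draft repo ids.
print(serve_command("<target>", "<this repo>"))
```

Keep in mind that all reported numbers are at K = 8; other values within the allowed range are accepted by this helper but were not evaluated.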

Optional environment variables

DSPARK_MARKOV_TOPN — experimental. Truncates the Markov head to its top-N candidates to cut per-step overhead in single-stream serving. It measurably lowers acceptance on toolcall, so it is off by default and we recommend leaving it off.

Honest limitations

Target-specific. The head is distilled to the logits of a Qwen3.6-27B-AEON merge. Gains do not transfer to unrelated targets, including stock Qwen/Qwen3.6-27B.
Evaluated at K = 8. The block size is 11 (supports K ≤ 10); all reported acceptance and throughput numbers are at K = 8 and are not guaranteed to hold at other draft lengths.
Single-GPU numbers. Serving throughput was measured on one RTX PRO 6000 with the NVFP4 target; other hardware, batch regimes, or target quantizations will differ.
Sampling is the strong regime. The largest wins are under T = 1.0 sampling (accept +32.4% offline) rather than greedy (+17.7% offline). Greedy-heavy workloads will see smaller gains.

Acknowledgements

z-lab — the DFlash block-diffusion drafter (z-lab/Qwen3.6-27B-DFlash, MIT), which this head fine-tunes. DFlash: Block Diffusion for Flash Speculative Decoding (arXiv:2602.06036).
DeepSeek — the DSpark paper and the DeepSpec reference implementation, whose recipe this work reproduces and adapts for the Qwen3.6-27B-AEON target.

License

MIT (inherited from the z-lab DFlash head).

Model size

2B params, BF16 (Safetensors).
