DSpark-DFlash Draft Head for Qwen3.6-27B-AEON

A speculative-decoding draft head for the Qwen3.6-27B-AEON family, built by reproducing and adapting DeepSeek's DSpark recipe on top of z-lab's public DFlash block-diffusion drafter. It combines two drafting paths:

  1. DFlash block-diffusion backbone: the public z-lab/Qwen3.6-27B-DFlash head (MIT), fine-tuned on on-policy AEON traces.
  2. VanillaMarkov sequential head: a rank-256 head with a bigram bias term, added on the DSpark-style semi-autoregressive drafting path.

The head is distilled against the self-generated (on-policy) logits of a Qwen3.6-27B-AEON target, so it is specialized to that target family rather than to stock Qwen/Qwen3.6-27B.

Learning curve


Results

All numbers below are reported with their measurement conditions. "Measured" means observed on the described harness; nothing here is extrapolated.

Offline paired acceptance (draft quality)

Paired evaluation on on-policy AEON-generated text (both drafters are scored on the same target continuations), n = 4052 anchors across 176 sequences, cluster-bootstrap 95% CI, eval K = 8.

Versus the stock z-lab DFlash head:

| Metric | This head vs z-lab DFlash |
| --- | --- |
| Accept, T = 1.0 sampling | +32.4% relative [CI +27%, +38%] |
| Accept, greedy | +17.7% relative |
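
The confidence interval above comes from a cluster bootstrap. The following is a minimal sketch of that technique, not the published eval harness: it assumes each sequence is one cluster, that the statistic is the relative gain in total accepted length, and that a percentile interval is used. The function name and data layout are hypothetical.

```python
import random


def cluster_bootstrap_rel_gain(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile cluster-bootstrap CI for a paired relative gain.

    clusters: list of sequences; each sequence is a list of
              (new_accept, base_accept) pairs, one pair per anchor.
    Returns (point_estimate, ci_low, ci_high).
    ASSUMPTION: sequences are the resampling unit and base totals are > 0.
    """
    rng = random.Random(seed)

    def rel_gain(sample):
        # Both drafters share the same anchors, so a ratio of sums
        # equals a ratio of means.
        new = sum(a for seq in sample for a, _ in seq)
        base = sum(b for seq in sample for _, b in seq)
        return new / base - 1.0

    point = rel_gain(clusters)
    # Resample whole sequences with replacement, keeping anchors together.
    stats = sorted(
        rel_gain([rng.choice(clusters) for _ in clusters])
        for _ in range(n_boot)
    )
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, low, high
```

Resampling sequences rather than individual anchors keeps correlated anchors from the same sequence together, which is why the interval is wider than a naive per-anchor bootstrap would give.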

Per-domain absolute accepted-length gain (all CIs exclude 0):

| Domain | Δ accepted length |
| --- | --- |
| toolcall | +2.83 |
| chat | +0.80 |
| code | +0.57 |
| math | +0.50 |
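
For readers unfamiliar with the metric, here is a small sketch of how accepted length is usually computed in speculative decoding. It uses the standard rules (argmax match under greedy decoding, acceptance probability min(1, p/q) under sampling); whether the harness in this repo counts the bonus token or defines the metric differently is an assumption left open here, and the function names are illustrative.

```python
def accepted_length(accept_flags):
    """Number of draft tokens accepted before the first rejection."""
    n = 0
    for ok in accept_flags:
        if not ok:
            break
        n += 1
    return n


def greedy_flags(draft_tokens, target_argmax):
    """Greedy verification: a draft token is accepted if it equals the target argmax."""
    return [d == t for d, t in zip(draft_tokens, target_argmax)]


def sampling_accept_prob(p_target, q_draft):
    """Standard speculative-sampling rule: accept a sampled draft token
    with probability min(1, p_target / q_draft)."""
    return min(1.0, p_target / q_draft)
```

With K = 8 drafted tokens, `accepted_length` ranges from 0 to 8 per verification step, so a per-domain gain of +2.83 means almost three additional drafted tokens survive verification on average.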

End-to-end serving throughput (measured)

vLLM 0.23.0, ABBA × 3 rounds, K = 8, T = 1.0, NVFP4 target, single RTX PRO 6000.
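
As a rough illustration of the run order, the sketch below assumes "ABBA" denotes the usual counterbalanced design, in which each round runs the two drafters in the order A, B, B, A so that slow drift (thermal state, cache warm-up) affects both equally. The labels are placeholders, and the actual benchmark script may schedule runs differently.

```python
def abba_schedule(rounds, a="this_head", b="zlab_dflash"):
    """Return the counterbalanced run order for the given number of ABBA rounds."""
    order = []
    for _ in range(rounds):
        order.extend([a, b, b, a])
    return order
```

Under this assumption, three rounds give twelve runs in total, six per drafter.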

| Metric | This head | z-lab DFlash | Δ |
| --- | --- | --- | --- |
| Aggregate throughput | 194.8 tok/s | 175.5 tok/s | +19.3 tok/s (+11.0%) [CI +13.6, +26.5 tok/s] |
| Accept rate | 0.420 | 0.342 | n/a |

Per-domain throughput gain (all CIs exclude 0):

| Domain | Δ tok/s (relative) |
| --- | --- |
| code | +15.0% |
| toolcall | +14.1% |
| chat | +8.1% |
| math | +7.1% |

Training

| Setting | Value |
| --- | --- |
| Loss | L1 distribution loss (0.9) + teacher-argmax CE (0.1) |
| loss_decay_gamma | 6.0 |
| block_size | 11 |
| max_context | 1024 |
| Anchors / sequence | 32 |
| Learning rate | 6e-4, cosine schedule |
| Steps | 6000 (converged at ~4500 in practice) |
| Head dtype | bf16 |
| Target data | target self-generated, on-policy |
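
To make the loss row concrete, here is a pure-Python sketch of a 0.9 / 0.1 mix of an L1 distance between draft and teacher distributions and a cross-entropy against the teacher's argmax token. The per-position weighting `exp(-k / gamma)` is an assumption about what `loss_decay_gamma` controls, and the function names are hypothetical; the training code in the linked repository is the authoritative version.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def position_loss(draft_logits, teacher_logits, w_l1=0.9, w_ce=0.1):
    """0.9 * L1(draft probs, teacher probs) + 0.1 * CE(draft, teacher argmax)."""
    q = softmax(draft_logits)
    p = softmax(teacher_logits)
    l1 = sum(abs(qi - pi) for qi, pi in zip(q, p))
    top = max(range(len(p)), key=p.__getitem__)  # teacher's most likely token
    ce = -math.log(q[top])
    return w_l1 * l1 + w_ce * ce


def block_loss(draft_block, teacher_block, gamma=6.0):
    """Weighted mean over block positions.

    ASSUMPTION: position k in the block is weighted by exp(-k / gamma),
    so later (harder) positions contribute less.
    """
    weights = [math.exp(-k / gamma) for k in range(len(draft_block))]
    total = sum(
        w * position_loss(d, t)
        for w, d, t in zip(weights, draft_block, teacher_block)
    )
    return total / sum(weights)
```

The L1 term pulls the whole draft distribution toward the teacher's, which matters for acceptance under sampling, while the small cross-entropy term keeps the top-1 token aligned for greedy decoding.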

Corpus (coding/toolcall-heavy mix, 15,936 sequences total):

  • 40% toolcall (AEON self-play)
  • 25% real agent sessions (tool-use traces)
  • 35% general (of which 57% is code)
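
The mix above translates into approximate per-bucket sequence counts as follows. The percentages are presumably rounded, so the counts are estimates rather than the exact split used in training; the bucket keys are labels chosen for this example.

```python
total = 15_936
mix = {"toolcall": 0.40, "agent_sessions": 0.25, "general": 0.35}

# Approximate sequence count per bucket (percentages are likely rounded).
counts = {name: round(total * share) for name, share in mix.items()}

# Code inside the "general" bucket, as a share of the whole corpus.
code_share_of_total = mix["general"] * 0.57

for name, n in counts.items():
    print(f"{name}: ~{n} sequences")
print(f"code within general: ~{code_share_of_total:.1%} of the corpus")
```

So roughly a fifth of the corpus is general-purpose code, on top of the tool-use data in the other two buckets.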

Training / eval / bench code: the full recipe (training pipeline, paired-bootstrap eval harness, ABBA serve benchmark, demo) is published at github.com/hikarioyama/dspark-aeon-27b.

Usage

Target model

Designed for the Qwen3.6-27B-AEON family (vocab 248320). Verified against:

More broadly, this head is compatible with Qwen3.6-27B-AEON merges (vocab 248320). It is distilled to AEON logits and is not intended as a drop-in drafter for stock Qwen/Qwen3.6-27B.

Install the vLLM patches (required)

This head uses a Markov semi-autoregressive drafting path that stock vLLM does not implement, so the two bundled patch files are required. They are written against vLLM 0.23.0; do not apply them to other versions.

Overwrite-copy the two files from vllm_patches/ in this repo into your vLLM install:

# from the root of this repo, into your vLLM 0.23.0 site-packages
cp vllm_patches/qwen3_dflash.py       "$VLLM/vllm/model_executor/models/"
cp vllm_patches/llm_base_proposer.py  "$VLLM/vllm/v1/spec_decode/"

($VLLM = the directory containing your installed vllm package.)
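
If you are unsure where `$VLLM` points, a helper along these lines can locate it and check the installed version before anything is overwritten. It is a sketch using only the Python standard library; the function names are made up for this example, and stripping a local build tag such as `+cu128` before comparing versions is an assumption about how your wheel is labeled.

```python
import importlib.util
from pathlib import Path


def package_parent(name):
    """Directory containing the installed package `name`.

    For name="vllm" this is the value to use as $VLLM.
    The package is located without being imported.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{name} is not installed in this environment")
    # spec.origin is <parent>/<name>/__init__.py for a regular package.
    return Path(spec.origin).resolve().parent.parent


def version_matches(found, expected):
    """Compare versions, ignoring a local build tag after '+'."""
    return found.split("+")[0] == expected


# Example usage (requires vLLM to be installed):
#   from importlib import metadata
#   print(package_parent("vllm"))
#   assert version_matches(metadata.version("vllm"), "0.23.0")
```

Backing up the two original files before the overwrite-copy makes it easy to return to stock vLLM later.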

Serve

vllm serve <target> \
  --speculative-config '{"method":"dflash","model":"<this repo>","num_speculative_tokens":8,"draft_sample_method":"probabilistic"}' \
  --mamba-cache-dtype float32 \
  --attention-backend flash_attn

Replace <target> with an AEON target (e.g. AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4) and <this repo> with this draft-head repo id.
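
Hand-writing the JSON inside shell quotes is easy to get wrong, so one option is to generate the command. The sketch below only reproduces the flags and keys shown above; the helper name is hypothetical, and the bound on K follows the block size of 11 stated in this card (K ≤ 10).

```python
import json
import shlex

BLOCK_SIZE = 11  # from the training settings; drafts support K <= BLOCK_SIZE - 1


def build_serve_command(target, draft_repo, k=8):
    """Return a shell-safe `vllm serve` command line for this draft head."""
    if not 1 <= k <= BLOCK_SIZE - 1:
        raise ValueError(f"num_speculative_tokens must be in 1..{BLOCK_SIZE - 1}, got {k}")
    spec = {
        "method": "dflash",
        "model": draft_repo,
        "num_speculative_tokens": k,
        "draft_sample_method": "probabilistic",
    }
    argv = [
        "vllm", "serve", target,
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
        "--mamba-cache-dtype", "float32",
        "--attention-backend", "flash_attn",
    ]
    return shlex.join(argv)  # quotes the JSON argument for the shell
```

Calling `print(build_serve_command("<target>", "<this repo>"))` with your own ids prints a line that can be pasted into a shell. Values of K other than 8 are accepted up to 10 but were not benchmarked here.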

Optional environment variables

  • DSPARK_MARKOV_TOPN (experimental): truncates the Markov head to its top-N candidates to cut per-step overhead in single-stream serving. It measurably lowers acceptance on toolcall, so it is off by default and is best left off.
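
Conceptually, top-N truncation keeps the N highest-scoring candidates and masks the rest. The sketch below illustrates that idea only; how the bundled patch actually parses `DSPARK_MARKOV_TOPN` and applies it is not documented here, so both functions are hypothetical.

```python
import heapq
import math
import os


def truncate_top_n(logits, n):
    """Keep the n largest logits and mask all others to -inf.

    n of None, <= 0, or >= len(logits) leaves the logits unchanged.
    """
    if n is None or n <= 0 or n >= len(logits):
        return list(logits)
    keep = set(heapq.nlargest(n, range(len(logits)), key=logits.__getitem__))
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def top_n_from_env(env=os.environ):
    """ASSUMPTION: unset or empty means 'off'; otherwise the value is an integer N."""
    raw = env.get("DSPARK_MARKOV_TOPN", "")
    return int(raw) if raw.strip() else None
```

Masking low-probability candidates shrinks the work per drafting step but also removes probability mass the target might have accepted, which is consistent with the acceptance drop noted above.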

Honest limitations

  • Target-specific. The head is distilled to the logits of a Qwen3.6-27B-AEON merge. Gains do not transfer to unrelated targets, including stock Qwen/Qwen3.6-27B.
  • Evaluated at K = 8. The block size is 11 (supports K ≤ 10); all reported acceptance and throughput numbers are at K = 8 and are not guaranteed to hold at other draft lengths.
  • Single-GPU numbers. Serving throughput was measured on one RTX PRO 6000 with the NVFP4 target; other hardware, batch regimes, or target quantizations will differ.
  • Sampling is the strong regime. The largest wins are under T = 1.0 sampling (accept +32.4% offline) rather than greedy (+17.7% offline). Greedy-heavy workloads will see smaller gains.

Acknowledgements

  • z-lab: the DFlash block-diffusion drafter (z-lab/Qwen3.6-27B-DFlash, MIT), which this head fine-tunes. DFlash: Block Diffusion for Flash Speculative Decoding (arXiv:2602.06036).
  • DeepSeek: the DSpark paper and the DeepSpec reference implementation, whose recipe this work reproduces and adapts for the Qwen3.6-27B-AEON target.

License

MIT (inherited from the z-lab DFlash head).
