DSpark-DFlash Draft Head for Qwen3.6-27B-AEON
A speculative-decoding draft head for the Qwen3.6-27B-AEON family, built by reproducing and adapting DeepSeek's DSpark recipe on top of z-lab's public DFlash block-diffusion drafter. It combines two drafting paths:
- DFlash block-diffusion backbone: the public z-lab/Qwen3.6-27B-DFlash head (MIT), fine-tuned on on-policy AEON traces.
- VanillaMarkov sequential head: a rank-256 head with a bigram bias term, added on the DSpark-style semi-autoregressive drafting path.
The head is distilled against the self-generated (on-policy) logits of a
Qwen3.6-27B-AEON target, so it is specialized to that target family rather than to
stock Qwen/Qwen3.6-27B.
Results
All numbers below are reported with their measurement conditions. "Measured"
means observed on the described harness; nothing here is extrapolated.
Offline paired acceptance (draft quality)
Paired evaluation on on-policy AEON-generated text (both drafters are scored on the same target continuations), n = 4052 anchors / 176 sequences, cluster-bootstrap 95% CI, eval K = 8.
Versus the stock z-lab DFlash head:
| Metric | This head vs z-lab DFlash |
|---|---|
| Accept, T = 1.0 sampling | +32.4% relative [CI +27%, +38%] |
| Accept, greedy | +17.7% relative |
Per-domain absolute accepted-length gain (all CIs exclude 0):
| Domain | Δ accepted length |
|---|---|
| toolcall | +2.83 |
| chat | +0.80 |
| code | +0.57 |
| math | +0.50 |
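The cluster-bootstrap interval used above can be illustrated with a minimal sketch. This is not the harness from the linked repo: the function name, its arguments, and the percentile method are assumptions made for illustration. It resamples whole sequences (not individual anchors), so anchors from the same sequence stay together, and it reports the relative gain in mean accepted length.

```python
import numpy as np

def cluster_bootstrap_relative_gain(seq_ids, accept_new, accept_base,
                                    n_boot=2000, seed=0):
    """Point estimate and percentile 95% CI for mean(new) / mean(base) - 1.

    seq_ids, accept_new, accept_base are per-anchor arrays of equal length.
    Hypothetical helper: sequences are the resampling unit (the "clusters").
    """
    seq_ids = np.asarray(seq_ids)
    accept_new = np.asarray(accept_new, dtype=float)
    accept_base = np.asarray(accept_base, dtype=float)

    # Collapse anchors to per-sequence sums so that drawing a sequence
    # brings all of its anchors along.
    uniq, inv = np.unique(seq_ids, return_inverse=True)
    sum_new = np.bincount(inv, weights=accept_new, minlength=len(uniq))
    sum_base = np.bincount(inv, weights=accept_base, minlength=len(uniq))

    # Draw n_boot resamples of the sequences, with replacement.
    rng = np.random.default_rng(seed)
    draws = rng.integers(0, len(uniq), size=(n_boot, len(uniq)))

    # The pairing means both drafters share the same anchors, so the ratio
    # of means equals the ratio of sums within each resample.
    gains = sum_new[draws].sum(axis=1) / sum_base[draws].sum(axis=1) - 1.0
    point = accept_new.sum() / accept_base.sum() - 1.0
    lo, hi = np.percentile(gains, [2.5, 97.5])
    return point, lo, hi

# Synthetic example only; these are not the reported measurements.
rng = np.random.default_rng(1)
ids = np.repeat(np.arange(50), 20)
base = rng.uniform(1.0, 4.0, size=ids.size)
new = base * rng.uniform(1.1, 1.5, size=ids.size)
print(cluster_bootstrap_relative_gain(ids, new, base))
```

Resampling at the sequence level gives wider, more honest intervals than resampling anchors when anchors within a sequence are correlated.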
End-to-end serving throughput (measured)
vLLM 0.23.0, ABBA × 3 rounds, K = 8, T = 1.0, NVFP4 target, single RTX PRO 6000.
| Metric | This head | z-lab DFlash | Δ |
|---|---|---|---|
| Aggregate throughput | 194.8 tok/s | 175.5 tok/s | +11.0% (+19.3 tok/s) [CI +13.6, +26.5 tok/s] |
| Accept rate | 0.420 | 0.342 | - |
Per-domain throughput gain (all CIs exclude 0):
| Domain | Δ throughput (relative) |
|---|---|
| code | +15.0% |
| toolcall | +14.1% |
| chat | +8.1% |
| math | +7.1% |
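As a quick check on how the aggregate figure relates to the raw throughputs, the sketch below recomputes the relative gain from the two tok/s values in the table. It also shows one common reading of an "ABBA" design, where each round runs the two configurations in the order A, B, B, A so that slow drift affects both equally. That reading is an assumption; the actual benchmark script is in the linked repo.

```python
def abba_schedule(rounds):
    """Run order for an ABBA design: every round is A, B, B, A.

    Assumed interpretation of "ABBA x 3 rounds"; A and B stand for the
    two drafters being compared.
    """
    return ["A", "B", "B", "A"] * rounds

def relative_gain(new, base):
    """Relative improvement of `new` over `base` as a fraction."""
    return (new - base) / base

schedule = abba_schedule(3)
gain = relative_gain(194.8, 175.5)   # tok/s values from the table above

print(schedule)
print(f"absolute: {194.8 - 175.5:+.1f} tok/s, relative: {gain:+.1%}")
```

The confidence interval in the table is given in tok/s, so it should be compared against the absolute difference rather than the percentage.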
Training
| Setting | Value |
|---|---|
| Loss | L1 distribution loss (0.9) + teacher-argmax CE (0.1) |
| loss_decay_gamma | 6.0 |
| block_size | 11 |
| max_context | 1024 |
| Anchors / sequence | 32 |
| Learning rate | 6e-4, cosine schedule |
| Steps | 6000 (converged at ~4500 in practice) |
| Head dtype | bf16 |
| Target data | target self-generated, on-policy |
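To make the loss row concrete, here is a minimal NumPy sketch of a 0.9 / 0.1 mix of an L1 distribution loss and cross-entropy against the teacher's argmax token. The exact formulation in the training pipeline is not given in this card, so several parts are assumptions: the function names, the per-position reduction, and in particular the use of `loss_decay_gamma` as an exponential decay `exp(-k / gamma)` over positions in the block.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def draft_loss(draft_logits, teacher_logits, gamma=6.0, w_l1=0.9, w_ce=0.1):
    """Illustrative distillation loss for one block.

    draft_logits, teacher_logits: arrays of shape [positions, vocab].
    """
    p = softmax(np.asarray(draft_logits, dtype=float))    # draft distribution
    q = softmax(np.asarray(teacher_logits, dtype=float))  # teacher distribution

    # L1 distance between the two distributions at each position.
    l1 = np.abs(p - q).sum(axis=-1)

    # Cross-entropy of the draft against the teacher's argmax token.
    target = q.argmax(axis=-1)
    ce = -np.log(p[np.arange(len(p)), target] + 1e-12)

    # ASSUMPTION: later positions in the block are down-weighted
    # exponentially, controlled by gamma.
    w = np.exp(-np.arange(len(p)) / gamma)

    per_position = w_l1 * l1 + w_ce * ce
    return float((w * per_position).sum() / w.sum())

# Toy example: 3 positions, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 5))
print(draft_loss(teacher + 0.1 * rng.normal(size=(3, 5)), teacher))
```

With a larger `gamma` the weights decay more slowly, so positions deep in the block contribute more to the total.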
Corpus: a coding/toolcall-heavy mix, 15,936 sequences total:
- 40% toolcall (AEON self-play)
- 25% real agent sessions (tool-use traces)
- 35% general (of which 57% is code)
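The nested percentage in the last bullet is easy to misread, so the arithmetic is spelled out below. The per-bucket counts are rounded from the stated shares and are approximate; the card reports only the percentages and the total. The dictionary keys are labels chosen for this sketch.

```python
total_sequences = 15_936
mix = {"toolcall": 0.40, "agent_sessions": 0.25, "general": 0.35}
code_within_general = 0.57

# Approximate sequence counts implied by the stated shares.
approx_counts = {name: round(total_sequences * share) for name, share in mix.items()}

# Code drawn from the "general" bucket, as a share of the whole corpus.
code_share_of_corpus = mix["general"] * code_within_general

print(approx_counts)
print(f"code from the general bucket: about {code_share_of_corpus:.0%} of the corpus")
```

So roughly a fifth of the corpus is general-purpose code, on top of the toolcall and agent-session data.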
Training / eval / bench code: the full recipe (training pipeline, paired-bootstrap eval harness, ABBA serve benchmark, demo) is published at github.com/hikarioyama/dspark-aeon-27b.
Usage
Target model
Designed for the Qwen3.6-27B-AEON family (vocab 248320). Verified against:
- AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored (BF16)
- AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4 (NVFP4, used for the serving numbers above)
More broadly, this head is compatible with Qwen3.6-27B-AEON merges (vocab 248320).
It is distilled to AEON logits and is not intended as a drop-in drafter for stock
Qwen/Qwen3.6-27B.
Install the vLLM patches (required)
This head uses a Markov semi-autoregressive drafting path that stock vLLM does not implement, so the two bundled patch files are required. They are written against vLLM 0.23.0; do not apply them to other versions.
Overwrite-copy the two files from vllm_patches/ in this repo into your vLLM install:
# from the root of this repo, into your vLLM 0.23.0 site-packages
cp vllm_patches/qwen3_dflash.py "$VLLM/vllm/model_executor/models/"
cp vllm_patches/llm_base_proposer.py "$VLLM/vllm/v1/spec_decode/"
($VLLM = the directory containing your installed vllm package.)
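If you are unsure what to use for $VLLM, the following sketch prints the directory that contains an installed package and the installed version, using only the Python standard library. The helper names are made up for this example, and nothing here is part of the repo. It only inspects the install; it does not copy any files.

```python
import importlib.metadata
import importlib.util
from pathlib import Path

EXPECTED_VLLM_VERSION = "0.23.0"  # version the patches are written against

def package_parent(name):
    """Directory containing the installed package `name`, or None if absent."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin is .../<name>/__init__.py, so go up two levels.
    return Path(spec.origin).parent.parent

def installed_version(name):
    """Installed distribution version of `name`, or None if not installed."""
    try:
        return importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        return None

vllm_parent = package_parent("vllm")
vllm_version = installed_version("vllm")

if vllm_parent is None:
    print("vllm is not installed in this environment")
else:
    print(f"VLLM={vllm_parent}")
    if vllm_version != EXPECTED_VLLM_VERSION:
        print(f"warning: found vLLM {vllm_version}, patches target {EXPECTED_VLLM_VERSION}")
```

Run it with the same interpreter that launches `vllm serve`, otherwise it may report a different environment.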
Serve
vllm serve <target> \
--speculative-config '{"method":"dflash","model":"<this repo>","num_speculative_tokens":8,"draft_sample_method":"probabilistic"}' \
--mamba-cache-dtype float32 \
--attention-backend flash_attn
Replace <target> with an AEON target (e.g.
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4) and <this repo> with this
draft-head repo id.
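The JSON passed to --speculative-config is easy to break with shell quoting. One way to avoid that is to build the argument list programmatically, as in the sketch below. The flags and keys are exactly those shown in the command above; the helper name `build_serve_command` is made up for this example, and the draft repo id is left as a placeholder.

```python
import json
import shlex

def build_serve_command(target, draft_repo, num_speculative_tokens=8):
    """Return the argv list for the serve command shown above."""
    spec_config = {
        "method": "dflash",
        "model": draft_repo,
        "num_speculative_tokens": num_speculative_tokens,
        "draft_sample_method": "probabilistic",
    }
    return [
        "vllm", "serve", target,
        "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
        "--mamba-cache-dtype", "float32",
        "--attention-backend", "flash_attn",
    ]

cmd = build_serve_command(
    "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4",
    "<this repo>",  # placeholder: replace with this draft-head repo id
)

# Shell-safe rendering, suitable for pasting into a terminal.
print(shlex.join(cmd))
```

The list form can also be passed directly to `subprocess.run(cmd)`, which skips shell quoting altogether.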
Optional environment variables
DSPARK_MARKOV_TOPN: experimental. Truncates the Markov head to its top-N candidates to cut per-step overhead in single-stream serving. It measurably lowers acceptance on toolcall, so it is off by default and is best left off.
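To illustrate what top-N truncation does to a draft distribution, the sketch below keeps the N most probable tokens and renormalizes. This is a generic illustration, not the patched vLLM code path: how the actual head applies the truncation (for example, whether it renormalizes) is not described in this card, and the function name is made up for this example.

```python
import numpy as np

def truncate_top_n(probs, n):
    """Keep the n largest probabilities, zero the rest, and renormalize.

    Assumes n >= 1 and that probs is a 1-D probability vector.
    """
    probs = np.asarray(probs, dtype=float)
    if n >= probs.size:
        return probs / probs.sum()
    keep = np.argpartition(probs, -n)[-n:]   # indices of the n largest entries
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

full = np.array([0.5, 0.3, 0.15, 0.05])
print(truncate_top_n(full, 2))
```

Tokens outside the top N can never be proposed after truncation, which is one plausible reason acceptance drops on domains where the target spreads probability over many candidates.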
Honest limitations
- Target-specific. The head is distilled to the logits of a Qwen3.6-27B-AEON merge. Gains do not transfer to unrelated targets, including stock Qwen/Qwen3.6-27B.
- Evaluated at K = 8. The block size is 11 (supports K ≤ 10); all reported acceptance and throughput numbers are at K = 8 and are not guaranteed to hold at other draft lengths.
- Single-GPU numbers. Serving throughput was measured on one RTX PRO 6000 with the NVFP4 target; other hardware, batch regimes, or target quantizations will differ.
- Sampling is the strong regime. The largest wins are under T = 1.0 sampling (accept +32.4% offline) rather than greedy (+17.7% offline). Greedy-heavy workloads will see smaller gains.
Acknowledgements
- z-lab: the DFlash block-diffusion drafter (z-lab/Qwen3.6-27B-DFlash, MIT), which this head fine-tunes. DFlash: Block Diffusion for Flash Speculative Decoding (arXiv:2602.06036).
- DeepSeek: the DSpark paper and the DeepSpec reference implementation, whose recipe this work reproduces and adapts for the Qwen3.6-27B-AEON target.
License
MIT (inherited from the z-lab DFlash head).