Qwen3.5-0.8B DFlash (la-draftery Phase 1.2)
A 5-layer DFlash speculative-decoding drafter for Qwen/Qwen3.5-0.8B,
trained in la-draftery
using the SGLang team's
SpecForge training framework
(via the mem-research/specforge dev/math-regen-sweep fork).
Headline numbers
| Metric | Value | Reference |
|---|---|---|
| Math500 mean acceptance length (offline) | 5.7095 | docs/012 |
| SGLang spec-v2 serving speedup (Math500/64) | 1.84x | docs/015 |
| Serving accept length | 5.34 | docs/015 |
| Prompts faster than target-only | 64 / 64 (min 1.28x, max 2.94x) | docs/015 |
| Quality parity (Math500 boxed + accuracy) | spec >= target | docs/015 |
Training
- Base target:
Qwen/Qwen3.5-0.8B(hybrid GDN, 24 layers, hidden 1024). - Training data:
Moonlight556/qwen3.5-0.8b-target-matched-math-240k(239,467 rows, target-matched Nemotron math), 1 epoch. - Hyperparameters: lr 3e-4, batch 2 / accum 2, max-length 3072, block-size 16, num-anchors 512, loss-decay-gamma 7, HF target backend, FSDP.
- Drafter architecture: 5-layer softmax transformer; reads target hidden states at layers
[3, 7, 11, 15, 19]; mask token id248077; block size 16. - Recipe in la-draftery:
recipes/train_phase1.2.sh.
How to use
Inference / benchmarking via la-draftery (recommended)
git clone https://github.com/aaronzhfeng/la-draftery
cd la-draftery && pip install -e .
# Download this drafter checkpoint
huggingface-cli download Moonlight556/Qwen3.5-0.8B-DFlash --local-dir ./ckpt/qwen35_0.8b_dflash
# Offline acceptance bench
python tools/bench/bench_dflash.py \
--config configs/bench_dflash_qwen3.5_0.8b.yaml \
--checkpoint ./ckpt/qwen35_0.8b_dflash
# Serving speedup bench (SGLang spec-v2)
bash tools/sglang_specv2_speedup_run.sh <GPU> ./ckpt/qwen35_0.8b_dflash ./out/speedup
Serving via SGLang directly
sglang launch_server \
--model-path Qwen/Qwen3.5-0.8B --trust-remote-code --dtype bfloat16 \
--attention-backend triton --linear-attn-backend triton \
--mamba-scheduler-strategy extra_buffer \
--speculative-algorithm DFLASH \
--speculative-draft-model-path Moonlight556/Qwen3.5-0.8B-DFlash \
--speculative-num-draft-tokens 16
Resume training from this checkpoint
huggingface-cli download Moonlight556/Qwen3.5-0.8B-DFlash --local-dir ./ckpt
# point la-draftery's recipe at this ckpt dir as the starting point;
# training_state.pt restores optimizer + LR + step state.
Files
model.safetensors(173 MB) โ the drafter weights.config.jsonโ model config (block_size, target_layer_ids, mask_token_id).dflash.pyโ the model implementation (loaded viaAutoModel'sauto_map).training_state.pt(692 MB) โ optimizer + scheduler + step state; only needed for resume.
Caveats
- DFlash "lossless" in this stack is distribution-level, not bit-identical. Output equivalence with target-only greedy is ~10-15% on Math500/64; quality (boxed rate, math accuracy) is preserved (
docs/008,docs/009,docs/015). - This is a 0.8B-target drafter โ bigger targets (Qwen3.5-4B+) typically achieve higher accept lengths in the published z-lab line (7.11+); see
docs/015for the offline-vs-serving analysis.
License
Apache-2.0 (same as the base target model). The training code is MIT (SGLang team, sgl-project/SpecForge + mem-research/specforge fork); see la-draftery LICENSE.
Citation
If you use this drafter, please credit the SGLang team's SpecForge / DFlash work and la-draftery.
- Downloads last month
- 26
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐ Ask for provider support