Qwen3.5-0.8B DFlash (la-draftery Phase 1.2)

A 5-layer DFlash speculative-decoding drafter for Qwen/Qwen3.5-0.8B, trained in la-draftery using the SGLang team's SpecForge training framework (via the mem-research/specforge dev/math-regen-sweep fork).

Headline numbers

Metric Value Reference
Math500 mean acceptance length (offline) 5.7095 docs/012
SGLang spec-v2 serving speedup (Math500/64) 1.84x docs/015
Serving accept length 5.34 docs/015
Prompts faster than target-only 64 / 64 (min 1.28x, max 2.94x) docs/015
Quality parity (Math500 boxed + accuracy) spec >= target docs/015

Training

  • Base target: Qwen/Qwen3.5-0.8B (hybrid GDN, 24 layers, hidden 1024).
  • Training data: Moonlight556/qwen3.5-0.8b-target-matched-math-240k (239,467 rows, target-matched Nemotron math), 1 epoch.
  • Hyperparameters: lr 3e-4, batch 2 / accum 2, max-length 3072, block-size 16, num-anchors 512, loss-decay-gamma 7, HF target backend, FSDP.
  • Drafter architecture: 5-layer softmax transformer; reads target hidden states at layers [3, 7, 11, 15, 19]; mask token id 248077; block size 16.
  • Recipe in la-draftery: recipes/train_phase1.2.sh.

How to use

Inference / benchmarking via la-draftery (recommended)

git clone https://github.com/aaronzhfeng/la-draftery
cd la-draftery && pip install -e .

# Download this drafter checkpoint
huggingface-cli download Moonlight556/Qwen3.5-0.8B-DFlash --local-dir ./ckpt/qwen35_0.8b_dflash

# Offline acceptance bench
python tools/bench/bench_dflash.py \
  --config configs/bench_dflash_qwen3.5_0.8b.yaml \
  --checkpoint ./ckpt/qwen35_0.8b_dflash

# Serving speedup bench (SGLang spec-v2)
bash tools/sglang_specv2_speedup_run.sh <GPU> ./ckpt/qwen35_0.8b_dflash ./out/speedup

Serving via SGLang directly

sglang launch_server \
  --model-path Qwen/Qwen3.5-0.8B --trust-remote-code --dtype bfloat16 \
  --attention-backend triton --linear-attn-backend triton \
  --mamba-scheduler-strategy extra_buffer \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path Moonlight556/Qwen3.5-0.8B-DFlash \
  --speculative-num-draft-tokens 16

Resume training from this checkpoint

huggingface-cli download Moonlight556/Qwen3.5-0.8B-DFlash --local-dir ./ckpt
# point la-draftery's recipe at this ckpt dir as the starting point;
# training_state.pt restores optimizer + LR + step state.

Files

  • model.safetensors (173 MB) โ€” the drafter weights.
  • config.json โ€” model config (block_size, target_layer_ids, mask_token_id).
  • dflash.py โ€” the model implementation (loaded via AutoModel's auto_map).
  • training_state.pt (692 MB) โ€” optimizer + scheduler + step state; only needed for resume.

Caveats

  • DFlash "lossless" in this stack is distribution-level, not bit-identical. Output equivalence with target-only greedy is ~10-15% on Math500/64; quality (boxed rate, math accuracy) is preserved (docs/008, docs/009, docs/015).
  • This is a 0.8B-target drafter โ€” bigger targets (Qwen3.5-4B+) typically achieve higher accept lengths in the published z-lab line (7.11+); see docs/015 for the offline-vs-serving analysis.

License

Apache-2.0 (same as the base target model). The training code is MIT (SGLang team, sgl-project/SpecForge + mem-research/specforge fork); see la-draftery LICENSE.

Citation

If you use this drafter, please credit the SGLang team's SpecForge / DFlash work and la-draftery.

Downloads last month
26
Safetensors
Model size
86.5M params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for Moonlight556/Qwen3.5-0.8B-DFlash

Finetuned
(225)
this model