Qwen3.5-0.8B DFlash (la-draftery Phase 1.2)

A 5-layer DFlash speculative-decoding drafter for Qwen/Qwen3.5-0.8B, trained in la-draftery using the SGLang team's SpecForge training framework (via the mem-research/specforge dev/math-regen-sweep fork).

Headline numbers

Metric	Value	Reference
Math500 mean acceptance length (offline)	5.7095	`docs/012`
SGLang spec-v2 serving speedup (Math500/64)	1.84x	`docs/015`
Serving accept length	5.34	docs/015
Prompts faster than target-only	64 / 64 (min 1.28x, max 2.94x)	docs/015
Quality parity (Math500 boxed + accuracy)	spec >= target	docs/015

Training

Base target: Qwen/Qwen3.5-0.8B (hybrid GDN, 24 layers, hidden 1024).
Training data: Moonlight556/qwen3.5-0.8b-target-matched-math-240k (239,467 rows, target-matched Nemotron math), 1 epoch.
Hyperparameters: lr 3e-4, batch 2 / accum 2, max-length 3072, block-size 16, num-anchors 512, loss-decay-gamma 7, HF target backend, FSDP.
Drafter architecture: 5-layer softmax transformer; reads target hidden states at layers [3, 7, 11, 15, 19]; mask token id 248077; block size 16.
Recipe in la-draftery: recipes/train_phase1.2.sh.

How to use

Inference / benchmarking via la-draftery (recommended)

git clone https://github.com/aaronzhfeng/la-draftery
cd la-draftery && pip install -e .

# Download this drafter checkpoint
huggingface-cli download Moonlight556/Qwen3.5-0.8B-DFlash --local-dir ./ckpt/qwen35_0.8b_dflash

# Offline acceptance bench
python tools/bench/bench_dflash.py \
  --config configs/bench_dflash_qwen3.5_0.8b.yaml \
  --checkpoint ./ckpt/qwen35_0.8b_dflash

# Serving speedup bench (SGLang spec-v2)
bash tools/sglang_specv2_speedup_run.sh <GPU> ./ckpt/qwen35_0.8b_dflash ./out/speedup

Serving via SGLang directly

sglang launch_server \
  --model-path Qwen/Qwen3.5-0.8B --trust-remote-code --dtype bfloat16 \
  --attention-backend triton --linear-attn-backend triton \
  --mamba-scheduler-strategy extra_buffer \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path Moonlight556/Qwen3.5-0.8B-DFlash \
  --speculative-num-draft-tokens 16

Resume training from this checkpoint

huggingface-cli download Moonlight556/Qwen3.5-0.8B-DFlash --local-dir ./ckpt
# point la-draftery's recipe at this ckpt dir as the starting point;
# training_state.pt restores optimizer + LR + step state.

Files

model.safetensors (173 MB) — the drafter weights.
config.json — model config (block_size, target_layer_ids, mask_token_id).
dflash.py — the model implementation (loaded via AutoModel's auto_map).
training_state.pt (692 MB) — optimizer + scheduler + step state; only needed for resume.

Caveats

DFlash "lossless" in this stack is distribution-level, not bit-identical. Output equivalence with target-only greedy is ~10-15% on Math500/64; quality (boxed rate, math accuracy) is preserved (docs/008, docs/009, docs/015).
This is a 0.8B-target drafter — bigger targets (Qwen3.5-4B+) typically achieve higher accept lengths in the published z-lab line (7.11+); see docs/015 for the offline-vs-serving analysis.

License

Apache-2.0 (same as the base target model). The training code is MIT (SGLang team, sgl-project/SpecForge + mem-research/specforge fork); see la-draftery LICENSE.

Citation

If you use this drafter, please credit the SGLang team's SpecForge / DFlash work and la-draftery.

Downloads last month: 26

Safetensors

Model size

86.5M params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Moonlight556/Qwen3.5-0.8B-DFlash

Base model

Qwen/Qwen3.5-0.8B-Base

Finetuned

Qwen/Qwen3.5-0.8B

Finetuned

(225)

this model