Confucius3-Math-DFlash (draft model)

A DFlash block-diffusion speculative-decoding draft model for netease-youdao/Confucius3-Math. Use it as the --speculative-config model to accelerate Confucius3-Math inference (especially single-stream / low-latency math reasoning).

Target model: netease-youdao/Confucius3-Math (Qwen2 arch, 48 layers, DeepSeek-R1-distill thinking format)
Draft: 5-layer DFlashDraftModel, block size 16, ~1.5B params, taps target hidden states from layers [1,12,23,34,45]
Trained with: SpecForge, D-PACE loss, 6 epochs

Results (acceptance length = mean tokens accepted per draft+verify step, thinking mode)

dataset	accept length	draft accept rate	tok/s (single stream)
GSM8K	5.47	30%	493
MATH-500	5.79	32%	526

Higher acceptance ⇒ more tokens emitted per target forward ⇒ larger speedup. Profiled on 1×H200, vLLM 0.22, temperature 0.

Usage (vLLM)

vllm serve netease-youdao/Confucius3-Math \
  --speculative-config '{"method": "dflash", "model": "noctuashap/Confucius3-Math-DFlash", "num_speculative_tokens": 15}' \
  --trust-remote-code

DFlash is supported in vLLM ≥ 0.20.1. --trust-remote-code is required (the draft is a custom DFlashDraftModel, included as dflash.py).

Training data

~148k math-leaning prompts (NuminaMath / MATH / GSM8K / OpenMathReasoning + some code/reasoning/general), regenerated by Confucius3-Math itself (thinking traces kept inline) so the draft matches the target's own output distribution. No correctness filtering (distribution matching, not correctness).

Built with Claude Code.

Downloads last month: -

Safetensors

Model size

2B params

Tensor type

BF16

Model tree for noctuashap/Confucius3-Math-DFlash

Base model

deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

Finetuned

netease-youdao/Confucius3-Math

Finetuned

(1)

this model