GLM-5.2 DSpark speculator (preview)

DSpark (DFlash backbone + Markov logit-bias head + per-position confidence head) draft model for zai-org/GLM-5.2-FP8, trained with speculators.

Testing / preview checkpoint. A working first cut while we iterate on the recipe, expect a stronger replacement. Not a final release.

UPDATE Please use the full checkpoint here https://huggingface.co/RedHatAI/GLM-5.2-speculator.dspark

Training

Online training: the draft consumes hidden states streamed on-the-fly from a live GLM-5.2-FP8 vLLM server (TP4), with the trainer running data-parallel on the remaining GPUs. Used 8xB300 graciously provided by Verda Cloud.

Data: 50k UltraChat prompts regenerated by GLM-5.2-FP8 itself (its own reasoning + responses), seq_len=4096.
Recipe: 3 epochs, lr 6e-4, cosine schedule, SiLU; 5 draft layers, block_size=8, draft vocab 32000, aux layers [8, 23, 39, 55, 70].
Loss: {ce: 0.1, tv: 0.9} + confidence-head BCE.
max_anchors=1024.

Validation (epoch 3)

metric	value
mean accepted length	2.748
full accuracy	0.454
mean acceptance rate	0.411
confidence abs error	0.048

Per-position acceptance (positions 1-7): 0.711 / 0.560 / 0.464 / 0.407 / 0.369 / 0.342 / 0.320

The later-position accuracy holds up well (the Markov head's job against suffix decay).

Acceptance length in vLLM (HumanEval / math)

Measured end-to-end in vLLM MRV2 speculative decoding (using nightly with this PR https://github.com/vllm-project/vllm/pull/47093), serving the base model with this speculator:

uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

vllm serve zai-org/GLM-5.2-FP8 -tp 4 \
  --speculative-config '{"method":"dspark","model":"mgoin/GLM-5.2-speculator.dspark-preview","num_speculative_tokens":7,"attention_backend":"FLASH_ATTN","draft_sample_method":"probabilistic"}'

Prompts are from RedHatAI/speculator_benchmarks (80 HumanEval stubs, all 80 math_reasoning problems), one chat turn each. Per-position acceptance and mean accepted length are computed from vLLM's ground-truth spec_decode_num_accepted_tokens_per_pos / num_drafts counter deltas. Avg Length = 1 + Σ(positions) = mean tokens confirmed per target forward pass (1.0 = no speedup, up to 8.0 = all 7 drafts accepted).

Greedy (temperature=0)

Dataset	Pos 1	Pos 2	Pos 3	Pos 4	Pos 5	Pos 6	Pos 7	Avg Length
HumanEval	64.3%	37.3%	18.2%	8.2%	3.6%	1.3%	0.5%	2.33
math_reasoning	78.2%	55.9%	34.3%	21.6%	12.9%	6.5%	3.5%	3.13

Default sampling (temperature=1.0, top_p=0.95)

Dataset	Pos 1	Pos 2	Pos 3	Pos 4	Pos 5	Pos 6	Pos 7	Avg Length
HumanEval	62.4%	33.6%	15.1%	6.1%	2.3%	0.6%	0.2%	2.20
math_reasoning	75.4%	51.6%	29.3%	17.7%	10.1%	5.0%	2.6%	2.92

Downloads last month: 51

Safetensors

Model size

3B params

Tensor type

I64

BF16

BOOL

Model tree for RedHatAI/GLM-5.2-speculator.dspark-preview

Base model

zai-org/GLM-5.2-FP8

Finetuned

(3)

this model