GLM-5.2 DSpark speculator (preview)

DSpark (DFlash backbone + Markov logit-bias head + per-position confidence head) draft model for zai-org/GLM-5.2-FP8, trained with speculators.

Testing / preview checkpoint. This is a working first cut while we iterate on the recipe; expect a stronger replacement. Not a final release.

Training

Online training: the draft consumes hidden states streamed on the fly from a live GLM-5.2-FP8 vLLM server (TP4), with the trainer running data-parallel on the remaining GPUs. Training used 8x B300 GPUs graciously provided by Verda Cloud.

  • Data: 50k UltraChat prompts regenerated by GLM-5.2-FP8 itself (its own reasoning + responses), seq_len=4096.
  • Recipe: 3 epochs, lr 6e-4, cosine schedule, SiLU; 5 draft layers, block_size=8, draft vocab 32000, aux layers [8, 23, 39, 55, 70].
  • Loss: {ce: 0.1, tv: 0.9} + confidence-head BCE.
  • max_anchors=1024.
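
As a rough illustration of the {ce: 0.1, tv: 0.9} weighting above, here is a minimal sketch of a weighted cross-entropy plus total-variation loss for a single token position. It is not the speculators implementation: the function names, the use of soft-label cross-entropy against the target distribution, and the omission of the confidence-head BCE term are all assumptions made for this example. The heavy TV weight is plausible because, under standard speculative sampling, the probability of accepting a drafted token equals 1 - TV(target, draft).

```python
import math

def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def cross_entropy(p_target, q_draft, eps=1e-12):
    """Soft-label cross-entropy H(p, q) = -sum_i p_i * log(q_i).

    Assumption: the target model's full distribution is the label.
    """
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p_target, q_draft))

def draft_loss(p_target, q_draft, w_ce=0.1, w_tv=0.9):
    """Weighted sum mirroring {ce: 0.1, tv: 0.9}; confidence-head BCE is omitted."""
    return (w_ce * cross_entropy(p_target, q_draft)
            + w_tv * total_variation(p_target, q_draft))

# Hypothetical 3-token vocabulary, for illustration only.
target = [0.7, 0.2, 0.1]
draft = [0.5, 0.3, 0.2]
print(round(total_variation(target, draft), 3))  # 0.2, i.e. 80% acceptance probability
```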

Validation (epoch 3)

metric                  value
mean accepted length    2.748
full accuracy           0.454
mean acceptance rate    0.411
confidence abs error    0.048

Per-position acceptance (positions 1-7): 0.711 / 0.560 / 0.464 / 0.407 / 0.369 / 0.342 / 0.320

Later-position acceptance holds up well, declining from 0.711 at position 1 to 0.320 at position 7. Countering this suffix decay is the Markov head's job.

Acceptance length in vLLM (HumanEval / math)

Measured end-to-end with vLLM MRV2 speculative decoding (using a nightly build with this PR: https://github.com/vllm-project/vllm/pull/47093), serving the base model with this speculator:

uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

vllm serve zai-org/GLM-5.2-FP8 -tp 4 \
  --speculative-config '{"method":"dspark","model":"mgoin/GLM-5.2-speculator.dspark-preview","num_speculative_tokens":7,"attention_backend":"FLASH_ATTN","draft_sample_method":"probabilistic"}'

Prompts are from RedHatAI/speculator_benchmarks (80 HumanEval stubs, all 80 math_reasoning problems), one chat turn each. Per-position acceptance and mean accepted length are computed from deltas of vLLM's ground-truth spec_decode_num_accepted_tokens_per_pos and num_drafts counters. Avg Length = 1 + Σ(per-position acceptance rates) = mean tokens confirmed per target forward pass (1.0 = no speedup, up to 8.0 = all 7 drafts accepted).
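
The counter arithmetic described above can be sketched as follows. This is an illustration only, not the benchmark script: the snapshot dict layout and the function name are hypothetical, and in practice the two snapshots would be read from the server's counters before and after a benchmark run.

```python
def acceptance_summary(before, after, num_positions=7):
    """Per-position acceptance rates and average length from two counter snapshots.

    Each snapshot is a dict (hypothetical layout) with:
      "num_drafts": total draft proposals so far
      "accepted_per_pos": accepted-token counts, one entry per draft position
    """
    drafts = after["num_drafts"] - before["num_drafts"]
    if drafts <= 0:
        raise ValueError("no drafts recorded between the two snapshots")
    rates = [
        (after["accepted_per_pos"][i] - before["accepted_per_pos"][i]) / drafts
        for i in range(num_positions)
    ]
    # One token per target forward pass is always produced, hence the leading 1.
    avg_length = 1.0 + sum(rates)
    return rates, avg_length

# Made-up counters chosen to reproduce the greedy HumanEval row (1000 drafts).
before = {"num_drafts": 100, "accepted_per_pos": [10, 8, 6, 4, 2, 1, 0]}
after = {"num_drafts": 1100, "accepted_per_pos": [653, 381, 188, 86, 38, 14, 5]}
rates, avg_length = acceptance_summary(before, after)
print(round(avg_length, 2))  # 2.33
```

Taking deltas rather than raw counter values keeps traffic from earlier requests on the same server out of the measurement.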

Greedy (temperature=0)

Dataset         Pos 1  Pos 2  Pos 3  Pos 4  Pos 5  Pos 6  Pos 7  Avg Length
HumanEval       64.3%  37.3%  18.2%   8.2%   3.6%   1.3%   0.5%  2.33
math_reasoning  78.2%  55.9%  34.3%  21.6%  12.9%   6.5%   3.5%  3.13

Default sampling (temperature=1.0, top_p=0.95)

Dataset         Pos 1  Pos 2  Pos 3  Pos 4  Pos 5  Pos 6  Pos 7  Avg Length
HumanEval       62.4%  33.6%  15.1%   6.1%   2.3%   0.6%   0.2%  2.20
math_reasoning  75.4%  51.6%  29.3%  17.7%  10.1%   5.0%   2.6%  2.92