GLM-5.2 DSpark speculator (preview)

DSpark (DFlash backbone + Markov logit-bias head + per-position confidence head) draft model for zai-org/GLM-5.2-FP8, trained with speculators.

Testing / preview checkpoint: a working first cut while we iterate on the recipe. Expect a stronger replacement; this is not a final release.

UPDATE: Please use the full checkpoint here: https://huggingface.co/RedHatAI/GLM-5.2-speculator.dspark

Training

Online training: the draft consumes hidden states streamed on-the-fly from a live GLM-5.2-FP8 vLLM server (TP4), with the trainer running data-parallel on the remaining GPUs. Training ran on 8xB300 GPUs graciously provided by Verda Cloud.

  • Data: 50k UltraChat prompts regenerated by GLM-5.2-FP8 itself (its own reasoning + responses), seq_len=4096.
  • Recipe: 3 epochs, lr 6e-4, cosine schedule, SiLU; 5 draft layers, block_size=8, draft vocab 32000, aux layers [8, 23, 39, 55, 70].
  • Loss: {ce: 0.1, tv: 0.9} + confidence-head BCE.
  • max_anchors=1024.
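
As a rough illustration of the loss weighting above, here is a minimal pure-Python sketch for a single draft position. It rests on assumptions that the recipe does not spell out: that "ce" is cross-entropy of the draft distribution against the target model's distribution, that "tv" is total variation distance, and that the confidence head is trained with BCE against a 0/1 "was this token accepted" label added with weight 1. The function names are hypothetical, and the actual speculators implementation may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, draft_probs, eps=1e-12):
    """Cross-entropy H(p, q) with p = target, q = draft (assumed soft targets)."""
    return -sum(p * math.log(q + eps) for p, q in zip(target_probs, draft_probs))

def total_variation(target_probs, draft_probs):
    """Total variation distance: half the L1 distance between distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(target_probs, draft_probs))

def binary_cross_entropy(confidence, accepted, eps=1e-12):
    """BCE for a confidence in (0, 1) against a 0/1 label (label is an assumption)."""
    return -(accepted * math.log(confidence + eps)
             + (1 - accepted) * math.log(1 - confidence + eps))

def draft_loss(target_logits, draft_logits, confidence, accepted,
               w_ce=0.1, w_tv=0.9):
    """Weighted ce + tv loss plus confidence BCE for one position (sketch only)."""
    p = softmax(target_logits)
    q = softmax(draft_logits)
    token_loss = w_ce * cross_entropy(p, q) + w_tv * total_variation(p, q)
    return token_loss + binary_cross_entropy(confidence, accepted)

# Toy 4-token vocabulary; the numbers are made up for illustration.
loss = draft_loss([2.0, 0.5, 0.1, -1.0], [1.5, 0.7, 0.0, -0.5],
                  confidence=0.7, accepted=1)
print(f"loss = {loss:.4f}")
```

A heavy TV weight is a natural fit for speculative decoding: under standard rejection sampling, the probability that a drafted token is accepted equals 1 minus the total variation distance between the draft and target distributions, so lowering TV raises acceptance directly.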

Validation (epoch 3)

Metric                  Value
mean accepted length    2.748
full accuracy           0.454
mean acceptance rate    0.411
confidence abs error    0.048

Per-position acceptance (positions 1-7): 0.711 / 0.560 / 0.464 / 0.407 / 0.369 / 0.342 / 0.320

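Later-position accuracy holds up well: it falls from 0.711 at position 1 to 0.320 at position 7, and the curve flattens over the last few positions. Countering this suffix decay is the Markov head's job.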

Acceptance length in vLLM (HumanEval / math)

Measured end-to-end in vLLM MRV2 speculative decoding (using nightly with this PR https://github.com/vllm-project/vllm/pull/47093), serving the base model with this speculator:

uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

vllm serve zai-org/GLM-5.2-FP8 -tp 4 \
  --speculative-config '{"method":"dspark","model":"mgoin/GLM-5.2-speculator.dspark-preview","num_speculative_tokens":7,"attention_backend":"FLASH_ATTN","draft_sample_method":"probabilistic"}'
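
The JSON passed to --speculative-config is easy to mis-quote in a shell. As a small convenience sketch (the helper name build_serve_command is hypothetical and not part of vLLM), the same command can be assembled in Python, with json.dumps producing the config and shlex.join handling the quoting:

```python
import json
import shlex

# Same speculative-decoding settings as the command above.
spec_config = {
    "method": "dspark",
    "model": "mgoin/GLM-5.2-speculator.dspark-preview",
    "num_speculative_tokens": 7,
    "attention_backend": "FLASH_ATTN",
    "draft_sample_method": "probabilistic",
}

def build_serve_command(base_model, tp_size, config):
    """Return a shell-safe `vllm serve` command string (hypothetical helper)."""
    config_json = json.dumps(config, separators=(",", ":"))
    args = ["vllm", "serve", base_model, "-tp", str(tp_size),
            "--speculative-config", config_json]
    return shlex.join(args)

command = build_serve_command("zai-org/GLM-5.2-FP8", 4, spec_config)
print(command)
```

Changing num_speculative_tokens or the sampling method then becomes a dictionary edit, with no hand-escaped JSON to maintain.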

Prompts are from RedHatAI/speculator_benchmarks (80 HumanEval stubs, all 80 math_reasoning problems), one chat turn each. Per-position acceptance and mean accepted length are computed from vLLM's ground-truth spec_decode_num_accepted_tokens_per_pos / num_drafts counter deltas. Avg Length = 1 + Σ(positions) = mean tokens confirmed per target forward pass (1.0 = no speedup, up to 8.0 = all 7 drafts accepted).
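
The Avg Length formula above can be sketched in a few lines of Python. The function name and the counter values are hypothetical: the raw counter deltas are not published here, so the counts below are simply chosen to reproduce the greedy HumanEval acceptance rates per 1000 drafts.

```python
def acceptance_stats(accepted_per_pos, num_drafts):
    """Turn counter deltas into per-position acceptance rates and Avg Length.

    accepted_per_pos[i] is how many drafts had their token at position i+1
    accepted; num_drafts is the number of draft rounds in the same window.
    """
    if num_drafts <= 0:
        raise ValueError("num_drafts must be positive")
    rates = [count / num_drafts for count in accepted_per_pos]
    # One token always comes from the target pass itself, hence the leading 1.
    avg_length = 1.0 + sum(rates)
    return rates, avg_length

# Hypothetical deltas matching 64.3% / 37.3% / 18.2% / 8.2% / 3.6% / 1.3% / 0.5%.
accepted = [643, 373, 182, 82, 36, 13, 5]
rates, avg_length = acceptance_stats(accepted, num_drafts=1000)
print(f"{avg_length:.2f}")  # prints 2.33
```

With 7 speculative tokens the result is bounded between 1.0 (nothing accepted) and 8.0 (every draft token accepted).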

Greedy (temperature=0)

Dataset          Pos 1   Pos 2   Pos 3   Pos 4   Pos 5   Pos 6   Pos 7   Avg Length
HumanEval        64.3%   37.3%   18.2%    8.2%    3.6%    1.3%    0.5%   2.33
math_reasoning   78.2%   55.9%   34.3%   21.6%   12.9%    6.5%    3.5%   3.13

Default sampling (temperature=1.0, top_p=0.95)

Dataset          Pos 1   Pos 2   Pos 3   Pos 4   Pos 5   Pos 6   Pos 7   Avg Length
HumanEval        62.4%   33.6%   15.1%    6.1%    2.3%    0.6%    0.2%   2.20
math_reasoning   75.4%   51.6%   29.3%   17.7%   10.1%    5.0%    2.6%   2.92

Model size: 3B params (Safetensors; tensor types I64, BF16, BOOL)