GLM-5.2 DSpark speculator (preview)
DSpark (DFlash backbone + Markov logit-bias head + per-position confidence head)
draft model for zai-org/GLM-5.2-FP8, trained with
speculators.
Testing / preview checkpoint. A working first cut while we iterate on the recipe, expect a stronger replacement. Not a final release.
Training
Online training: the draft consumes hidden states streamed on-the-fly from a live GLM-5.2-FP8 vLLM server (TP4), with the trainer running data-parallel on the remaining GPUs. Used 8xB300 graciously provided by Verda Cloud.
- Data: 50k UltraChat prompts regenerated by GLM-5.2-FP8 itself (its own
reasoning + responses),
seq_len=4096. - Recipe: 3 epochs, lr 6e-4, cosine schedule, SiLU; 5 draft layers,
block_size=8, draft vocab 32000, aux layers[8, 23, 39, 55, 70]. - Loss:
{ce: 0.1, tv: 0.9}+ confidence-head BCE. max_anchors=1024.
Validation (epoch 3)
| metric | value |
|---|---|
| mean accepted length | 2.748 |
| full accuracy | 0.454 |
| mean acceptance rate | 0.411 |
| confidence abs error | 0.048 |
Per-position acceptance (positions 1-7):
0.711 / 0.560 / 0.464 / 0.407 / 0.369 / 0.342 / 0.320
The later-position accuracy holds up well (the Markov head's job against suffix decay).
Acceptance length in vLLM (HumanEval / math)
Measured end-to-end in vLLM MRV2 speculative decoding (using nightly with this PR https://github.com/vllm-project/vllm/pull/47093), serving the base model with this speculator:
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
vllm serve zai-org/GLM-5.2-FP8 -tp 4 \
--speculative-config '{"method":"dspark","model":"mgoin/GLM-5.2-speculator.dspark-preview","num_speculative_tokens":7,"attention_backend":"FLASH_ATTN","draft_sample_method":"probabilistic"}'
Prompts are from
RedHatAI/speculator_benchmarks
(80 HumanEval stubs, all 80 math_reasoning problems), one chat turn each.
Per-position acceptance and mean accepted length are computed from vLLM's
ground-truth spec_decode_num_accepted_tokens_per_pos / num_drafts counter
deltas. Avg Length = 1 + 危(positions) = mean tokens confirmed per target forward
pass (1.0 = no speedup, up to 8.0 = all 7 drafts accepted).
Greedy (temperature=0)
| Dataset | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg Length |
|---|---|---|---|---|---|---|---|---|
| HumanEval | 64.3% | 37.3% | 18.2% | 8.2% | 3.6% | 1.3% | 0.5% | 2.33 |
| math_reasoning | 78.2% | 55.9% | 34.3% | 21.6% | 12.9% | 6.5% | 3.5% | 3.13 |
Default sampling (temperature=1.0, top_p=0.95)
| Dataset | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg Length |
|---|---|---|---|---|---|---|---|---|
| HumanEval | 62.4% | 33.6% | 15.1% | 6.1% | 2.3% | 0.6% | 0.2% | 2.20 |
| math_reasoning | 75.4% | 51.6% | 29.3% | 17.7% | 10.1% | 5.0% | 2.6% | 2.92 |
- Downloads last month
- -
Model tree for RedHatAI/GLM-5.2-speculator.dspark-preview
Base model
zai-org/GLM-5.2-FP8