GLM-5.2-504B-K — knowledge-augmented REAP keep-168 (full-data Router-KD, NVFP4)
The "K-cut" sibling of 0xSero/GLM-5.2-504B: the
same 504B / keep-168 budget, but the expert selection is biased toward knowledge & reasoning —
the winning top-160 core (kept bit-for-bit) plus the 8 highest-priority knowledge-exclusive experts
per layer that coding-saliency pruning drops. Recovered with gate-only **Router-KD trained on the
FULL calibration set (18.6k real traces)** — 6x the data of the first-pass cuts.
Sponsor
8x NVIDIA B200 sponsored by Lambda. Thank you.
Why this variant exists
REAP saliency computed from coding traces under-weights experts that fire mainly on
reasoning/knowledge. The K-cut deliberately re-includes them — trading a sliver of coding-saliency
coverage for broader knowledge coverage. Reach for this on knowledge/reasoning-heavy workloads; use
the plain GLM-5.2-504B otherwise.
Eval (n=2000 held-out real prompts, raw, no max_tokens / no timeout)
| metric | GLM-5.2-504B-K (this) | GLM-5.2-504B (plain floor) |
|---|---|---|
| attractor / loop rate | 0.078 | 0.072 |
| natural-EOS rate | 0.923 | 0.928 |
| distinct-4 | 0.881 | 0.880 |
| median tokens | 1232 | 1267 |
Serving (vLLM)
vllm serve 0xSero/GLM-5.2-504B-K --tensor-parallel-size 8 \
--quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 262144
REAP knowledge-augmented cut + full-data Router-KD. Compute sponsored by Lambda.
Honest note (n=2000)
The unpruned teacher loops on only 3.6% of these prompts vs ~7-8% for this pruned cut — REAP pruning roughly doubles the loop rate, and gate-only Router-KD (even on full data) does not close it. Earlier small-n evals suggesting parity were a sampling fluke. A knowledge-recovery LoRA is in progress to add capacity back.
- Downloads last month
- 126
Model tree for 0xSero/GLM-5.2-504B-K
Base model
zai-org/GLM-5.2