GLM-5.2-504B (REAP keep-168, NVFP4)
A 34%-pruned GLM-5.2 — 168 of 256 routed experts kept per layer (incl. the MTP layer), NVFP4-quantized, ~504B params, recovered via gate-only Router-KD to the unpruned teacher. This is the largest / highest-quality of the sibling cuts, and on a well-powered real-world eval it reaches parity with the full unpruned GLM-5.2.
🙏 Sponsor
Pruning, distillation, and evaluation ran on 8× NVIDIA B200 sponsored by Lambda. Thank you, Lambda. 🙏
What this is
- Arch:
GlmMoeDsaForCausalLM— 78 layers (3 dense + 75 MoE) + 1 MTP layer, DeepSeek Sparse Attention, sigmoid router (top-8), 1 shared expert, hidden 6144. - Prune: REAP (saliency =
gate × ‖expert_output‖) → top-168/layer, consistent across all MoE layers and the MTP layer;n_routed_experts: 168(loads cleanly in vLLM). - Quant: NVFP4 (modelopt) routed experts; BF16 router / attention / shared expert.
Recovery (Router-KD)
Freeze experts + backbone; train only the 75 router gates (~0.016% of params) to KL-match the unpruned GLM-5.2 teacher's next-token distribution (plain uniform weighting, lr 5e-5).
Eval — n=2000 held-out real prompts (raw sampling, no max_tokens / no timeout)
Loops are detected, not truncated. 2000 probes harvested from real coding-agent traces (codex, opencode, cursor, claude-code), held out from training.
| metric | keep-168 + Router-KD |
|---|---|
| attractor / loop rate | 0.072 |
| natural-EOS rate | 0.928 |
| output diversity (distinct-4) | 0.880 |
| median output length | 1267 tok |
At this scale the difference vs the unpruned teacher is within noise — i.e. parity, measured on
2000 samples (not the usual n=50). The residual loops are inherent to GLM-5.2 (the unpruned teacher
exhibits the same </think>-restart loops on the same prompts), so they're not a pruning artifact.
Serving (vLLM)
vllm serve 0xSero/GLM-5.2-504B --tensor-parallel-size 8 \
--quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 262144
Tip — brevity prompt: a system prompt like "Be concise. Think only as much as the task needs, then answer and stop." roughly halves median output length at no retraining cost.
More
- GGUF builds (BF16 + Q4/Q3/Q2 dynamic):
0xSero/GLM-5.2-REAP-504B-GGUF - Siblings:
0xSero/GLM-5.2-481B(keep-160),0xSero/GLM-5.2-469B(keep-156)
Compute sponsored by Lambda — thank you. 🙏
Honest note (n=2000)
The unpruned teacher loops on only 3.6% of these prompts vs ~7-8% for this pruned cut — REAP pruning roughly doubles the loop rate, and gate-only Router-KD (even on full data) does not close it. Earlier small-n evals suggesting parity were a sampling fluke. A knowledge-recovery LoRA is in progress to add capacity back.
- Downloads last month
- 126
Model tree for 0xSero/GLM-5.2-504B-FullKD
Base model
zai-org/GLM-5.2