GLM-5.2-504B (REAP keep-168, NVFP4)

A 34%-pruned GLM-5.2 — 168 of 256 routed experts kept per layer (incl. the MTP layer), NVFP4-quantized, ~504B params, recovered via gate-only Router-KD to the unpruned teacher. This is the largest / highest-quality of the sibling cuts, and on a well-powered real-world eval it reaches parity with the full unpruned GLM-5.2.

🙏 Sponsor

Pruning, distillation, and evaluation ran on 8× NVIDIA B200 sponsored by Lambda. Thank you, Lambda. 🙏

What this is

  • Arch: GlmMoeDsaForCausalLM — 78 layers (3 dense + 75 MoE) + 1 MTP layer, DeepSeek Sparse Attention, sigmoid router (top-8), 1 shared expert, hidden 6144.
  • Prune: REAP (saliency = gate × ‖expert_output‖) → top-168/layer, consistent across all MoE layers and the MTP layer; n_routed_experts: 168 (loads cleanly in vLLM).
  • Quant: NVFP4 (modelopt) routed experts; BF16 router / attention / shared expert.

Recovery (Router-KD)

Freeze experts + backbone; train only the 75 router gates (~0.016% of params) to KL-match the unpruned GLM-5.2 teacher's next-token distribution (plain uniform weighting, lr 5e-5).

Eval — n=2000 held-out real prompts (raw sampling, no max_tokens / no timeout)

Loops are detected, not truncated. 2000 probes harvested from real coding-agent traces (codex, opencode, cursor, claude-code), held out from training.

metric keep-168 + Router-KD
attractor / loop rate 0.072
natural-EOS rate 0.928
output diversity (distinct-4) 0.880
median output length 1267 tok

At this scale the difference vs the unpruned teacher is within noise — i.e. parity, measured on 2000 samples (not the usual n=50). The residual loops are inherent to GLM-5.2 (the unpruned teacher exhibits the same </think>-restart loops on the same prompts), so they're not a pruning artifact.

Serving (vLLM)

vllm serve 0xSero/GLM-5.2-504B --tensor-parallel-size 8 \
  --quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 262144

Tip — brevity prompt: a system prompt like "Be concise. Think only as much as the task needs, then answer and stop." roughly halves median output length at no retraining cost.

More


Compute sponsored by Lambda — thank you. 🙏

Honest note (n=2000)

The unpruned teacher loops on only 3.6% of these prompts vs ~7-8% for this pruned cut — REAP pruning roughly doubles the loop rate, and gate-only Router-KD (even on full data) does not close it. Earlier small-n evals suggesting parity were a sampling fluke. A knowledge-recovery LoRA is in progress to add capacity back.

Downloads last month
126
Safetensors
Model size
292B params
Tensor type
BF16
·
F32
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/GLM-5.2-504B-FullKD

Base model

zai-org/GLM-5.2
Quantized
(73)
this model

Collection including 0xSero/GLM-5.2-504B-FullKD