GLM-5.2-504B (REAP keep-168, NVFP4)

A 34%-pruned GLM-5.2 — 168 of 256 routed experts kept per layer (incl. the MTP layer), NVFP4-quantized, ~504B params, recovered via gate-only Router-KD to the unpruned teacher. This is the largest / highest-quality of the sibling cuts, and on a well-powered real-world eval it reaches parity with the full unpruned GLM-5.2.

🙏 Sponsor

Pruning, distillation, and evaluation ran on 8× NVIDIA B200 sponsored by Lambda. Thank you, Lambda. 🙏

What this is

Arch: GlmMoeDsaForCausalLM — 78 layers (3 dense + 75 MoE) + 1 MTP layer, DeepSeek Sparse Attention, sigmoid router (top-8), 1 shared expert, hidden 6144.
Prune: REAP (saliency = gate × ‖expert_output‖) → top-168/layer, consistent across all MoE layers and the MTP layer; n_routed_experts: 168 (loads cleanly in vLLM).
Quant: NVFP4 (modelopt) routed experts; BF16 router / attention / shared expert.

Recovery (Router-KD)

Freeze experts + backbone; train only the 75 router gates (~0.016% of params) to KL-match the unpruned GLM-5.2 teacher's next-token distribution (plain uniform weighting, lr 5e-5).

Eval — n=2000 held-out real prompts (raw sampling, no max_tokens / no timeout)

Loops are detected, not truncated. 2000 probes harvested from real coding-agent traces (codex, opencode, cursor, claude-code), held out from training.

metric	keep-168 + Router-KD
attractor / loop rate	0.072
natural-EOS rate	0.928
output diversity (distinct-4)	0.880
median output length	1267 tok

At this scale the difference vs the unpruned teacher is within noise — i.e. parity, measured on 2000 samples (not the usual n=50). The residual loops are inherent to GLM-5.2 (the unpruned teacher exhibits the same </think>-restart loops on the same prompts), so they're not a pruning artifact.

Serving (vLLM)

vllm serve 0xSero/GLM-5.2-504B --tensor-parallel-size 8 \
  --quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 262144

Tip — brevity prompt: a system prompt like "Be concise. Think only as much as the task needs, then answer and stop." roughly halves median output length at no retraining cost.

GGUF builds (BF16 + Q4/Q3/Q2 dynamic): 0xSero/GLM-5.2-REAP-504B-GGUF
Siblings: 0xSero/GLM-5.2-481B (keep-160), 0xSero/GLM-5.2-469B (keep-156)

Compute sponsored by Lambda — thank you. 🙏

Honest note (n=2000)

The unpruned teacher loops on only 3.6% of these prompts vs ~7-8% for this pruned cut — REAP pruning roughly doubles the loop rate, and gate-only Router-KD (even on full data) does not close it. Earlier small-n evals suggesting parity were a sampling fluke. A knowledge-recovery LoRA is in progress to add capacity back.

Downloads last month: 126

Safetensors

Model size

292B params

Tensor type

BF16

F32

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/GLM-5.2-504B-FullKD

Base model

zai-org/GLM-5.2

Quantized

(73)

this model

Collection including 0xSero/GLM-5.2-504B-FullKD

GLM — REAP

Collection

REAP-pruned & quantized GLM-4.x / 5 / 5.1 (+ Flash fine-tunes). • 25 items • Updated 3 days ago • 1

0xSero
/

GLM-5.2-504B-FullKD