GLM-5.2-NVFP4-REAP-504B-term (termination-recalibrated variant)

A REAP (Router-weighted Expert Activation Pruning) of GLM-5.2 in NVFP4, pruned 256 → 168 routed experts/layer, built to run on 4× 96 GB Blackwell GPUs (RTX PRO 6000, SM120, TP4) with vLLM (b12x).

Credit where it's due: this whole line of work exists because of 0xSero/GLM-5.2-NVFP4-REAP-469B — the first NVFP4 REAP of GLM-5.2. We started from that idea and re-derived the prune. And the base, lukealonso/GLM-5.2-NVFP4, is an excellent, tightly-packed NVFP4 quant — our experts are sliced from it byte-for-byte (no re-quantization).

This is the -term sibling of GLM-5.2-NVFP4-REAP-504B (code-calibrated). Same base, same byte-exact slicing pipeline — the one difference is the calibration set: in addition to code/agentic data, saliency here was also measured on the full model's own complete terminating reasoning traces (self-distilled), with extra weight on the </think> transition region. The aim: keep the experts that carry a long reasoning chain smoothly to its conclusion, for cleaner extended (high/max) reasoning.

model	experts/layer	params (nominal)	size on disk
Full GLM-5.2 (luke NVFP4)	256	~753B	467.1 GB (435 GiB)
This model (`-term`)	168	~504B	308.9 GB (288 GiB)
sibling `-504B` (code-calib)	168	~504B	308.9 GB
0xSero REAP-469B	156	~469B	307.8 GB

(Note: on-disk GB doesn't track the nominal "B" param count — or even expert count — 1:1 once quant format and layout differ. This 168-expert model lands about even with 0xSero's 156-expert one.)

What's different from the code-calibrated `-504B`

Calibration = code/agentic + the full model's terminating reasoning traces (~30 self-distilled prompt → <think>…</think> → answer traces + </think>-region snippets weighted ×6). vs -504B's pure code calibration, this shifts the keep-set toward reasoning-flow experts.
Observed: reasons continuously and coherently (no stalling/premature pausing mid-thought) and self-terminates at high — and at max given a generous max_tokens budget — by thinking at length and then writing the answer (e.g. a single-file game) rather than looping.

Reasoning effort — only `high` and `max`

GLM-5.2 exposes two reasoning levels: high and max. The chat template defaults to max and treats anything else as max; there is no low/medium/minimal. Pass reasoning_effort: "high" (e.g. via chat_template_kwargs) for shorter, faster thinking; leave it default for max. This is a heavy thinker — at max it can reason for tens of thousands of tokens before answering, so give it a generous max_tokens (≈80–120k) or it will hit the cap mid-thought.

What it's calibrated on (and what that means)

Calibrated narrow, on purpose: code + tool-calling/agentic data (evol-codealpaca, Magicoder, xLAM function-calling, SWE-smith) plus the model's own terminating reasoning traces. We did not calibrate on broad/general, multilingual, or long-document data.

Stronger at: coding, tool use, and ending its own reasoning (the termination traces are the whole point of this variant).
Weaker at (expected): general knowledge, other languages, niche domains — those experts scored low on a code-heavy calibration, so they got dropped.
Long context still works — that lives in the attention, which isn't pruned (an internal 177k-token task scores 30/30); we just didn't add long-document calibration.

Want a broad general-purpose model instead? Calibrate on a wider mix (general + multilingual + long-context).

Honest limitations / which sibling to pick — READ THIS

Not A/B-validated against the code-calibrated -504B. Whether the trace-recalibration actually helps your workload is unmeasured; for pure coding the -504B may be more decisive/less verbose. Benchmark both on your own tasks.
"Max never terminates" is largely a property of the max tier, not a pruning defect. The full GLM-5.2 (even fp8) also produces 60k+ tokens of reasoning with no answer on hard open-ended prompts at max. This model targets coherent self-termination, not brevity.
Plausible tradeoff: calibrating on long terminating traces can bias toward longer reasoning — this variant may think more, not less, than -504B.
NVFP4 + prune are both lossy: generative/coding degrade little (REAP paper); knowledge/MCQA recall and non-English/niche domains are the weak axes.

Sampling — important

Use repetition_penalty ≤ 1.0 (1.0 = off). A penalty > 1.0 accumulates over long generations and spirals the (heavy) reasoning into synonym/token salad.

temperature 0.6, top_p 0.95, repetition_penalty 1.0

Method (no re-quantization)

Surviving experts are luke's NVFP4 weights bit-for-bit. Saliency S_j = mean_{x active}( g_j(x) · ‖f_j(x)‖₂ ) (router gate × raw-expert-output L2, norm taken before the gate) is accumulated over the calibration set via a custom collector that dequantizes each fired expert on the fly (modelopt NVFP4QTensor, block 16) — HF Transformers can't load modelopt fused-MoE NVFP4. All 256 experts in the 75 MoE layers (3–77) are scored, pure saliency, no frequency overlay (the REAP criterion, Cerebras arXiv:2510.13999; the paper warns frequency-protection heuristics lose coherence). Per layer keep top-168 by S_j, drop 88, renormalize the router; NVFP4 tensors copied verbatim, gate.weight/e_score_correction_bias sliced 256→168, experts renumbered 0..167, nextn/MTP layer kept (→ speculative decoding). config.json → n_routed_experts = num_experts = 168.

Serving (vLLM, 4× RTX PRO 6000) — `docker-compose.yml` included

MODEL_DIR=/path/to/GLM-5.2-NVFP4-REAP-504B-term docker compose up -d   # OpenAI API on :5001, id "GLM-5.2"

The included compose defaults to the best config found: DCP4 + MTP5 + use_index_cache → ~489k-token KV pool, ~80 tok/s single-stream codegen (30% MTP accept). use_index_cache (caches the DSA top-2048 sparse indices across decode steps) is what makes DCP4 fast — previously DCP4 was comm-bound ~40 tok/s on PCIe. For max short-context decode speed: DCP_SIZE=1 MTP=1 MAX_MODEL_LEN=125000 (125k ctx). Tested on voipmonitor/vllm:glm52-v11-darkdevotion-...-cu132.

Credits

0xSero — GLM-5.2-NVFP4-REAP-469B, the prior art that started this.
lukealonso — GLM-5.2-NVFP4 base quant (experts sliced byte-for-byte).
REAP — Cerebras, arXiv:2510.13999.
Base model: GLM-5.2 by Z.ai.

Downloads last month: -

Safetensors

Model size

290B params

Tensor type

BF16

F8_E4M3

F32

Model tree for madeby561/GLM-5.2-NVFP4-REAP-504B-term

Base model

zai-org/GLM-5.2

Quantized

lukealonso/GLM-5.2-NVFP4

Quantized

(2)

this model

Paper for madeby561/GLM-5.2-NVFP4-REAP-504B-term

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Paper • 2510.13999 • Published Oct 15, 2025 • 20