- GLM-5.2-NVFP4-REAP-504B-term (termination-recalibrated variant)
- What's different from the code-calibrated
-504B - Reasoning effort — only
highandmax - What it's calibrated on (and what that means)
- Honest limitations / which sibling to pick — READ THIS
- Sampling — important
- Method (no re-quantization)
- Serving (vLLM, 4× RTX PRO 6000) —
docker-compose.ymlincluded - Credits
- What's different from the code-calibrated
GLM-5.2-NVFP4-REAP-504B-term (termination-recalibrated variant)
A REAP (Router-weighted Expert Activation Pruning) of GLM-5.2 in NVFP4, pruned 256 → 168 routed experts/layer, built to run on 4× 96 GB Blackwell GPUs (RTX PRO 6000, SM120, TP4) with vLLM (b12x).
Credit where it's due: this whole line of work exists because of 0xSero/GLM-5.2-NVFP4-REAP-469B — the first NVFP4 REAP of GLM-5.2. We started from that idea and re-derived the prune. And the base, lukealonso/GLM-5.2-NVFP4, is an excellent, tightly-packed NVFP4 quant — our experts are sliced from it byte-for-byte (no re-quantization).
This is the -term sibling of GLM-5.2-NVFP4-REAP-504B (code-calibrated). Same base, same byte-exact slicing pipeline — the one difference is the calibration set: in addition to code/agentic data, saliency here was also measured on the full model's own complete terminating reasoning traces (self-distilled), with extra weight on the </think> transition region. The aim: keep the experts that carry a long reasoning chain smoothly to its conclusion, for cleaner extended (high/max) reasoning.
| model | experts/layer | params (nominal) | size on disk |
|---|---|---|---|
| Full GLM-5.2 (luke NVFP4) | 256 | ~753B | 467.1 GB (435 GiB) |
This model (-term) |
168 | ~504B | 308.9 GB (288 GiB) |
sibling -504B (code-calib) |
168 | ~504B | 308.9 GB |
| 0xSero REAP-469B | 156 | ~469B | 307.8 GB |
(Note: on-disk GB doesn't track the nominal "B" param count — or even expert count — 1:1 once quant format and layout differ. This 168-expert model lands about even with 0xSero's 156-expert one.)
What's different from the code-calibrated -504B
- Calibration = code/agentic + the full model's terminating reasoning traces (~30 self-distilled
prompt → <think>…</think> → answertraces +</think>-region snippets weighted ×6). vs-504B's pure code calibration, this shifts the keep-set toward reasoning-flow experts. - Observed: reasons continuously and coherently (no stalling/premature pausing mid-thought) and self-terminates at
high— and atmaxgiven a generousmax_tokensbudget — by thinking at length and then writing the answer (e.g. a single-file game) rather than looping.
Reasoning effort — only high and max
GLM-5.2 exposes two reasoning levels: high and max. The chat template defaults to max and treats anything else as max; there is no low/medium/minimal. Pass reasoning_effort: "high" (e.g. via chat_template_kwargs) for shorter, faster thinking; leave it default for max. This is a heavy thinker — at max it can reason for tens of thousands of tokens before answering, so give it a generous max_tokens (≈80–120k) or it will hit the cap mid-thought.
What it's calibrated on (and what that means)
Calibrated narrow, on purpose: code + tool-calling/agentic data (evol-codealpaca, Magicoder, xLAM function-calling, SWE-smith) plus the model's own terminating reasoning traces. We did not calibrate on broad/general, multilingual, or long-document data.
- Stronger at: coding, tool use, and ending its own reasoning (the termination traces are the whole point of this variant).
- Weaker at (expected): general knowledge, other languages, niche domains — those experts scored low on a code-heavy calibration, so they got dropped.
- Long context still works — that lives in the attention, which isn't pruned (an internal 177k-token task scores 30/30); we just didn't add long-document calibration.
Want a broad general-purpose model instead? Calibrate on a wider mix (general + multilingual + long-context).
Honest limitations / which sibling to pick — READ THIS
- Not A/B-validated against the code-calibrated
-504B. Whether the trace-recalibration actually helps your workload is unmeasured; for pure coding the-504Bmay be more decisive/less verbose. Benchmark both on your own tasks. - "Max never terminates" is largely a property of the
maxtier, not a pruning defect. The full GLM-5.2 (even fp8) also produces 60k+ tokens of reasoning with no answer on hard open-ended prompts at max. This model targets coherent self-termination, not brevity. - Plausible tradeoff: calibrating on long terminating traces can bias toward longer reasoning — this variant may think more, not less, than
-504B. - NVFP4 + prune are both lossy: generative/coding degrade little (REAP paper); knowledge/MCQA recall and non-English/niche domains are the weak axes.
Sampling — important
Use repetition_penalty ≤ 1.0 (1.0 = off). A penalty > 1.0 accumulates over long generations and spirals the (heavy) reasoning into synonym/token salad.
temperature 0.6, top_p 0.95, repetition_penalty 1.0
Method (no re-quantization)
Surviving experts are luke's NVFP4 weights bit-for-bit. Saliency S_j = mean_{x active}( g_j(x) · ‖f_j(x)‖₂ ) (router gate × raw-expert-output L2, norm taken before the gate) is accumulated over the calibration set via a custom collector that dequantizes each fired expert on the fly (modelopt NVFP4QTensor, block 16) — HF Transformers can't load modelopt fused-MoE NVFP4. All 256 experts in the 75 MoE layers (3–77) are scored, pure saliency, no frequency overlay (the REAP criterion, Cerebras arXiv:2510.13999; the paper warns frequency-protection heuristics lose coherence). Per layer keep top-168 by S_j, drop 88, renormalize the router; NVFP4 tensors copied verbatim, gate.weight/e_score_correction_bias sliced 256→168, experts renumbered 0..167, nextn/MTP layer kept (→ speculative decoding). config.json → n_routed_experts = num_experts = 168.
Serving (vLLM, 4× RTX PRO 6000) — docker-compose.yml included
MODEL_DIR=/path/to/GLM-5.2-NVFP4-REAP-504B-term docker compose up -d # OpenAI API on :5001, id "GLM-5.2"
The included compose defaults to the best config found: DCP4 + MTP5 + use_index_cache → ~489k-token KV pool, ~80 tok/s single-stream codegen (30% MTP accept). 125k ctx). Tested on use_index_cache (caches the DSA top-2048 sparse indices across decode steps) is what makes DCP4 fast — previously DCP4 was comm-bound ~40 tok/s on PCIe. For max short-context decode speed: DCP_SIZE=1 MTP=1 MAX_MODEL_LEN=125000 (voipmonitor/vllm:glm52-v11-darkdevotion-...-cu132.
Credits
- 0xSero — GLM-5.2-NVFP4-REAP-469B, the prior art that started this.
- lukealonso — GLM-5.2-NVFP4 base quant (experts sliced byte-for-byte).
- REAP — Cerebras, arXiv:2510.13999.
- Base model: GLM-5.2 by Z.ai.
- Downloads last month
- -