GLM-5.2-NVFP4-REAP-504B (code-calibrated, loop-free)
A REAP (Router-weighted Expert Activation Pruning) of GLM-5.2 in NVFP4, pruned 256 → 168 routed experts/layer (34% removed) via pure-saliency scoring on code/agentic calibration data. Runs on 4× 96 GB Blackwell GPUs (RTX PRO 6000, SM120, TP4) with vLLM (b12x), at long context, and — unlike earlier GLM-5.2 REAPs, and with the recommended sampling (below) — does not collapse into a thinking-loop on open-ended generation.
Credit where it's due: this work exists because of 0xSero/GLM-5.2-NVFP4-REAP-469B — the first NVFP4 REAP of GLM-5.2, and the reason we started down this path. The base, lukealonso/GLM-5.2-NVFP4, is an excellent, tightly-packed NVFP4 quant — our experts are sliced from it byte-for-byte (no re-quantization).
| model | experts/layer | params (nominal) | size on disk |
|---|---|---|---|
| Full GLM-5.2 (luke NVFP4) | 256 | ~753B | 467.1 GB (435 GiB) |
| This model | 168 | ~504B | 308.9 GB (288 GiB) |
-504B-term sibling |
168 | ~504B | 308.9 GB |
| 0xSero REAP-469B | 156 | ~469B | 307.8 GB |
(Note: on-disk GB doesn't track the nominal "B" param count — or even expert count — 1:1 once quant format and layout differ; this 168-expert model lands about even with 0xSero's 156-expert one. There's also a -504B-term sibling recalibrated to preserve long-reasoning termination — more verbose; pick per your workload.)
Why this one doesn't loop
Earlier GLM-5.2 REAPs tend to over-elaborate inside <think> and never emit an answer. This model addresses it on three fronts (plus a sampling note below that matters as much as any of them):
- Pure-saliency selection, no frequency overlay. Experts kept by
S_j = mean_active_tokens( gate_weight_j · ‖expert_output_j‖₂ )— the REAP criterion (Cerebras, arXiv:2510.13999) — and only that. The paper warns frequency-protection heuristics "lose the ability to produce coherent output." - Code/agentic calibration. Saliency measured on code & tool-use (evol-codealpaca, Magicoder, xLAM function-calling, SWE-smith, math) — REAP §5.1 shows general-domain calibration can collapse coding to incoherence.
- Gentler prune (168 vs 156). More coherence margin.
The trimmed 88 experts/layer are the diffuse low-amplitude tail (~19% of saliency mass, no single expert > ~0.3% of its layer — no hub expert removed).
Reasoning effort — only high and max
GLM-5.2 exposes two reasoning levels: high and max. The chat template defaults to max and treats anything else as max (no low/medium/minimal). Pass reasoning_effort: "high" (e.g. via chat_template_kwargs) for shorter, faster thinking. GLM-5.2 is a heavy thinker at max — give it a generous max_tokens so it isn't cut off mid-thought.
Sampling — important
Use repetition_penalty ≤ 1.0 (1.0 = off). A penalty > 1.0 accumulates over long generations and can spiral the (heavy) reasoning into synonym/token salad — this, more than the prune, was the real cause of "loops" in earlier attempts.
temperature 0.6, top_p 0.95, repetition_penalty 1.0
Method (no re-quantization)
Surviving experts are luke's NVFP4 weights bit-for-bit. A custom collector builds GLM-5.2 on meta, swaps each fused experts for luke's separate per-expert NVFP4 tensors (gate/up/down = U8 4-bit weight + E4M3 per-group-16 weight_scale + F32 weight_scale_2 + F32 input_scale), and on every forward dequantizes each fired expert on the fly (modelopt NVFP4QTensor, block 16) to accumulate S_j — HF Transformers can't load modelopt fused-MoE NVFP4. All 256 experts in the 75 MoE layers (3–77) are scored, no frequency term. Per layer keep top-168, renormalize the router; NVFP4 copied verbatim, gate.weight/e_score_correction_bias sliced 256→168, experts renumbered 0..167, nextn/MTP layer kept (→ speculative decoding). config.json → n_routed_experts = num_experts = 168. vLLM (b12x) remaps separate-per-expert NVFP4 into a fused MoE at load.
Serving (vLLM, 4× RTX PRO 6000) — docker-compose.yml included
MODEL_DIR=/path/to/GLM-5.2-NVFP4-REAP-504B docker compose up -d # OpenAI API on :5001, id "GLM-5.2"
The included compose defaults to the best config found: DCP4 + MTP5 + use_index_cache → ~489k-token KV pool, ~80 tok/s single-stream codegen (~30% MTP accept). use_index_cache (caches the DSA top-2048 sparse indices across decode steps) makes DCP4 fast — previously comm-bound ~40 tok/s on PCIe. For max short-context decode speed: DCP_SIZE=1 MTP=1 MAX_MODEL_LEN=125000. Tested on voipmonitor/vllm:glm52-v11-darkdevotion-...-cu132.
What it's calibrated on (and what that means)
Calibrated narrow, on purpose: code + tool-calling/agentic data (evol-codealpaca, Magicoder, xLAM function-calling, SWE-smith).
- Stronger at: coding, tool use, agentic tasks.
- Weaker at (expected): general knowledge, other languages, niche domains.
- Long context still works — it lives in the attention, which isn't pruned.
Want a broad general-purpose model instead? Calibrate on a wider mix (general + multilingual + long-context).
Honest limitations
- Not yet run through a full benchmark suite vs the full model — quality claims are from coherence/codegen spot-checks (single-file Tetris/Mario, no loop), not a measured "within X%".
- NVFP4 quantization and expert pruning are both lossy; per REAP, generative/coding degrades little but knowledge/MCQA recall is the weakest axis; non-English/niche domains most at risk.
- fp8 KV cache is a suspected source of rare-token degradation at very long context; switch to bf16 KV if observed (halves context).
Credits
- 0xSero — GLM-5.2-NVFP4-REAP-469B, the prior art that started this.
- lukealonso — GLM-5.2-NVFP4 base quant (experts sliced byte-for-byte).
- REAP — Cerebras, arXiv:2510.13999. Base model: GLM-5.2 by Z.ai.
- Downloads last month
- -