GLM-5.2-NVFP4-REAP-504B (code-calibrated, loop-free)

A REAP (Router-weighted Expert Activation Pruning) of GLM-5.2 in NVFP4, pruned 256 → 168 routed experts/layer (34% removed) via pure-saliency scoring on code/agentic calibration data. Runs on 4× 96 GB Blackwell GPUs (RTX PRO 6000, SM120, TP4) with vLLM (b12x), at long context, and — unlike earlier GLM-5.2 REAPs, and with the recommended sampling (below) — does not collapse into a thinking-loop on open-ended generation.

Credit where it's due: this work exists because of 0xSero/GLM-5.2-NVFP4-REAP-469B — the first NVFP4 REAP of GLM-5.2, and the reason we started down this path. The base, lukealonso/GLM-5.2-NVFP4, is an excellent, tightly-packed NVFP4 quant — our experts are sliced from it byte-for-byte (no re-quantization).

model experts/layer params (nominal) size on disk
Full GLM-5.2 (luke NVFP4) 256 ~753B 467.1 GB (435 GiB)
This model 168 ~504B 308.9 GB (288 GiB)
-504B-term sibling 168 ~504B 308.9 GB
0xSero REAP-469B 156 ~469B 307.8 GB

(Note: on-disk GB doesn't track the nominal "B" param count — or even expert count — 1:1 once quant format and layout differ; this 168-expert model lands about even with 0xSero's 156-expert one. There's also a -504B-term sibling recalibrated to preserve long-reasoning termination — more verbose; pick per your workload.)

Why this one doesn't loop

Earlier GLM-5.2 REAPs tend to over-elaborate inside <think> and never emit an answer. This model addresses it on three fronts (plus a sampling note below that matters as much as any of them):

  1. Pure-saliency selection, no frequency overlay. Experts kept by S_j = mean_active_tokens( gate_weight_j · ‖expert_output_j‖₂ ) — the REAP criterion (Cerebras, arXiv:2510.13999) — and only that. The paper warns frequency-protection heuristics "lose the ability to produce coherent output."
  2. Code/agentic calibration. Saliency measured on code & tool-use (evol-codealpaca, Magicoder, xLAM function-calling, SWE-smith, math) — REAP §5.1 shows general-domain calibration can collapse coding to incoherence.
  3. Gentler prune (168 vs 156). More coherence margin.

The trimmed 88 experts/layer are the diffuse low-amplitude tail (~19% of saliency mass, no single expert > ~0.3% of its layer — no hub expert removed).

Reasoning effort — only high and max

GLM-5.2 exposes two reasoning levels: high and max. The chat template defaults to max and treats anything else as max (no low/medium/minimal). Pass reasoning_effort: "high" (e.g. via chat_template_kwargs) for shorter, faster thinking. GLM-5.2 is a heavy thinker at max — give it a generous max_tokens so it isn't cut off mid-thought.

Sampling — important

Use repetition_penalty ≤ 1.0 (1.0 = off). A penalty > 1.0 accumulates over long generations and can spiral the (heavy) reasoning into synonym/token salad — this, more than the prune, was the real cause of "loops" in earlier attempts.

temperature 0.6, top_p 0.95, repetition_penalty 1.0

Method (no re-quantization)

Surviving experts are luke's NVFP4 weights bit-for-bit. A custom collector builds GLM-5.2 on meta, swaps each fused experts for luke's separate per-expert NVFP4 tensors (gate/up/down = U8 4-bit weight + E4M3 per-group-16 weight_scale + F32 weight_scale_2 + F32 input_scale), and on every forward dequantizes each fired expert on the fly (modelopt NVFP4QTensor, block 16) to accumulate S_j — HF Transformers can't load modelopt fused-MoE NVFP4. All 256 experts in the 75 MoE layers (3–77) are scored, no frequency term. Per layer keep top-168, renormalize the router; NVFP4 copied verbatim, gate.weight/e_score_correction_bias sliced 256→168, experts renumbered 0..167, nextn/MTP layer kept (→ speculative decoding). config.json → n_routed_experts = num_experts = 168. vLLM (b12x) remaps separate-per-expert NVFP4 into a fused MoE at load.

Serving (vLLM, 4× RTX PRO 6000) — docker-compose.yml included

MODEL_DIR=/path/to/GLM-5.2-NVFP4-REAP-504B docker compose up -d   # OpenAI API on :5001, id "GLM-5.2"

The included compose defaults to the best config found: DCP4 + MTP5 + use_index_cache~489k-token KV pool, ~80 tok/s single-stream codegen (~30% MTP accept). use_index_cache (caches the DSA top-2048 sparse indices across decode steps) makes DCP4 fast — previously comm-bound ~40 tok/s on PCIe. For max short-context decode speed: DCP_SIZE=1 MTP=1 MAX_MODEL_LEN=125000. Tested on voipmonitor/vllm:glm52-v11-darkdevotion-...-cu132.

What it's calibrated on (and what that means)

Calibrated narrow, on purpose: code + tool-calling/agentic data (evol-codealpaca, Magicoder, xLAM function-calling, SWE-smith).

  • Stronger at: coding, tool use, agentic tasks.
  • Weaker at (expected): general knowledge, other languages, niche domains.
  • Long context still works — it lives in the attention, which isn't pruned.

Want a broad general-purpose model instead? Calibrate on a wider mix (general + multilingual + long-context).

Honest limitations

  • Not yet run through a full benchmark suite vs the full model — quality claims are from coherence/codegen spot-checks (single-file Tetris/Mario, no loop), not a measured "within X%".
  • NVFP4 quantization and expert pruning are both lossy; per REAP, generative/coding degrades little but knowledge/MCQA recall is the weakest axis; non-English/niche domains most at risk.
  • fp8 KV cache is a suspected source of rare-token degradation at very long context; switch to bf16 KV if observed (halves context).

Credits

Downloads last month
-
Safetensors
Model size
290B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for madeby561/GLM-5.2-NVFP4-REAP-504B

Base model

zai-org/GLM-5.2
Quantized
(2)
this model

Paper for madeby561/GLM-5.2-NVFP4-REAP-504B