GLM-5.2-504B-Nvidia — REAP keep-168 of NVIDIA's official NVFP4 (no retraining)
A 34%-expert-pruned GLM-5.2, cut directly from
nvidia/GLM-5.2-NVFP4using the exact same REAP keep-168 expert selection as0xSero/GLM-5.2-504B. Whole-expert copy — NVIDIA's NVFP4 weights and scales are preserved bit-for-bit; no re-quantization, no fine-tuning.
This is a pure structural prune of NVIDIA's NVFP4 checkpoint. It keeps the highest-saliency 168 of
256 routed experts per layer and drops the rest. Unlike 0xSero/GLM-5.2-504B,
it is not Router-KD recovered — the router gates here are the original rows sliced down to the kept
experts. Use the recommended sampler guardrail below (it fully recovers pruning-induced looping for free).
What it is
GLM-5.2 is a GlmMoeDsaForCausalLM MoE — 78 layers (3 dense + 75 MoE) + 1 MTP layer,
256 routed experts per layer (top-8) + 1 shared expert, DeepSeek-style MLA attention with a DSA
sparse "indexer," hidden size 6144.
This model keeps 168 of the 256 routed experts per layer (≈504B params, down from ~744–763B),
uniformly across every MoE layer and the MTP layer (n_routed_experts: 168), so it loads and
serves cleanly in vLLM.
| Source | nvidia/GLM-5.2-NVFP4 (NVIDIA modelopt NVFP4) |
| Prune method | REAP — saliency = gate_weight × ‖expert_output‖, top-168 kept per layer |
| Expert selection | identical to 0xSero/GLM-5.2-504B (same REAP plan) |
| Recovery | none — raw structural prune (gates sliced, not KD-retrained) |
| Quantization | NVFP4 on routed experts (3–77) + FP8 KV cache, preserved verbatim from NVIDIA; MTP layer 78 BF16 |
| Params | ~504B (34% of routed experts pruned) |
How it was made
A direct safetensors transform — no GPU, no training, no re-quantization:
- For each MoE layer (3–78), keep the top-168 experts from the REAP saliency plan; copy each kept
expert whole so its NVFP4 packing (
weight,weight_scale,weight_scale_2,input_scale) travels intact. NVIDIA's quantized values are bit-identical to the source. - Renumber surviving experts
0..167and slice the router (gate.weight,gate.e_score_correction_bias) to the same 168 rows. - Everything else — attention, shared expert, norms, embeddings, the MTP block — is copied verbatim.
n_routed_expertsis set to 168.
The MTP (next-token / multi-token-prediction) layer is preserved at 168 experts, so self-speculative decoding still works.
Serving (vLLM)
vllm serve 0xSero/GLM-5.2-504B-Nvidia \
--tensor-parallel-size 8 \
--quantization modelopt_fp4 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--max-model-len 262144
Recommended sampler guardrail — recover the loop cost for free
REAP pruning roughly doubles GLM-5.2's tendency to fall into repeat / </think>-restart loops (the
dominant agent-use failure mode). As established in the GLM-5.2-504B report,
this is fully recoverable at serving time with no retraining via a light sampler guardrail
(measured at n=2000 on the keep-168 cut):
min_p=0.05, repetition_penalty=1.05→ gentle, safe default.min_p=0.05, repetition_penalty=1.10→ drops looping to ~2.3% (below the unpruned teacher's raw 3.6%). Start at 1.05; go to 1.10 if you see loops.
Because this variant is not Router-KD recovered (the gates were not retrained), the guardrail is
strongly recommended rather than optional. For the KD-recovered variant with the full evaluation,
use 0xSero/GLM-5.2-504B.
Relationship to the GLM-5.2 REAP series
| model | source | recovery | use when |
|---|---|---|---|
0xSero/GLM-5.2-504B |
0xSero NVFP4 | Router-KD | you want the evaluated, recovered flagship |
| this model | NVIDIA NVFP4 | none | you want NVIDIA's exact quantization, pruned |
0xSero/GLM-5.2-REAP-504B-GGUF |
BF16 | — | llama.cpp / CPU / Metal |
📄 Method, evaluation, and the honest accounting of pruning cost: see the GLM-5.2-504B technical report.
Provenance & honesty
- Expert selection is the same REAP keep-168 plan that produced
0xSero/GLM-5.2-504B. - NVFP4 routed-expert weights/scales are NVIDIA's, unmodified (whole-expert copy; no re-quant).
- This specific variant was not separately re-evaluated; the behavioral numbers cited above come from the keep-168 cut in the linked report and are referenced as guidance, not as a fresh measurement of this checkpoint. It is not Router-KD recovered.
REAP expert-pruning applied to NVIDIA's NVFP4 GLM-5.2. Quantization by NVIDIA; pruning recipe from the GLM-5.2 REAP study.
- Downloads last month
- 42