GLM-5.2-504B-Nvidia — REAP keep-168 of NVIDIA's official NVFP4 (no retraining)

A 34%-expert-pruned GLM-5.2, cut directly from nvidia/GLM-5.2-NVFP4 using the exact same REAP keep-168 expert selection as 0xSero/GLM-5.2-504B. Whole-expert copy — NVIDIA's NVFP4 weights and scales are preserved bit-for-bit; no re-quantization, no fine-tuning.

This is a pure structural prune of NVIDIA's NVFP4 checkpoint. It keeps the highest-saliency 168 of 256 routed experts per layer and drops the rest. Unlike 0xSero/GLM-5.2-504B, it is not Router-KD recovered — the router gates here are the original rows sliced down to the kept experts. Use the recommended sampler guardrail below (it fully recovers pruning-induced looping for free).


What it is

GLM-5.2 is a GlmMoeDsaForCausalLM MoE — 78 layers (3 dense + 75 MoE) + 1 MTP layer, 256 routed experts per layer (top-8) + 1 shared expert, DeepSeek-style MLA attention with a DSA sparse "indexer," hidden size 6144.

This model keeps 168 of the 256 routed experts per layer (≈504B params, down from ~744–763B), uniformly across every MoE layer and the MTP layer (n_routed_experts: 168), so it loads and serves cleanly in vLLM.

Source nvidia/GLM-5.2-NVFP4 (NVIDIA modelopt NVFP4)
Prune method REAP — saliency = gate_weight × ‖expert_output‖, top-168 kept per layer
Expert selection identical to 0xSero/GLM-5.2-504B (same REAP plan)
Recovery none — raw structural prune (gates sliced, not KD-retrained)
Quantization NVFP4 on routed experts (3–77) + FP8 KV cache, preserved verbatim from NVIDIA; MTP layer 78 BF16
Params ~504B (34% of routed experts pruned)

How it was made

A direct safetensors transform — no GPU, no training, no re-quantization:

  1. For each MoE layer (3–78), keep the top-168 experts from the REAP saliency plan; copy each kept expert whole so its NVFP4 packing (weight, weight_scale, weight_scale_2, input_scale) travels intact. NVIDIA's quantized values are bit-identical to the source.
  2. Renumber surviving experts 0..167 and slice the router (gate.weight, gate.e_score_correction_bias) to the same 168 rows.
  3. Everything else — attention, shared expert, norms, embeddings, the MTP block — is copied verbatim. n_routed_experts is set to 168.

The MTP (next-token / multi-token-prediction) layer is preserved at 168 experts, so self-speculative decoding still works.

Serving (vLLM)

vllm serve 0xSero/GLM-5.2-504B-Nvidia \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --max-model-len 262144

Recommended sampler guardrail — recover the loop cost for free

REAP pruning roughly doubles GLM-5.2's tendency to fall into repeat / </think>-restart loops (the dominant agent-use failure mode). As established in the GLM-5.2-504B report, this is fully recoverable at serving time with no retraining via a light sampler guardrail (measured at n=2000 on the keep-168 cut):

  • min_p=0.05, repetition_penalty=1.05 → gentle, safe default.
  • min_p=0.05, repetition_penalty=1.10 → drops looping to ~2.3% (below the unpruned teacher's raw 3.6%). Start at 1.05; go to 1.10 if you see loops.

Because this variant is not Router-KD recovered (the gates were not retrained), the guardrail is strongly recommended rather than optional. For the KD-recovered variant with the full evaluation, use 0xSero/GLM-5.2-504B.

Relationship to the GLM-5.2 REAP series

model source recovery use when
0xSero/GLM-5.2-504B 0xSero NVFP4 Router-KD you want the evaluated, recovered flagship
this model NVIDIA NVFP4 none you want NVIDIA's exact quantization, pruned
0xSero/GLM-5.2-REAP-504B-GGUF BF16 llama.cpp / CPU / Metal

📄 Method, evaluation, and the honest accounting of pruning cost: see the GLM-5.2-504B technical report.

Provenance & honesty

  • Expert selection is the same REAP keep-168 plan that produced 0xSero/GLM-5.2-504B.
  • NVFP4 routed-expert weights/scales are NVIDIA's, unmodified (whole-expert copy; no re-quant).
  • This specific variant was not separately re-evaluated; the behavioral numbers cited above come from the keep-168 cut in the linked report and are referenced as guidance, not as a fresh measurement of this checkpoint. It is not Router-KD recovered.

REAP expert-pruning applied to NVIDIA's NVFP4 GLM-5.2. Quantization by NVIDIA; pruning recipe from the GLM-5.2 REAP study.

Downloads last month
42
Safetensors
Model size
293B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/GLM-5.2-504B-Nvidia

Base model

zai-org/GLM-5.2
Quantized
(1)
this model

Collection including 0xSero/GLM-5.2-504B-Nvidia