GLM-5.2-REAP25-MLX-4bit

A REAP expert-pruned, 4-bit MLX conversion of zai-org/GLM-5.2. It keeps the 192 most salient of the 256 routed experts in each MoE layer (25% pruned), giving ~572B params versus ~750B for the full model, so it is smaller and faster to run.

What is this

Pruned with REAP (Router-weighted Expert Activation Pruning, Cerebras / ICLR 2026). In each MoE layer, every expert is scored by mean(router_gate_weight × ‖expert_output‖) over a calibration set. The lowest-saliency experts are dropped and the router is sliced down to the survivors. No retraining is performed. n_routed_experts is reduced from 256 to 192.
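The scoring and router-slicing steps described above can be sketched in NumPy. This is an illustrative toy, not the conversion script used for this repo: the function names `reap_saliency` and `prune_experts`, the array layouts, and the choice to average only over tokens actually routed to an expert are all assumptions made for the example.

```python
import numpy as np

def reap_saliency(gate_weights, output_norms):
    """Per-expert saliency: mean of gate_weight * ||expert_output||.

    Both inputs have shape (n_tokens, n_experts). gate_weights is assumed
    to be 0 wherever the router did not select that expert for the token.
    """
    routed = gate_weights > 0
    counts = routed.sum(axis=0)
    totals = (gate_weights * output_norms * routed).sum(axis=0)
    # Experts that were never routed to get a saliency of 0.
    saliency = np.zeros(totals.shape, dtype=float)
    np.divide(totals, counts, out=saliency, where=counts > 0)
    return saliency

def prune_experts(router_weight, saliency, keep):
    """Keep the `keep` highest-saliency experts and slice the router rows.

    router_weight is assumed to have shape (n_experts, hidden_size).
    Survivor indices are returned in their original order.
    """
    survivors = np.sort(np.argsort(saliency)[-keep:])
    return survivors, router_weight[survivors]

# Toy layer: 3 calibration tokens, 4 experts, top-2 routing.
gate = np.array([[0.6, 0.4, 0.0, 0.0],
                 [0.5, 0.0, 0.5, 0.0],
                 [0.0, 0.7, 0.0, 0.3]])
norms = np.tile(np.array([2.0, 1.0, 4.0, 0.5]), (3, 1))
router = np.arange(8, dtype=float).reshape(4, 2)

saliency = reap_saliency(gate, norms)
survivors, sliced_router = prune_experts(router, saliency, keep=3)
```

In a real layer the same `survivors` index would also be used to select the surviving expert weight tensors, so that router row i still points at expert i after pruning.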

Quality (held-out perplexity on Frankenstein, which is not in the calibration set)

| Variant | Experts | ~Params | Held-out PPL | vs full |
|---|---|---|---|---|
| full GLM-5.2 (4-bit) | 256 | ~750B | 1.447 | baseline |
| REAP25 (this repo) | 192 | ~572B | 1.481 | +2.3% |
| REAP37 | 160 | ~480B | 1.553 | +7.3% |
| REAP50 | 128 | ~394B | 1.990 | +37.5% |

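This variant reaches a held-out PPL of 1.481, which is +2.3% relative to the full 4-bit model and near-lossless on this text. Absolute PPL is low because the eval text is highly predictable, so treat the numbers as relative degradation rather than as absolute quality.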
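For readers who want to check the "vs full" column, the relative degradation is simply the ratio of each variant's perplexity to the full model's, minus one. A small sketch using the numbers from the table above (the helper name `relative_degradation` is made up for this example):

```python
# Held-out perplexities copied from the table above.
full_ppl = 1.447
variant_ppl = {"REAP25": 1.481, "REAP37": 1.553, "REAP50": 1.990}

def relative_degradation(ppl, baseline):
    """Percentage increase in perplexity relative to the baseline."""
    return 100.0 * (ppl / baseline - 1.0)

relative = {
    name: round(relative_degradation(ppl, full_ppl), 1)
    for name, ppl in variant_ppl.items()
}

for name, pct in relative.items():
    print(f"{name}: +{pct}%")
```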

Methodology

Saliency was calibrated on the 4-bit GLM-5.2 using 192 sequences of 1024 tokens each (prose + code). Experts were pruned during the MLX conversion, with no intermediate bf16 step. Loading requires the glm_moe_dsa / deepseek_v32 MLX path with per-layer indexer handling.

Use with mlx-lm

pip install mlx-lm
python -m mlx_lm generate --model pipenetwork/GLM-5.2-REAP25-MLX-4bit --prompt "Hello" -m 256

License

MIT (inherited from GLM-5.2). Quantization: {"group_size": 64, "bits": 4, "mode": "affine"}.
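To make the quantization settings concrete, here is a generic NumPy sketch of affine group quantization with group_size 64 and 4 bits. It illustrates the general idea (one scale and one offset per group of 64 weights, 16 integer levels) and is not MLX's actual kernel or storage layout; the function names are hypothetical.

```python
import numpy as np

def affine_quantize(w, group_size=64, bits=4):
    """Quantize w group-wise to integers in [0, 2**bits - 1].

    Assumes w.size is a multiple of group_size. Returns the integer codes
    plus a per-group scale and bias (the group minimum).
    """
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    # Guard against constant groups, where hi == lo.
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, bias, shape):
    """Reconstruct approximate weights: w_hat = q * scale + bias."""
    return (q * scale + bias).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 128))
q, scale, bias = affine_quantize(w)
w_hat = affine_dequantize(q, scale, bias, w.shape)
max_abs_error = np.abs(w - w_hat).max()
```

With this scheme the reconstruction error of each weight is bounded by half of its group's scale, which is why smaller groups generally give lower error at the cost of storing more scales and biases.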
