
GLM-5.1-555B

REAP-pruned zai-org/GLM-5.1.

At a glance

Base model       zai-org/GLM-5.1
Format           BF16
Total params     555B
Active / token   14B
Experts / layer  192
Layers           78
Hidden size      6144
Context          202,752
On-disk size     1125 GB

Which variant should I pick?

Variant               Format  Link
GLM-5.1-444B          BF16    link
GLM-5.1-444B-GGUF     GGUF    link
GLM-5.1-478B-NVFP4    NVFP4   link
GLM-5.1-555B (this)   BF16    link
GLM-5.1-555B-GGUF     GGUF    link
GLM-5.1-555B-NVFP4    NVFP4   link
GLM-5.1-555B-W4A16    W4A16   link

DO NOT USE THIS MODEL FOR ANYTHING SERIOUS.

This checkpoint has not been benchmarked, validated, or tested for coherence. It may produce garbage, repetitive loops, incoherent text, or complete nonsense. Treat it as a broken artifact until proven otherwise.


GLM-5.1 — 25% Expert Pruned (REAP)

This is a 25% expert-pruned version of zai-org/GLM-5.1, produced with the REAP method (Router-weighted Expert Activation Pruning).

Property                                  Value
Base model                                zai-org/GLM-5.1
Architecture                              GlmMoeDsaForCausalLM (MoE with DeepSeek Sparse Attention, DSA)
Params before prune                       743.91B
Params after prune                        ~555B
Parameter reduction                       25.4%
Routed experts per layer                  256 → 192 (removed 64)
Shared experts per layer                  1 (unchanged)
Active params/token                       ~14B (top-8 routing preserved)
Precision                                 BF16
Prune method                              REAP (layerwise, refusal_contrast_reap, renorm)
Sparse MoE layers                         75 of 78 total (first 3 are dense)
Estimated max per-layer REAP signal loss  ~15.8%
Observation coverage                      6144/6999 packed batches, 7707/22000 samples (~35% of planned calibration)
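
The percentages in the table can be rechecked from the raw figures it lists. The snippet below is only a sanity check on the card's own numbers; the helper name `reduction_pct` is made up for illustration, and the ~555B figure is itself approximate.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Figures copied from the table above (parameter counts in billions).
params_before_b = 743.91
params_after_b = 555.0          # the card gives this as "~555B"
experts_before, experts_after = 256, 192
samples_seen, samples_planned = 7707, 22000

param_reduction = reduction_pct(params_before_b, params_after_b)
expert_reduction = reduction_pct(experts_before, experts_after)
calibration_coverage = 100.0 * samples_seen / samples_planned

print(f"parameter reduction:  {param_reduction:.1f}%")
print(f"expert reduction:     {expert_reduction:.1f}%")
print(f"calibration coverage: {calibration_coverage:.1f}%")
```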

Why This Might Be Broken

  1. Partial calibration data: The saliency scores used to select experts for removal were computed from only ~35% of the planned 22,000-sample calibration corpus, so the expert importance rankings may be inaccurate.

  2. No quality testing: No benchmarks, coherence checks, perplexity measurements, or human evaluations have been run. The model may produce degenerate output.

  3. Aggressive prune ratio: Prior experiments with GLM-family models at similar or higher prune ratios produced complete output collapse (repetitive text, broken reasoning, junk logits). Among the sibling checkpoints, the 50% one is very likely broken, based on prior GLM-5 evidence.

  4. DSA architecture sensitivity: GLM-5.1 uses DeepSeek Sparse Attention (DSA) with learned indexer weights. The interaction between pruned expert routing and the DSA indexer has not been validated.

  5. refusal_contrast_reap without preserve guards: Experts were selected with refusal_contrast_reap, without the preserve_super or preserve_outlier guardrails. In prior GLM-5 experiments this configuration led to output collapse at high prune ratios.

What This Is Useful For

  • Research only. Specifically:
    • Studying REAP expert saliency patterns in GLM-5.1
    • Comparing prune-ratio robustness across architectures
    • Running your own coherence/benchmark evaluations
    • Investigating MoE collapse behavior
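
As a starting point for your own coherence evaluation, here is a minimal sketch of a repetition check based on the fraction of distinct n-grams in generated text. It is a crude heuristic, not a substitute for perplexity or benchmark runs; the function names, the whitespace tokenization, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def distinct_ngram_ratio(tokens, n=3):
    """Fraction of n-grams that are unique; values near 0 suggest looping."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to judge, treat as non-repetitive
    return len(set(ngrams)) / len(ngrams)


def looks_degenerate(text, n=3, threshold=0.5):
    """Flag text whose distinct n-gram ratio falls below `threshold`.

    The threshold is an arbitrary assumption; tune it on known-good output
    from the unpruned base model before trusting it.
    """
    return distinct_ngram_ratio(text.split(), n) < threshold


looping = "the cat " * 50
healthy = "Expert pruning removes routed experts while keeping the router top-k unchanged."

print(looks_degenerate(looping))
print(looks_degenerate(healthy))
```

Running the same check on the base model's output for identical prompts gives a baseline to compare against.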

How to Load

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-5.1-555B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-5.1-555B", trust_remote_code=True)

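# IMPORTANT: GLM-5.1 is a thinking/chat model. Use the chat template.
messages = [{"role": "user", "content": "Hello"}]
# return_dict=True yields input_ids plus attention_mask, so generate() receives both.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:]))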
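
Before loading with device_map="auto", it helps to estimate whether the weights fit at all. The sketch below is a back-of-the-envelope calculation assuming 2 bytes per BF16 parameter and counting weights only (no KV cache or activations). The 80 GB device size and the 0.9 usable fraction are hypothetical values, not recommendations from this card.

```python
import math


def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only memory in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def min_devices(total_gb: float, device_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest device count whose usable memory covers `total_gb`."""
    return math.ceil(total_gb / (device_gb * usable_fraction))


# 555B parameters as stated on this card, stored in BF16 (2 bytes each).
weights_gb = weight_memory_gb(555e9)

# Hypothetical 80 GB accelerators, keeping 10% of each one free.
n_gpus = min_devices(weights_gb, device_gb=80.0)

print(f"weights: ~{weights_gb:.0f} GB, at least {n_gpus} x 80 GB devices")
```

The estimate lands slightly under the 1125 GB on-disk size listed above, and real usage will be higher once the KV cache and activations are included.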

Pruning Method

REAP (Router-weighted Expert Activation Pruning) removes MoE experts by scoring how much each expert contributes to the layer output during a calibration pass. Experts with the lowest saliency scores (combined REAP signal + frequency weighting) are removed layer by layer. Top-8 routing is kept unchanged, so the active-parameter budget per token stays the same.
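
To make the selection step concrete, here is a minimal sketch of a REAP-style saliency score on toy data. It assumes the criterion described in the cited paper: for each expert, the mean over the tokens routed to it of the router gate weight times the L2 norm of the expert's output. It does not reproduce the refusal_contrast_reap selection, the frequency weighting, or the renorm step used for this checkpoint, and the function names and toy numbers are invented for illustration.

```python
def reap_saliency(gate_weights, output_norms):
    """Per-expert saliency for one MoE layer.

    gate_weights[t][j]: router weight of expert j at token t (0.0 if not selected).
    output_norms[t][j]: L2 norm of expert j's output at token t.
    Only tokens actually routed to an expert contribute to its score.
    """
    n_experts = len(gate_weights[0])
    totals = [0.0] * n_experts
    counts = [0] * n_experts
    for g_row, n_row in zip(gate_weights, output_norms):
        for j in range(n_experts):
            if g_row[j] > 0.0:
                totals[j] += g_row[j] * n_row[j]
                counts[j] += 1
    return [totals[j] / counts[j] if counts[j] else 0.0 for j in range(n_experts)]


def experts_to_prune(saliency, n_remove):
    """Indices of the `n_remove` lowest-saliency experts, in ascending index order."""
    order = sorted(range(len(saliency)), key=lambda j: saliency[j])
    return sorted(order[:n_remove])


# Toy layer: 3 tokens, 4 experts, top-2 routing.
gate = [
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
norms = [
    [2.0, 1.0, 9.0, 9.0],   # norms of unselected experts are ignored
    [2.0, 9.0, 0.5, 9.0],
    [9.0, 1.0, 9.0, 0.2],
]

saliency = reap_saliency(gate, norms)
pruned = experts_to_prune(saliency, n_remove=1)  # 1 of 4 experts = 25%
print(saliency, pruned)
```

Removing 1 of 4 toy experts mirrors the 25% ratio of this checkpoint (64 of 256 routed experts per layer). The router still selects its top-k from the experts that remain.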

Sibling Checkpoints

Prune %  Total params  Experts/layer  HuggingFace
25%      ~555B         192/256        0xSero/GLM-5.1-555B
40%      ~444B         154/256        0xSero/GLM-5.1-444B
50%      ~367B         128/256        0xSero/GLM-5.1-367B-A14B-REAP

All three are untested. The 25% checkpoint is the most likely to be coherent.

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
