Libraries

How to use 0xSero/GLM-5.1-555B with Transformers:

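# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

# device_map="auto" spreads the BF16 weights (about 1.1 TB) across available devices
pipe = pipeline(
    "text-generation",
    model="0xSero/GLM-5.1-555B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))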

# Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-5.1-555B")
# Load in BF16 and shard across available devices instead of a single default device
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-5.1-555B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use 0xSero/GLM-5.1-555B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/GLM-5.1-555B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/GLM-5.1-555B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
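
The same chat-completions request can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM or SGLang, and the base URL assumes a server is already running locally on the port used above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_text: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server, so it is left commented out):
# print(chat("http://localhost:8000", "0xSero/GLM-5.1-555B", "What is the capital of France?"))
```

For an SGLang server started as shown in this document, the base URL would be `http://localhost:30000` instead.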

SGLang

How to use 0xSero/GLM-5.1-555B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/GLM-5.1-555B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/GLM-5.1-555B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/GLM-5.1-555B" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use 0xSero/GLM-5.1-555B with Docker Model Runner:
```
docker model run hf.co/0xSero/GLM-5.1-555B
```

Support this work → · X · GitHub · REAP paper · Cerebras REAP

GLM-5.1-555B

REAP-pruned zai-org/GLM-5.1.

At a glance


Base model	zai-org/GLM-5.1
Format	BF16
Total params	555B
Active / token	14B
Experts / layer	192
Layers	78
Hidden size	6144
Context	202,752
On-disk size	1125 GB
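
As a rough cross-check of the figures above, the BF16 weight footprint can be estimated from the parameter count at 2 bytes per parameter. The sketch below assumes decimal gigabytes and a hypothetical 80 GB accelerator. It counts weights only and ignores KV cache and activations, so the device count is a lower bound rather than a deployment recommendation.

```python
import math

BYTES_PER_BF16_PARAM = 2  # bfloat16 stores each parameter in 16 bits


def weight_footprint_gb(total_params: float) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_BF16_PARAM / 1e9


def min_devices(footprint_gb: float, device_memory_gb: float) -> int:
    """Lower bound on the number of devices needed to hold the weights alone."""
    return math.ceil(footprint_gb / device_memory_gb)


estimated_gb = weight_footprint_gb(555e9)  # rounded parameter count from the table
print(f"estimated weights from 555B params: {estimated_gb:.0f} GB")
# 80 GB per device is an assumption for illustration, not a figure from the card.
print(f"devices needed for 1125 GB on disk: {min_devices(1125, 80)}")
```

The estimate from the rounded 555B figure lands close to the listed on-disk size; the small gap is consistent with the parameter count being rounded.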

Which variant should I pick?

Variant	Format	Link
`GLM-5.1-444B`	BF16	link
`GLM-5.1-444B-GGUF`	GGUF	link
`GLM-5.1-478B-NVFP4`	NVFP4	link
`GLM-5.1-555B` (this)	BF16	link
`GLM-5.1-555B-GGUF`	GGUF	link
`GLM-5.1-555B-NVFP4`	NVFP4	link
`GLM-5.1-555B-W4A16`	W4A16	link

DO NOT USE THIS MODEL FOR ANYTHING SERIOUS.

This checkpoint has not been benchmarked, validated, or tested for coherence. It may produce garbage, repetitive loops, incoherent text, or complete nonsense. Treat it as a broken artifact until proven otherwise.

GLM-5.1 — 25% Expert Pruned (REAP)

This is a 25% expert-pruned version of zai-org/GLM-5.1, produced with the REAP method (Router-weighted Expert Activation Pruning).

Property	Value
Base model	`zai-org/GLM-5.1`
Architecture	`GlmMoeDsaForCausalLM` (MoE with DeepSeek Sparse Attention, DSA)
Params before prune	743.91B
Params after prune	~555B
Parameter reduction	25.4%
Routed experts per layer	256 → 192 (removed 64)
Shared experts per layer	1 (unchanged)
Active params/token	~14B (top-8 routing preserved)
Precision	BF16
Prune method	REAP (layerwise, refusal_contrast_reap, renorm)
Sparse MoE layers	75 of 78 total (first 3 are dense)
Estimated max per-layer REAP signal loss	~15.8%
Observation coverage	6144/6999 packed batches, 7707/22000 samples (~35% of planned calibration)
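
The percentages in the table can be re-derived from the raw figures it lists. A minimal sketch, using only numbers quoted above (the 555B total is the card's rounded value, so the result is approximate):

```python
def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


PARAMS_BEFORE_B = 743.91  # billions, "Params before prune"
PARAMS_AFTER_B = 555.0    # billions, rounded "Params after prune"

param_reduction = percent(PARAMS_BEFORE_B - PARAMS_AFTER_B, PARAMS_BEFORE_B)
sample_coverage = percent(7707, 22000)  # calibration samples observed
batch_coverage = percent(6144, 6999)    # packed batches observed

print(f"parameter reduction: {param_reduction}%")
print(f"sample coverage:     {sample_coverage}%")
print(f"batch coverage:      {batch_coverage}%")
```

The sample coverage reproduces the ~35% quoted in the table. The batch coverage works out noticeably higher; the card does not explain the difference between the two counts.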

Why This Might Be Broken

- Partial calibration data: The saliency scores used to select experts for removal were computed from only ~35% of the planned 22,000-sample calibration corpus. Expert importance rankings may be inaccurate.
- No quality testing whatsoever: Zero benchmarks have been run. No coherence check. No perplexity measurement. No human evaluation. The model could produce degenerate output for all we know.
- Aggressive prune ratio: Prior experiments with GLM-family models at similar or higher prune ratios resulted in complete output collapse (repetitive text, broken reasoning, junk logits). The 50% checkpoint in particular is very likely broken based on prior GLM-5 evidence.
- DSA architecture sensitivity: GLM-5.1 uses DeepSeek Sparse Attention (DSA) with learned indexer weights. The interaction between pruned expert routing and the DSA indexer has not been validated.
- refusal_contrast_reap without preserve guards: The pruning used refusal_contrast_reap selection without the preserve_super or preserve_outlier guardrails, a configuration that led to output collapse at high prune ratios in prior GLM-5 experiments.
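
Since no perplexity measurement exists for this checkpoint, anyone evaluating it will need to compute their own. Given per-token log-probabilities (natural log) from any scoring run, perplexity is the exponential of the negative mean log-probability. A minimal sketch; how the log-probabilities are obtained is left open here and is not specified by the card:

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean(log p)) over the scored tokens (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy input: every token was assigned probability 0.25 by the model.
toy_logprobs = [math.log(0.25)] * 4
print(f"perplexity: {perplexity(toy_logprobs):.2f}")
```

Comparing this number between the pruned checkpoint and the base model on the same text gives a first, coarse signal of how much the pruning cost.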

What This Is Useful For

Research only. Specifically:
- Studying REAP expert saliency patterns in GLM-5.1
- Comparing prune-ratio robustness across architectures
- Running your own coherence/benchmark evaluations
- Investigating MoE collapse behavior
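
For a first coherence check, one cheap signal of the repetitive loops mentioned earlier is the share of n-grams in a generation that repeat an earlier n-gram. The sketch below is one possible heuristic, not a metric used by the card's author; whitespace tokenization and any cutoff for calling an output "collapsed" are assumptions to tune.

```python
from typing import List, Sequence, Tuple


def repeated_ngram_fraction(tokens: Sequence[str], n: int = 3) -> float:
    """Fraction of n-grams that duplicate another n-gram in the sequence.

    0.0 means every n-gram is unique; values near 1.0 indicate looping text.
    """
    ngrams: List[Tuple[str, ...]] = [
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    ]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


healthy = "the cat sat on the mat".split()
looping = ["a", "b"] * 10

print(f"healthy text: {repeated_ngram_fraction(healthy, n=2):.2f}")
print(f"looping text: {repeated_ngram_fraction(looping, n=2):.2f}")
```

Running this over a batch of generations from each sibling checkpoint gives a quick way to rank them before investing in full benchmarks.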

How to Load

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-5.1-555B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-5.1-555B", trust_remote_code=True)

# IMPORTANT: GLM-5.1 is a thinking/chat model. Use the chat template.
messages = [{"role": "user", "content": "Hello"}]
# return_dict=True returns input_ids and attention_mask together,
# so generate() receives both regardless of the library's default return type.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:]))

Pruning Method

REAP (Router-weighted Expert Activation Pruning) scores each MoE expert during a calibration pass by combining the router's gate values with the magnitude of that expert's output activations. Experts with the lowest saliency scores (combined REAP signal + frequency weighting) are removed layer by layer. Top-8 routing is kept unchanged, so the active-parameter budget per token stays the same.
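
The selection step can be illustrated with a toy example. This is a minimal sketch, assuming the saliency of an expert is the mean of (gate weight × L2 norm of the expert's output) over the tokens routed to it. The frequency weighting and the `refusal_contrast_reap` variant mentioned in this card are not specified here and are left out, and the function names are illustrative rather than taken from any REAP codebase.

```python
from typing import List, Tuple

# One record per (token, expert) pair chosen by the router:
# (expert_id, gate_weight, L2 norm of that expert's output for the token).
Record = Tuple[int, float, float]


def reap_saliency(records: List[Record], num_experts: int) -> List[float]:
    """Mean of gate_weight * output_norm over the tokens routed to each expert."""
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for expert_id, gate, norm in records:
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    # Experts that never fired on the calibration data get saliency 0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def experts_to_keep(saliency: List[float], keep: int) -> List[int]:
    """Indices of the `keep` highest-saliency experts, in ascending order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])


# Toy layer with 4 experts; prune 25% (keep 3), mirroring 256 -> 192 at small scale.
records = [
    (0, 0.6, 2.0), (1, 0.4, 1.0),
    (0, 0.5, 2.0), (2, 0.5, 0.2),
    (3, 0.9, 3.0), (1, 0.1, 1.0),
]
scores = reap_saliency(records, num_experts=4)
print("saliency:", [round(s, 2) for s in scores])
print("kept experts:", experts_to_keep(scores, keep=3))
```

In this toy layer the expert with both low gate weight and small output norm is the one dropped; the router itself and the number of experts selected per token are left untouched.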

Sibling Checkpoints

Prune %	Total Params	Experts/layer	HuggingFace
25%	~555B	192/256	`0xSero/GLM-5.1-555B`
40%	~444B	154/256	`0xSero/GLM-5.1-444B`
50%	~367B	128/256	`0xSero/GLM-5.1-367B-A14B-REAP`

All three are untested. The 25% checkpoint is the most likely to be coherent.
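
The experts-per-layer column follows from the prune ratio applied to the 256 routed experts of the base model. A small sketch that reproduces the column; rounding to the nearest integer is an assumption inferred from the 154 listed for the 40% checkpoint, and the last value printed is simply top-8 routing expressed as a share of the remaining experts:

```python
ROUTED_EXPERTS = 256  # routed experts per layer in the base model
TOP_K = 8             # experts selected per token (unchanged by pruning)


def experts_kept(prune_ratio: float) -> int:
    """Routed experts remaining per layer after pruning (rounding is an assumption)."""
    return round(ROUTED_EXPERTS * (1 - prune_ratio))


for ratio in (0.25, 0.40, 0.50):
    kept = experts_kept(ratio)
    active_share = 100 * TOP_K / kept
    print(f"{ratio:.0%} pruned: {kept}/{ROUTED_EXPERTS} experts, "
          f"{active_share:.2f}% of them active per token")
```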

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}