GLM-5.1-555B
REAP-pruned zai-org/GLM-5.1.
At a glance
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5.1 |
| Format | BF16 |
| Total params | 555B |
| Active / token | 14B |
| Experts / layer | 192 |
| Layers | 78 |
| Hidden size | 6144 |
| Context | 202,752 |
| On-disk size | 1125 GB |
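As a rough sanity check on the on-disk size, BF16 stores each parameter in 2 bytes, so the weight footprint can be estimated from the parameter count. The sketch below is illustrative only: the helper name `bf16_weight_gb` is hypothetical, it assumes decimal gigabytes, and it ignores tokenizer and config files.

```python
def bf16_weight_gb(total_params: float) -> float:
    """Estimate BF16 weight size in decimal gigabytes (2 bytes per parameter)."""
    bytes_per_param = 2  # BF16 is a 16-bit format
    return total_params * bytes_per_param / 1e9


# 555B is the rounded total from the table above.
estimate_gb = bf16_weight_gb(555e9)
print(f"Estimated BF16 weights: {estimate_gb:.0f} GB")
```

The estimate comes out at about 1110 GB, close to the 1125 GB listed above. The small gap is plausibly explained by the parameter count being rounded, though the card does not say so.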
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| GLM-5.1-444B | BF16 | link |
| GLM-5.1-444B-GGUF | GGUF | link |
| GLM-5.1-478B-NVFP4 | NVFP4 | link |
| GLM-5.1-555B (this) | BF16 | link |
| GLM-5.1-555B-GGUF | GGUF | link |
| GLM-5.1-555B-NVFP4 | NVFP4 | link |
| GLM-5.1-555B-W4A16 | W4A16 | link |
DO NOT USE THIS MODEL FOR ANYTHING SERIOUS.
This checkpoint has not been benchmarked, validated, or tested for coherence. It may produce garbage, repetitive loops, incoherent text, or complete nonsense. Treat it as a broken artifact until proven otherwise.
GLM-5.1 — 25% Expert Pruned (REAP)
This is a 25% expert-pruned version of zai-org/GLM-5.1, produced with the REAP method (Router-weighted Expert Activation Pruning).
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5.1 |
| Architecture | GlmMoeDsaForCausalLM (MoE with Dynamic Sparse Attention) |
| Params before prune | 743.91B |
| Params after prune | ~555B |
| Parameter reduction | 25.4% |
| Routed experts per layer | 256 → 192 (removed 64) |
| Shared experts per layer | 1 (unchanged) |
| Active params/token | ~14B (top-8 routing preserved) |
| Precision | BF16 |
| Prune method | REAP (layerwise, refusal_contrast_reap, renorm) |
| Sparse MoE layers | 75 of 78 total (first 3 are dense) |
| Estimated max per-layer REAP signal loss | ~15.8% |
| Observation coverage | 6144/6999 packed batches, 7707/22000 samples (~35% of planned calibration) |
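The figures in the table can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers quoted above; the variable names are illustrative.

```python
params_before_b = 743.91   # billions, from the table
params_after_b = 555.0     # billions, approximate ("~555B")
experts_before, experts_after = 256, 192
samples_seen, samples_planned = 7707, 22000

param_reduction = 1 - params_after_b / params_before_b
expert_reduction = 1 - experts_after / experts_before
calibration_coverage = samples_seen / samples_planned

print(f"Parameter reduction:  {param_reduction:.1%}")
print(f"Routed experts cut:   {expert_reduction:.1%}")
print(f"Calibration coverage: {calibration_coverage:.1%}")
```

The three values match the table (25.4%, 25.0%, and about 35%). Because only routed experts are removed, the total parameter reduction would be expected to sit at or below the 25% expert reduction, so the 25.4% figure suggests that ~555B is a rounded number rather than an exact count.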
Why This Might Be Broken
Partial calibration data: The saliency scores used to select experts for removal were computed from only ~35% of the planned 22,000-sample calibration corpus. Expert importance rankings may be inaccurate.
No quality testing whatsoever: Zero benchmarks have been run. No coherence check. No perplexity measurement. No human evaluation. The model could produce degenerate output for all we know.
Aggressive prune ratio: Prior experiments with GLM-family models at similar or higher prune ratios resulted in complete output collapse (repetitive text, broken reasoning, junk logits). The 50% checkpoint in particular is very likely broken based on prior GLM-5 evidence.
DSA architecture sensitivity: GLM-5.1 uses Dynamic Sparse Attention with learned indexer weights. The interaction between pruned expert routing and the DSA indexer has not been validated.
refusal_contrast_reap without preserve guards: The pruning was done using refusal_contrast_reap selection without preserve_super or preserve_outlier guardrails, which in prior GLM-5 experiments led to output collapse at high prune ratios.
What This Is Useful For
Research only. Specifically:
- Studying REAP expert saliency patterns in GLM-5.1
- Comparing prune-ratio robustness across architectures
- Running your own coherence/benchmark evaluations
- Investigating MoE collapse behavior
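For a first-pass coherence check, one inexpensive signal is how often the output repeats the same word n-grams, since repetitive loops are among the failure modes this card warns about. The function below is a minimal sketch, not a substitute for perplexity or benchmark runs. The name `repeated_ngram_fraction` and the choice of trigrams are assumptions made for illustration.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Return the fraction of word n-grams that duplicate an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


looping = "the cat sat " * 10
varied = "Paris is the capital of France and sits on the Seine"
print(repeated_ngram_fraction(looping))  # high: the text loops
print(repeated_ngram_fraction(varied))   # zero: no trigram repeats
```

Applied to generations from this checkpoint, a score near 1.0 would point to looping output, while a low score only rules out that one failure mode.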
How to Load
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-5.1-555B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-5.1-555B", trust_remote_code=True)

# IMPORTANT: GLM-5.1 is a thinking/chat model. Use the chat template.
messages = [{"role": "user", "content": "Hello"}]
# return_dict=True returns input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:]))
Pruning Method
REAP (Router-weighted Expert Activation Pruning) removes MoE experts based on saliency scores computed from their router-weighted activations during a calibration pass. Experts with the lowest saliency scores (combined REAP signal + frequency weighting) are removed layer by layer. Top-8 routing is left unchanged, so the active-parameter budget per token stays the same.
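The selection step can be sketched in pure Python for a single layer. This follows a simplified reading of the criterion in the cited REAP paper: an expert's saliency is the mean of gate weight times expert output norm over the tokens routed to it. It does not reproduce the frequency weighting or refusal_contrast_reap selection used for this checkpoint, and the function names and input layout are hypothetical.

```python
from collections import defaultdict


def reap_saliency(routing_records):
    """Mean of gate_weight * output_norm over the tokens routed to each expert.

    routing_records holds one list per token of
    (expert_id, gate_weight, output_norm) tuples for that token's active
    experts. This input layout is an assumption made for the example.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for token_routes in routing_records:
        for expert_id, gate_weight, output_norm in token_routes:
            totals[expert_id] += gate_weight * output_norm
            counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def select_experts_to_prune(saliency, num_experts, num_to_prune):
    """Pick the lowest-saliency experts; experts never routed to score 0."""
    scores = sorted((saliency.get(e, 0.0), e) for e in range(num_experts))
    return sorted(e for _, e in scores[:num_to_prune])


# Toy layer: 4 experts, top-2 routing, 3 calibration tokens.
records = [
    [(0, 0.7, 2.0), (1, 0.3, 1.0)],
    [(0, 0.6, 1.5), (2, 0.4, 0.5)],
    [(1, 0.5, 1.0), (2, 0.5, 0.4)],
]
saliency = reap_saliency(records)
pruned = select_experts_to_prune(saliency, num_experts=4, num_to_prune=1)
print(saliency)
print("pruned experts:", pruned)
```

In the toy data, expert 3 is never selected by the router, so it scores zero and is pruned first. For this checkpoint the analogous step would remove 64 of 256 routed experts in each sparse layer.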
Sibling Checkpoints
| Prune % | Total Params | Experts/layer | HuggingFace |
|---|---|---|---|
| 25% | ~555B | 192/256 | 0xSero/GLM-5.1-555B |
| 40% | ~444B | 154/256 | 0xSero/GLM-5.1-444B |
| 50% | ~367B | 128/256 | 0xSero/GLM-5.1-367B-A14B-REAP |
All three are untested. The 25% checkpoint is the most likely to be coherent.
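The expert counts and parameter totals for the sibling checkpoints can be approximated from the 25% data point. The sketch below assumes every routed expert has the same size and that only the 75 sparse layers lose experts. The per-expert size is backed out from the rounded ~555B figure, so the results are estimates, not measured counts.

```python
ROUTED_EXPERTS = 256
SPARSE_LAYERS = 75
PARAMS_BEFORE_B = 743.91
PARAMS_AT_25_B = 555.0  # approximate figure for the 25% checkpoint

# Back out an approximate per-expert size from the 25% checkpoint.
removed_at_25 = (ROUTED_EXPERTS - 192) * SPARSE_LAYERS
params_per_expert_b = (PARAMS_BEFORE_B - PARAMS_AT_25_B) / removed_at_25


def estimate(prune_ratio):
    """Return (experts kept per layer, estimated total params in billions)."""
    kept = round(ROUTED_EXPERTS * (1 - prune_ratio))
    removed = (ROUTED_EXPERTS - kept) * SPARSE_LAYERS
    return kept, PARAMS_BEFORE_B - removed * params_per_expert_b


for ratio in (0.25, 0.40, 0.50):
    kept, total_b = estimate(ratio)
    print(f"{ratio:.0%}: {kept}/{ROUTED_EXPERTS} experts, ~{total_b:.0f}B params")
```

Under this linear assumption the 40% and 50% checkpoints come out near 443B and 366B, in line with the 444B and 367B in their repository names.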
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.