How to run 0xSero/GLM-5.1-444B with Transformers, vLLM, SGLang, and Docker Model Runner.

Libraries

How to use 0xSero/GLM-5.1-444B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="0xSero/GLM-5.1-444B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so it is visible when run as a script, not only in a notebook
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-5.1-444B")
model = AutoModelForCausalLM.from_pretrained("0xSero/GLM-5.1-444B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use 0xSero/GLM-5.1-444B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/GLM-5.1-444B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/GLM-5.1-444B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
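
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the server returns the usual OpenAI-style chat completion body (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt, model="0xSero/GLM-5.1-444B",
                       base_url="http://localhost:8000"):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response body.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat("What is the capital of France?"))` issues the request. For the SGLang examples further down, pass `base_url="http://localhost:30000"`.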

Use Docker

docker model run hf.co/0xSero/GLM-5.1-444B

SGLang

How to use 0xSero/GLM-5.1-444B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/GLM-5.1-444B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/GLM-5.1-444B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/GLM-5.1-444B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/GLM-5.1-444B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use 0xSero/GLM-5.1-444B with Docker Model Runner:
```
docker model run hf.co/0xSero/GLM-5.1-444B
```


GLM-5.1-444B

REAP-pruned zai-org/GLM-5.1.

At a glance


| Property | Value |
|---|---|
| Base model | zai-org/GLM-5.1 |
| Format | BF16 |
| Total params | 444B |
| Active / token | 14B |
| Experts / layer | 154 |
| Layers | 78 |
| Hidden size | 6144 |
| Context | 202,752 |
| On-disk size | 910 GB |
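
As a rough sanity check on these figures, BF16 stores each parameter in 2 bytes, so the rounded 444B parameter count implies about 888 GB of weights, close to the listed 910 GB. The sketch below works through that arithmetic and a lower bound on the number of accelerators needed to hold the weights alone. The 80 GB device size is an assumption chosen for illustration, and the estimate ignores KV cache and activation memory.

```python
import math


def weight_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in decimal GB (BF16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9


def min_devices(weight_gb: float, device_gb: float) -> int:
    """Lower bound on devices needed to hold the weights only."""
    return math.ceil(weight_gb / device_gb)


estimated_gb = weight_size_gb(444e9)      # from the rounded parameter count
devices = min_devices(910, device_gb=80)  # 80 GB per device is an assumption
print(estimated_gb, devices)
```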

Which variant should I pick?

| Variant | Format | Link |
|---|---|---|
| `GLM-5.1-444B` (this) | BF16 | link |
| `GLM-5.1-444B-GGUF` | GGUF | link |
| `GLM-5.1-478B-NVFP4` | NVFP4 | link |
| `GLM-5.1-555B` | BF16 | link |
| `GLM-5.1-555B-GGUF` | GGUF | link |
| `GLM-5.1-555B-W4A16` | W4A16 | link |
| `GLM-5.1-555B-NVFP4` | NVFP4 | link |

Use the 25% pruned version instead: 0xSero/GLM-5.1-555B

For GGUF: 0xSero/GLM-5.1-555B-GGUF

GLM-5.1 - 40% Expert Pruned (REAP) - BF16

This is a 40% expert-pruned version of zai-org/GLM-5.1 using REAP.

| Property | Value |
|---|---|
| Base model | zai-org/GLM-5.1 |
| Architecture | GlmMoeDsaForCausalLM |
| Routed experts | 256 -> 154 (40% removed) |
| Active params/token | ~14B (top-8 routing) |
| Precision | BF16 |
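
The expert counts follow from the pruning fraction. The sketch below reproduces the 154 figure for this repo and the 192 figure quoted on this card for the 25% variant. Rounding to the nearest integer is an assumption made for illustration; how REAP actually picks the count is not stated here.

```python
def experts_kept(total: int, prune_fraction: float) -> int:
    """Routed experts remaining per layer after pruning (nearest-integer rounding assumed)."""
    return round(total * (1 - prune_fraction))


kept_40 = experts_kept(256, 0.40)      # this repo
kept_25 = experts_kept(256, 0.25)      # the 25% pruned variant
removed_share = (256 - kept_40) / 256  # exact share removed, nominally 40%
print(kept_40, kept_25, removed_share)
```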

Known Issues

This model enters repetition loops on ~29% of test probes when generating long-form code or structured output. Affected tasks include:

- Complex code generation (red-black trees, B-trees, chess engines, regex engines)
- Structured output (comparison tables, API specs, enum lists)
- LaTeX-heavy math

The root cause is that removing 40% of the routed experts exceeds the model's pruning tolerance. The 25% pruned variant (192 of 256 experts retained) eliminates all repetition loops.
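
If you run this variant anyway, a cheap guard can flag outputs that ended in a loop. The following is a minimal heuristic sketch, not the probe method used to measure the ~29% figure: it checks whether the tail of the output is the same run of whitespace-separated tokens repeated back to back. The function name and both thresholds are hypothetical.

```python
def has_repetition_loop(text: str, max_period: int = 50, min_repeats: int = 4) -> bool:
    """Return True if the output ends with one token run repeated `min_repeats` times.

    Heuristic only: splits on whitespace and inspects the tail of the text,
    where a stuck generation typically ends up.
    """
    tokens = text.split()
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(tokens) < needed:
            break  # longer periods need even more tokens
        tail = tokens[-needed:]
        unit = tail[:period]
        if all(tail[i] == unit[i % period] for i in range(needed)):
            return True
    return False
```

A caller could retry or truncate when this returns True. The defaults trade false positives for recall and would need tuning for output that is legitimately repetitive, such as table rows.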

Sibling Models

| Model | Prune % | Status |
|---|---|---|
| 0xSero/GLM-5.1-555B | 25% | Recommended |
| 0xSero/GLM-5.1-555B-GGUF | 25% Q4 GGUF | Recommended |
| This repo | 40% | Has repetition issues |
| 0xSero/GLM-5.1-444B-GGUF | 40% Q4 GGUF | BROKEN |

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}