Instructions to use 0xSero/GLM-4.7-218B-W4A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use 0xSero/GLM-4.7-218B-W4A16 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="0xSero/GLM-4.7-218B-W4A16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-218B-W4A16")
model = AutoModelForCausalLM.from_pretrained("0xSero/GLM-4.7-218B-W4A16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use 0xSero/GLM-4.7-218B-W4A16 with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/GLM-4.7-218B-W4A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "0xSero/GLM-4.7-218B-W4A16",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
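
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library, assuming the server is listening on `localhost:8000` as in the command above. The helper names `build_chat_request` and `send` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "0xSero/GLM-4.7-218B-W4A16"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (needs a running server):
#   print(send(build_chat_request("What is the capital of France?")))
```

Building the request separately from sending it makes the payload easy to inspect before any network call. For the SGLang server, pass `base_url="http://localhost:30000"`.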

Use Docker

```
docker model run hf.co/0xSero/GLM-4.7-218B-W4A16
```

SGLang

How to use 0xSero/GLM-4.7-218B-W4A16 with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/GLM-4.7-218B-W4A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "0xSero/GLM-4.7-218B-W4A16",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/GLM-4.7-218B-W4A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "0xSero/GLM-4.7-218B-W4A16",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use 0xSero/GLM-4.7-218B-W4A16 with Docker Model Runner:
```
docker model run hf.co/0xSero/GLM-4.7-218B-W4A16
```


GLM-4.7-218B-W4A16

W4A16 quantization of GLM-4.7-REAP-218B-A32B, the REAP-pruned variant of zai-org/GLM-4.7.

At a glance


| Property | Value |
|---|---|
| Base model | cerebras/GLM-4.7-REAP-218B-A32B |
| Format | W4A16 |
| Total params | 218B |
| Active / token | 32B |
| Experts / layer | 96 |
| Layers | 92 |
| Hidden size | 5120 |
| Context | 202,752 |
| On-disk size | 116 GB |

Which variant should I pick?

| Variant | Format | Link |
|---|---|---|
| `GLM-4.7-185B` | BF16 | link |
| `GLM-4.7-185B-W4A16` | W4A16 | link |
| `GLM-4.7-202B` | BF16 | link |
| `GLM-4.7-218B-W4A16` (this) | W4A16 | link |
| `GLM-4.7-REAP-40-W4A16` | W4A16 | link |

40% Expert-Pruned + INT4 Quantized GLM-4.7 (218B total / 32B active params, ~116GB)

A highly compressed version of GLM-4.7 that combines REAP expert pruning (40% of experts removed) with INT4 weight quantization (AutoRound W4A16). This model is ~6.5x smaller than the original ~700GB GLM-4.7.

Model Details

| Property | Value |
|---|---|
| Base Model | GLM-4.7-REAP-218B-A32B |
| Original (GLM-4.7) | 358B params, ~717GB |
| After REAP Pruning | 218B params, ~407GB |
| After W4A16 Quant | 218B params, ~108GB |
| Active Parameters | 32B per forward pass |
| Total Compression | ~6.5x from original |
| Quantization | INT4 weights, FP16 activations |
| Group Size | 128 |
| Format | AutoRound |
| VRAM Required | ~110GB |

Compression Pipeline

```
GLM-4.7 (358B, ~700GB)
        |
        v  REAP 40% pruning (96 of 160 experts per layer kept)
        |
GLM-4.7-REAP-218B-A32B (218B, ~407GB)
        |
        v  AutoRound W4A16 quantization
        |
GLM-4.7-REAP-218B-A32B-W4A16 (218B, ~108GB)  <-- This model

Total: ~6.5x compression
```
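
The ratios in the pipeline follow directly from the sizes listed under Model Details. A small sketch that recomputes them from the ~717GB, ~407GB and ~108GB figures; the variable and function names are illustrative.

```python
# Sizes in GB, taken from the Model Details table.
ORIGINAL_GB = 717   # GLM-4.7, BF16
PRUNED_GB = 407     # after REAP pruning, BF16
QUANTIZED_GB = 108  # after W4A16 quantization

# Experts per layer before and after pruning (96 of 160 kept).
EXPERTS_BEFORE = 160
EXPERTS_AFTER = 96


def ratio(before, after):
    """Compression factor: how many times smaller `after` is than `before`."""
    return before / after


experts_removed = 1 - EXPERTS_AFTER / EXPERTS_BEFORE
pruning_ratio = ratio(ORIGINAL_GB, PRUNED_GB)
quant_ratio = ratio(PRUNED_GB, QUANTIZED_GB)
total_ratio = ratio(ORIGINAL_GB, QUANTIZED_GB)

print(f"experts removed: {experts_removed:.0%}")
print(f"pruning:         {pruning_ratio:.2f}x")
print(f"quantization:    {quant_ratio:.2f}x")
print(f"total:           {total_ratio:.2f}x")
```

With the ~717GB figure the total comes out at about 6.6x; with the rounded ~700GB figure it is about 6.5x, which matches the number quoted in this card. Most of the reduction comes from quantization (about 3.8x), with pruning contributing about 1.8x.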

Usage

📊 Benchmarks

Tested on 8x RTX 3090:

| Metric | Value |
|---|---|
| Prefill | 375 tps |
| Generation | 38.5 tps |
| Time to First Token | 3.82s |
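
These numbers are enough for a rough latency estimate. The sketch below assumes a simple model that is not stated in this card: total time is the time to first token plus the remaining output tokens divided by the generation rate, with the generation figure read as tokens per second. The prompt length behind the 3.82s measurement is not given, so treat the results as ballpark only.

```python
TTFT_S = 3.82          # time to first token, from the table above
GENERATION_TPS = 38.5  # generation rate, assumed to be tokens per second
PREFILL_TPS = 375      # prefill rate in tokens per second


def estimate_latency(output_tokens, ttft_s=TTFT_S, gen_tps=GENERATION_TPS):
    """Approximate seconds until the last of `output_tokens` arrives."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    # The first token arrives at TTFT; the rest stream at the generation rate.
    return ttft_s + (output_tokens - 1) / gen_tps


def estimate_prefill(prompt_tokens, prefill_tps=PREFILL_TPS):
    """Approximate seconds to process a prompt at the measured prefill rate."""
    return prompt_tokens / prefill_tps


for n in (100, 500, 2000):
    print(f"{n:>5} output tokens: ~{estimate_latency(n):.1f}s")
```

At these rates, long prompts dominate: `estimate_prefill(30000)` is 80 seconds, so agentic workloads with large contexts will feel the prefill speed more than the generation speed.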

Deployment

vLLM

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve GLM-4.7-REAP-218B-A32B-W4A16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --max-model-len 165000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e4m3 \
  --tool-call-parser glm47 \
  --served-model-name glm-4.7 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```
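
A quick sanity check of the flags above. The sketch assumes the 8x RTX 3090 setup from the benchmark section with 24GB per card (a hardware figure not stated in this card) and the ~108GB weight size from Model Details. It ignores activation memory and framework overhead, so the leftover figure is an upper bound on what the KV cache can use.

```python
GPU_COUNT = 8                  # CUDA_VISIBLE_DEVICES=0..7
GPU_MEMORY_GB = 24             # assumption: RTX 3090, 24GB each
TENSOR_PARALLEL = 4            # --tensor-parallel-size
PIPELINE_PARALLEL = 2          # --pipeline-parallel-size
GPU_MEMORY_UTILIZATION = 0.92  # --gpu-memory-utilization
WEIGHTS_GB = 108               # quantized weights, from Model Details


def check_parallelism(gpus, tp, pp):
    """The run uses tp * pp GPUs; return True when that matches the visible set."""
    return tp * pp == gpus


def memory_budget(gpus, per_gpu_gb, utilization, weights_gb):
    """Return (usable_gb, left_after_weights_gb), ignoring activations and overhead."""
    usable = gpus * per_gpu_gb * utilization
    return usable, usable - weights_gb


usable_gb, kv_gb = memory_budget(
    GPU_COUNT, GPU_MEMORY_GB, GPU_MEMORY_UTILIZATION, WEIGHTS_GB
)
print("parallelism ok:", check_parallelism(GPU_COUNT, TENSOR_PARALLEL, PIPELINE_PARALLEL))
print(f"usable: {usable_gb:.1f}GB, left after weights: {kv_gb:.1f}GB")
```

Under these assumptions roughly 177GB is usable and about 69GB remains after the weights, which is presumably why the command pairs a long `--max-model-len` with an FP8 KV cache and a small `--max-num-seqs`.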

AutoRound Quantization Details

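AutoRound is Intel's weight-only quantization method. It tunes how each weight is rounded and clipped using signed gradient descent on a small calibration set. The settings used for this model: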

```
bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
dataset: NeelNanda/pile-10k
```
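
To make `bits: 4` and `group_size: 128` concrete, here is a minimal sketch of group-wise 4-bit weight quantization using plain round-to-nearest. This is not AutoRound: AutoRound additionally tunes the rounding and clipping with signed gradient descent on calibration data. The asymmetric min/max scheme and all function names are assumptions chosen for illustration.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    levels = (1 << bits) - 1            # 15 for 4 bits: codes 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # avoid a zero scale for constant groups
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


def quantize(weights, bits=4, group_size=128):
    """Quantize a flat weight list group by group; each group gets its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Rebuild the full approximate weight list from all groups."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(dequantize_group(codes, scale, lo))
    return restored


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(512)]
groups = quantize(weights, bits=4, group_size=128)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, max abs error {max_error:.5f}")
```

Each group of 128 weights stores one scale and one offset next to its 4-bit codes, so the effective cost is slightly above 4 bits per weight. Smaller groups track the weight distribution more closely at the price of more metadata.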

Reproduce This Model

```bash
# 1. Download the BF16 REAP model
huggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B

# 2. Run AutoRound quantization
pip install auto-round

python -c "
from auto_round import AutoRound

# bits and group_size are not passed here, so the library defaults apply
# (assumed to be 4 and 128, matching the settings listed above).
ar = AutoRound(
    './GLM-4.7-REAP-218B-A32B',
    device='cuda',
    device_map='auto',
    nsamples=64,   # number of calibration samples
    seqlen=512,    # tokens per calibration sample
    batch_size=1
)
ar.quantize_and_save('./GLM-4.7-REAP-218B-A32B-W4A16', format='auto_round')
"

# Takes ~2 hours on 8x H200
```

Related Models

| Model | Params | Size | Format | Link |
|---|---|---|---|---|
| GLM-4.7 (Base) | 358B | ~700GB | BF16 | zai-org/GLM-4.7 |
| GLM-4.7-REAP-218B-A32B | 218B | ~407GB | BF16 | 0xSero/GLM-4.7-REAP-218B-A32B |
| This Model | 218B | ~108GB | W4A16 | - |

Benchmarks

Benchmarks in progress

| Benchmark | GLM-4.7 Base | REAP BF16 | REAP W4A16 |
|---|---|---|---|
| HumanEval | - | - | - |
| MBPP | - | - | - |
| GSM8K | - | - | - |

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}