GLM-5.1-555B-NVFP4

NVFP4 quantization of 0xSero/GLM-5.1-555B.

At a glance


Base model	0xSero/GLM-5.1-555B
Format	NVFP4
Total params	555B
Active / token	14B
Experts / layer	192
Layers	78
Hidden size	6144
Context	202,752
On-disk size	320 GB

Which variant should I pick?

Variant	Format	Link
`GLM-5.1-444B`	BF16	link
`GLM-5.1-444B-GGUF`	GGUF	link
`GLM-5.1-478B-NVFP4`	NVFP4	link
`GLM-5.1-555B`	BF16	link
`GLM-5.1-555B-GGUF`	GGUF	link
`GLM-5.1-555B-NVFP4` (this)	NVFP4	link
`GLM-5.1-555B-W4A16`	W4A16	link

NVFP4 quantization of 0xSero/GLM-5.1-555B — a REAP-pruned variant of GLM-5.1 (192 experts per MoE layer, down from 256).

Target hardware: 8× RTX PRO 6000 Blackwell 96GB (sm120) via sglang. See deploy recipe below.

Model details

Property	Value
Architecture	`GlmMoeDsaForCausalLM` (DeepSeek Sparse Attention + MLA)
Base precision	BF16 (source: 1.1 TB)
Quantization	NVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size	320 GB (~3.4× compression)
Experts per MoE layer	192 (REAP-pruned from 256)
Layers	78
Format	`nvfp4-pack-quantized` via `compressed-tensors`
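
To make the quantization row more concrete, here is a minimal sketch of block-wise FP4 quantization. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and one scale per group of 16 weights. As a simplification it keeps the scale as a full-precision float instead of rounding it to FP8. The function names are hypothetical, and this is not the AutoRound or compressed-tensors implementation.

```python
# Illustrative sketch only: simulates NVFP4-style block quantization in pure Python.
# Assumptions: E2M1 magnitudes as listed below, one scale per 16-weight group,
# and the scale kept as a Python float rather than rounded to FP8.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16


def quantize_group(values):
    """Return (scale, codes); each code is a signed value from the E2M1 grid."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude to 6.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate weights from the shared scale and the 4-bit codes."""
    return [scale * c for c in codes]


def bits_per_weight(weight_bits=4, scale_bits=8, group_size=GROUP_SIZE):
    """Average storage cost: a 4-bit code plus an 8-bit scale shared by 16 weights."""
    return weight_bits + scale_bits / group_size


group = [0.01 * i - 0.08 for i in range(GROUP_SIZE)]
scale, codes = quantize_group(group)
restored = dequantize_group(scale, codes)
print("worst abs error:", max(abs(a - b) for a, b in zip(group, restored)))
print("bits per weight:", bits_per_weight())
```

Under these assumptions a fully quantized tensor costs 4.5 bits per weight, an ideal ratio of about 3.56× against BF16. The reported ~3.4× is slightly lower, which is consistent with some layers being kept in BF16.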

Layers kept in BF16 (per AutoRound ignore pattern)

lm_head
model.layers.[0-2].mlp.{gate,up,down}_proj (first 3 layers' experts — most sensitive)
model.layers.[0-77].self_attn.indexer.weights_proj (DSA indexer, quant-sensitive)
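
The three patterns above can be checked against module names with a small helper. This is a sketch under assumptions: the regular expressions are a hand translation of the listed patterns rather than the exact strings passed to AutoRound, and the example module names in the loop are hypothetical.

```python
import re

# Hand-written regex translation of the three BF16 patterns (an assumption,
# not the literal ignore list used during quantization).
BF16_PATTERNS = [
    re.compile(r"^lm_head$"),
    re.compile(r"^model\.layers\.[0-2]\.mlp\.(gate|up|down)_proj$"),
    re.compile(r"^model\.layers\.([0-9]|[1-6][0-9]|7[0-7])\.self_attn\.indexer\.weights_proj$"),
]


def stays_bf16(module_name: str) -> bool:
    """True if the module name matches one of the BF16-kept patterns."""
    return any(p.match(module_name) for p in BF16_PATTERNS)


# Hypothetical module names, used only to exercise the helper.
for name in [
    "lm_head",
    "model.layers.1.mlp.up_proj",
    "model.layers.3.mlp.up_proj",
    "model.layers.40.self_attn.indexer.weights_proj",
]:
    print(name, "->", "BF16" if stays_bf16(name) else "NVFP4")
```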

Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses the pre-built voipmonitor/sglang:cu130 Docker image, which has all sm120 patches applied.

docker run --gpus all --ipc=host --shm-size=8g --network=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v jit-cache:/cache/jit \
  -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
  -e SGLANG_ENABLE_DEEP_GEMM=0 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_MIN_NCHANNELS=8 \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path 0xSero/GLM-5.1-555B-NVFP4 \
    --served-model-name glm-5.1-reap \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --tensor-parallel-size 8 \
    --quantization compressed-tensors \
    --kv-cache-dtype bf16 \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --chunked-prefill-size 16384 \
    --attention-backend flashinfer \
    --fp4-gemm-backend b12x \
    --moe-runner-backend b12x \
    --host 0.0.0.0 --port 5000

Critical flags:

--kv-cache-dtype bf16: mandatory; fp8_e4m3 produces garbled output on sm120
--attention-backend flashinfer: sm120-compatible (trtllm_mha and flashmla are not)
SGLANG_ENABLE_DEEP_GEMM=0: DeepGEMM requires WGMMA/TCGEN05 instructions, which sm120 lacks
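
Once the server is up, it can be queried through SGLang's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at localhost on port 5000 under the served model name glm-5.1-reap, as in the command above. The helper names and the timeout are illustrative choices.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:5000", model="glm-5.1-reap"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=600):
    """Send the prompt and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
#   print(ask("What is the capital of France?"))
```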

Memory fit: 320 GB of weights plus KV cache fit on 8× 96GB (768 GB total VRAM). Minimum viable: 6× RTX PRO 6000 with --tp 2 --pp 3.
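
A back-of-the-envelope version of this memory budget is sketched below. It assumes the weights shard evenly across GPUs and that --mem-fraction-static bounds the pool shared by weights and KV cache. It ignores activations, CUDA graphs, and other runtime overhead, so the headroom figure is an upper bound rather than a measurement.

```python
def per_gpu_budget(weights_gb=320, num_gpus=8, gpu_gb=96, mem_fraction_static=0.85):
    """Rough per-GPU split between weights and KV-cache headroom (all values in GB)."""
    weights_per_gpu = weights_gb / num_gpus      # assumes an even shard per GPU
    static_pool = gpu_gb * mem_fraction_static   # pool for weights plus KV cache
    return {
        "total_vram_gb": num_gpus * gpu_gb,
        "weights_per_gpu_gb": weights_per_gpu,
        "static_pool_gb": static_pool,
        "kv_headroom_gb": static_pool - weights_per_gpu,
    }


for gpus in (8, 6):
    budget = per_gpu_budget(num_gpus=gpus)
    print(gpus, {k: round(v, 1) for k, v in budget.items()})
```

For the 8-GPU target this gives 40 GB of weights per GPU inside an 81.6 GB static pool, leaving roughly 41.6 GB per GPU for KV cache under these assumptions.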

Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.

Quantization method

Produced via AutoRound 0.12.2 layerwise mode on 8× H100 80GB.

Settings

Setting	Value	Notes
`--scheme`	NVFP4	4-bit weights + FP8 per-group scales
`--iters`	50	Reduced from the default 200 (loss trajectory confirms iters 100+ produce negligible improvement)
`--nsamples`	512	Calibration samples
`--seqlen`	2048	Default (seqlen=4096 tried; most samples too short after tokenization)
`--batch_size`	8	Default
`--low_gpu_mem_usage`	true	Required for 1.1TB source on 640GB VRAM
`--format`	auto_round:llm_compressor	Produces compressed-tensors (sglang/vLLM compatible)
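
The settings table can be turned into a command line roughly as follows. This is a sketch, not the exact invocation used for this build: the `auto-round` entry point with `--model` and `--output_dir` arguments, and the treatment of `--low_gpu_mem_usage` as a value-less switch, are assumptions to verify against `auto-round --help` for the installed version. The dataset, layerwise mode, and BF16 ignore options are deliberately left out, and the output directory is a placeholder.

```python
import shlex

# Values copied from the settings table; flag spellings are as listed there.
SETTINGS = {
    "--scheme": "NVFP4",
    "--iters": "50",
    "--nsamples": "512",
    "--seqlen": "2048",
    "--batch_size": "8",
    "--format": "auto_round:llm_compressor",
}


def build_autoround_command(model, output_dir):
    """Assemble an argv list from the table values."""
    # ASSUMPTION: `auto-round` CLI taking `--model` and `--output_dir`.
    cmd = ["auto-round", "--model", model, "--output_dir", output_dir]
    for flag, value in SETTINGS.items():
        cmd += [flag, value]
    # ASSUMPTION: low_gpu_mem_usage is a boolean switch passed without a value.
    cmd.append("--low_gpu_mem_usage")
    return cmd


# The output directory below is a placeholder, not the path used for this build.
print(shlex.join(build_autoround_command("0xSero/GLM-5.1-555B", "./GLM-5.1-555B-NVFP4")))
```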

Calibration dataset

Custom mix targeting realistic use cases (1,190 samples total → 505 valid after packing):

Source	Samples	Content
0xSero/structured-outputs-calibration-v1	430	JSON schemas, sharegpt-JSON, Mermaid diagrams
0xSero/reap-calibration-data-v1	560	100 long_context + 120 function_calling + 100 agentic + 60 coding + 40 cuda + 30 reasoning + 30 math + 40 terminal + 40 cybersecurity
NeelNanda/pile-10k	200	General web text (distribution anchor; provides long samples to compensate for short custom samples)

Multi-dataset loading used AutoRound's :concat=true option (patched during build; upstreamable) to pack short instruction samples into full-seqlen sequences.
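
The idea of packing short samples into full-length sequences can be sketched as below. This is a generic concatenate-and-chunk routine, not the patched AutoRound code. Dropping the trailing partial chunk is an assumption, and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(tokenized_samples, seqlen=2048):
    """Concatenate token lists and cut the stream into full seqlen-sized chunks.

    Assumption: any trailing remainder shorter than seqlen is dropped.
    """
    stream = []
    for tokens in tokenized_samples:
        stream.extend(tokens)
    full_chunks = len(stream) // seqlen
    return [stream[i * seqlen:(i + 1) * seqlen] for i in range(full_chunks)]


# Toy example with tiny "token" lists and seqlen=4 to show the behavior.
toy = [[1] * 5, [2] * 4, [3] * 3]
print(pack_sequences(toy, seqlen=4))

# Source sample counts from the table add up to the stated total.
print(430 + 560 + 200)
```

Packing many short samples into fewer full-length sequences would also explain why the number of valid sequences (505) is lower than the number of source samples (1,190).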

Wall time

Model load + offload: ~55 min
Calibration + quant: 6h 34m
Save: 7 min
Total: ~7.5 hours on 8× H100 80GB (brev compute)

Quality characteristics

Layer-level loss (iter 0 → iter 49) trajectory:

Layer depth	iter 0 loss	iter 49 loss (or reduction vs iter 0)	Behavior
0-2	0	0	Attention-only; MLP skipped
3-9	1e-6 to 1e-5	1e-6 to 1e-5	Iterative tuning minimal effect
10-30	1e-4 to 1e-2	30-50% reduction	Sign-tuning active
31-55	1e-2 to 1e-1	20-30% reduction	Accumulating
56-77	1e-1 to 8e-1	10-20% reduction	Deep-layer drift

Expected quality impact: running benchmarks on sm120 is recommended to measure the MMLU/GSM8K/IFEval gap versus the BF16 source. The loss magnitudes alone suggest non-trivial degradation in the deep layers; whether this matters in practice depends on the task.

Provenance

Source model: 0xSero/GLM-5.1-555B (BF16, 1.1 TB, 26 safetensors)
Quantization compute: Nebius H100×8 via brev
Quant tool: Intel AutoRound 0.12.2
Deploy tool: voipmonitor/sglang:cu130

License

MIT (inherits from base model).

Acknowledgements

Cerebras REAP team for the pruning recipe
voipmonitor for the sm120 sglang deployment guide
Intel AutoRound team for the quantization toolkit
Nebius for the H100 compute

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
