GLM-5.1-555B-NVFP4
NVFP4 quantization of 0xSero/GLM-5.1-555B.
At a glance
| Property | Value |
|---|---|
| Base model | 0xSero/GLM-5.1-555B |
| Format | NVFP4 |
| Total params | 555B |
| Active / token | 14B |
| Experts / layer | 192 |
| Layers | 78 |
| Hidden size | 6144 |
| Context | 202,752 |
| On-disk size | 320 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| GLM-5.1-444B | BF16 | link |
| GLM-5.1-444B-GGUF | GGUF | link |
| GLM-5.1-478B-NVFP4 | NVFP4 | link |
| GLM-5.1-555B | BF16 | link |
| GLM-5.1-555B-GGUF | GGUF | link |
| GLM-5.1-555B-NVFP4 (this) | NVFP4 | link |
| GLM-5.1-555B-W4A16 | W4A16 | link |
NVFP4 quantization of 0xSero/GLM-5.1-555B — a REAP-pruned variant of GLM-5.1 (192 experts per MoE layer, down from 256).
Target hardware: 8× RTX PRO 6000 Blackwell 96GB (sm120) via sglang. See deploy recipe below.
Model details
| Property | Value |
|---|---|
| Architecture | GlmMoeDsaForCausalLM (DeepSeek Sparse Attention + MLA) |
| Base precision | BF16 (source: 1.1 TB) |
| Quantization | NVFP4 (4-bit weights + FP8 per-group scales, group=16) |
| Output size | 320 GB (~3.4× compression) |
| Experts per MoE layer | 192 (REAP-pruned from 256) |
| Layers | 78 |
| Format | nvfp4-pack-quantized via compressed-tensors |
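To make the Quantization row concrete, here is a minimal round-to-nearest sketch of group-wise FP4 storage. It assumes the E2M1 value grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) and one scale per group of 16 weights. The scale is kept as a plain Python float rather than an FP8 value, and AutoRound's tuned rounding is not modelled, so this illustrates the layout only, not the procedure that produced this checkpoint. All function names are hypothetical.

```python
# Illustrative only: round-to-nearest FP4 (E2M1) with one scale per group of 16.
# Real NVFP4 stores the group scale in FP8; that rounding step is omitted here.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16


def quantize_group(weights):
    """Return (scale, codes) for one group; codes are signed grid values."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude in the group onto the largest grid value (6.0).
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        codes.append(magnitude if w >= 0 else -magnitude)
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate weights from a scale and its grid codes."""
    return [scale * c for c in codes]


def fake_quantize(weights):
    """Quantize then dequantize a flat list of weights, group by group."""
    out = []
    for start in range(0, len(weights), GROUP_SIZE):
        scale, codes = quantize_group(weights[start:start + GROUP_SIZE])
        out.extend(dequantize_group(scale, codes))
    return out
```

Each weight ends up as one of 15 distinct signed values times its group scale, which is why outliers inside a group cost precision for the other 15 weights sharing that scale.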
Layers kept in BF16 (per AutoRound ignore pattern)
- lm_head
- model.layers.[0-2].mlp.{gate,up,down}_proj (first 3 layers' experts; most sensitive)
- model.layers.[0-77].self_attn.indexer.weights_proj (DSA indexer; quant-sensitive)
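As a rough sanity check on the 320 GB figure, the sketch below derives the bits per weight implied by the format (a 4-bit value plus one 8-bit scale shared by 16 weights) and applies it to 555B parameters. It assumes every parameter is quantized and uses decimal gigabytes, so it ignores the BF16 layers listed above and any file metadata.

```python
WEIGHT_BITS = 4          # one FP4 value per weight
SCALE_BITS = 8           # one FP8 scale per group
GROUP_SIZE = 16          # weights sharing a scale
TOTAL_PARAMS = 555e9     # from the "At a glance" table
SOURCE_SIZE_GB = 1100.0  # BF16 source, 1.1 TB
REPORTED_SIZE_GB = 320.0

# Effective storage cost per weight once the shared scale is amortized.
bits_per_weight = WEIGHT_BITS + SCALE_BITS / GROUP_SIZE

# Size if every parameter were stored in this format (assumption).
estimated_size_gb = TOTAL_PARAMS * bits_per_weight / 8 / 1e9

# Best-case ratio against 16-bit weights vs. the ratio reported in this card.
ideal_ratio = 16 / bits_per_weight
reported_ratio = SOURCE_SIZE_GB / REPORTED_SIZE_GB

print(f"bits per weight:  {bits_per_weight}")
print(f"estimated size:   {estimated_size_gb:.1f} GB")
print(f"ideal ratio:      {ideal_ratio:.2f}x")
print(f"reported ratio:   {reported_ratio:.2f}x")
```

The estimate comes out slightly below the reported on-disk size, and the reported ratio slightly below the ideal one, which is the direction one would expect when some layers stay in BF16.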
Deploy on sm120 (RTX PRO 6000 Blackwell)
This recipe uses the pre-built voipmonitor/sglang:cu130 Docker image, which has all sm120 patches applied.
docker run --gpus all --ipc=host --shm-size=8g --network=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v jit-cache:/cache/jit \
-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
-e SGLANG_ENABLE_DEEP_GEMM=0 \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e NCCL_IB_DISABLE=1 \
-e NCCL_P2P_LEVEL=SYS \
-e NCCL_MIN_NCHANNELS=8 \
voipmonitor/sglang:cu130 \
python3 -m sglang.launch_server \
--model-path 0xSero/GLM-5.1-555B-NVFP4 \
--served-model-name glm-5.1-reap \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--tensor-parallel-size 8 \
--quantization compressed-tensors \
--kv-cache-dtype bf16 \
--trust-remote-code \
--mem-fraction-static 0.85 \
--chunked-prefill-size 16384 \
--attention-backend flashinfer \
--fp4-gemm-backend b12x \
--moe-runner-backend b12x \
--host 0.0.0.0 --port 5000
Critical flags:
- --kv-cache-dtype bf16: mandatory; fp8_e4m3 produces garbled output on sm120
- --attention-backend flashinfer: sm120-compatible (trtllm_mha and flashmla are not)
- SGLANG_ENABLE_DEEP_GEMM=0: DeepGEMM needs WGMMA/TCGEN05 instructions, which are absent on sm120
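Once the server from the launch command is listening, it can be queried over the OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and takes the port (5000) and model name (glm-5.1-reap) from the flags above. The localhost address, the timeout, and the helper names are assumptions for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumes the client runs on the server host
MODEL_NAME = "glm-5.1-reap"         # matches --served-model-name


def build_chat_request(prompt):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout_s=600):
    """Send the prompt and return the assistant message text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling chat("What is the capital of France?") sends a single-turn request; it raises a URLError if the server is not reachable.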
Memory fit: the 320 GB of weights plus KV cache fit on 8× 96GB (768 GB total VRAM). Minimum viable: 6× RTX PRO 6000 with --tp 2 --pp 3.
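The memory-fit claim can be checked with a back-of-the-envelope budget per GPU. The sketch assumes the weights shard evenly across all GPUs, that --mem-fraction-static 0.85 bounds the pool shared by weights and KV cache, and decimal gigabytes throughout. Activation memory, CUDA graphs, and framework overhead are not modelled, so the real KV-cache headroom will be somewhat smaller.

```python
def per_gpu_budget(num_gpus, gpu_mem_gb=96.0, weights_gb=320.0,
                   mem_fraction_static=0.85):
    """Estimate per-GPU memory split between weights and KV cache (in GB)."""
    static_pool_gb = gpu_mem_gb * mem_fraction_static  # weights + KV cache pool
    weights_share_gb = weights_gb / num_gpus           # even sharding (assumption)
    return {
        "static_pool_gb": static_pool_gb,
        "weights_gb": weights_share_gb,
        "kv_cache_gb": static_pool_gb - weights_share_gb,
    }


for gpus in (8, 6):
    budget = per_gpu_budget(gpus)
    print(f"{gpus} GPUs: "
          f"{budget['weights_gb']:.1f} GB weights, "
          f"{budget['kv_cache_gb']:.1f} GB left for KV cache per GPU")
```

Under these assumptions the 8-GPU layout leaves roughly as much room for KV cache as for weights on each card, while the 6-GPU layout still fits but with noticeably less headroom.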
Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.
Quantization method
Produced with AutoRound 0.12.2 in layerwise mode on 8× H100 80GB.
Settings
| Setting | Value | Notes |
|---|---|---|
| --scheme | NVFP4 | 4-bit weights + FP8 per-group scales |
| --iters | 50 | Reduced from the default 200 (loss trajectory confirms iters 100+ produce negligible improvement) |
| --nsamples | 512 | Calibration samples |
| --seqlen | 2048 | Default (seqlen=4096 tried; most samples too short after tokenization) |
| --batch_size | 8 | Default |
| --low_gpu_mem_usage | true | Required for 1.1TB source on 640GB VRAM |
| --format | auto_round:llm_compressor | Produces compressed-tensors (sglang/vLLM compatible) |
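For readers who want to reproduce a similar run, the settings in the table can be assembled into a command line. The sketch below only builds and prints the argument list and does not launch anything. The auto-round executable name, the --model and --output_dir flags, the output path, and the treatment of --low_gpu_mem_usage as a value-less switch are assumptions not stated in this card. The layerwise mode and the calibration dataset arguments are deliberately left out because the card does not give their exact form.

```python
import shlex

# Values copied from the settings table.
SETTINGS = {
    "--scheme": "NVFP4",
    "--iters": "50",
    "--nsamples": "512",
    "--seqlen": "2048",
    "--batch_size": "8",
    "--format": "auto_round:llm_compressor",
}
# Assumed to be a boolean switch rather than a flag that takes "true".
BOOLEAN_FLAGS = ["--low_gpu_mem_usage"]


def build_command(model, output_dir):
    """Return an argv list for a hypothetical auto-round invocation."""
    argv = ["auto-round", "--model", model, "--output_dir", output_dir]
    for flag, value in SETTINGS.items():
        argv += [flag, value]
    argv += BOOLEAN_FLAGS
    return argv


command = build_command("0xSero/GLM-5.1-555B", "./GLM-5.1-555B-NVFP4")
print(shlex.join(command))
```

Keeping the settings in one dictionary makes it easy to vary a single value, such as --iters, between runs while the rest of the command stays fixed.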
Calibration dataset
Custom mix targeting realistic use cases (1,190 samples total → 505 valid after packing):
| Source | Samples | Content |
|---|---|---|
| 0xSero/structured-outputs-calibration-v1 | 430 | JSON schemas, sharegpt-JSON, Mermaid diagrams |
| 0xSero/reap-calibration-data-v1 | 560 | 100 long_context + 120 function_calling + 100 agentic + 60 coding + 40 cuda + 30 reasoning + 30 math + 40 terminal + 40 cybersecurity |
| NeelNanda/pile-10k | 200 | General web text (distribution anchor; provides long samples to compensate for short custom samples) |
Multi-dataset loading used AutoRound's :concat=true option (patched during build; upstreamable) to pack short instruction samples into full-seqlen sequences.
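The packing idea can be sketched as follows: tokenized samples are concatenated into one stream and cut into fixed-length sequences, so many short samples yield fewer, full-length ones. This is a generic illustration under stated assumptions, not AutoRound's actual :concat implementation; the function name, the optional separator token, and the choice to drop the trailing partial chunk are all hypothetical.

```python
def pack_samples(tokenized_samples, seqlen, separator_id=None):
    """Concatenate token lists in order and split into seqlen-long sequences.

    The final chunk is dropped if it is shorter than seqlen, so every
    returned sequence is exactly seqlen tokens long.
    """
    stream = []
    for tokens in tokenized_samples:
        stream.extend(tokens)
        if separator_id is not None:
            stream.append(separator_id)  # mark the boundary between samples
    full_chunks = len(stream) // seqlen
    return [stream[i * seqlen:(i + 1) * seqlen] for i in range(full_chunks)]


# Three short samples packed into sequences of length 4.
short_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_samples(short_samples, seqlen=4)
print(len(short_samples), "samples ->", len(packed), "packed sequences")
```

With seqlen=2048, the same mechanism explains why the number of valid calibration sequences is smaller than the number of raw samples.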
Wall time
- Model load + offload: ~55 min
- Calibration + quant: 6h 34m
- Save: 7 min
- Total: ~7.5 hours on 8× H100 80GB (brev compute)
Quality characteristics
Layer-level loss (iter 0 → iter 49) trajectory:
| Layer depth | iter 0 loss | iter 49 loss | Behavior |
|---|---|---|---|
| 0-2 | 0 | 0 | Attention-only; MLP skipped |
| 3-9 | 1e-6 to 1e-5 | 1e-6 to 1e-5 | Iterative tuning minimal effect |
| 10-30 | 1e-4 to 1e-2 | 30-50% reduction | Sign-tuning active |
| 31-55 | 1e-2 to 1e-1 | 20-30% reduction | Accumulating |
| 56-77 | 1e-1 to 8e-1 | 10-20% reduction | Deep-layer drift |
Expected quality impact: benchmarking on sm120 is recommended to measure the MMLU/GSM8K/IFEval gap against the BF16 source. The loss magnitudes alone suggest non-trivial degradation in the deep layers; whether this matters in practice depends on the task.
Provenance
- Source model: 0xSero/GLM-5.1-555B (BF16, 1.1 TB, 26 safetensors)
- Quantization compute: Nebius H100×8 via brev
- Quant tool: Intel AutoRound 0.12.2
- Deploy tool: voipmonitor/sglang:cu130
License
MIT (inherits from base model).
Acknowledgements
- Cerebras REAP team for the pruning recipe
- voipmonitor for the sm120 sglang deployment guide
- Intel AutoRound team for the quantization toolkit
- Nebius for the H100 compute
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.