Support this work → · X · GitHub · REAP paper · Cerebras REAP

GLM-5.1-555B-NVFP4

NVFP4 quantization of 0xSero/GLM-5.1-555B.

At a glance

Base model 0xSero/GLM-5.1-555B
Format NVFP4
Total params 555B
Active / token 14B
Experts / layer 192
Layers 78
Hidden size 6144
Context 202,752
On-disk size 320 GB

Which variant should I pick?

Variant Format Link
GLM-5.1-444B BF16 link
GLM-5.1-444B-GGUF GGUF link
GLM-5.1-478B-NVFP4 NVFP4 link
GLM-5.1-555B BF16 link
GLM-5.1-555B-GGUF GGUF link
GLM-5.1-555B-NVFP4 (this) NVFP4 link
GLM-5.1-555B-W4A16 W4A16 link

NVFP4 quantization of 0xSero/GLM-5.1-555B — a REAP-pruned variant of GLM-5.1 (192 experts per MoE layer, down from 256).

Target hardware: 8× RTX PRO 6000 Blackwell 96GB (sm120) via sglang. See deploy recipe below.

Model details

Property Value
Architecture GlmMoeDsaForCausalLM (DeepSeek Sparse Attention + MLA)
Base precision BF16 (source: 1.1 TB)
Quantization NVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size 320 GB (~3.4× compression)
Experts per MoE layer 192 (REAP-pruned from 256)
Layers 78
Format nvfp4-pack-quantized via compressed-tensors

Layers kept in BF16 (per AutoRound ignore pattern)

  • lm_head
  • model.layers.[0-2].mlp.{gate,up,down}_proj (first 3 layers' experts — most sensitive)
  • model.layers.[0-77].self_attn.indexer.weights_proj (DSA indexer, quant-sensitive)

Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.

docker run --gpus all --ipc=host --shm-size=8g --network=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v jit-cache:/cache/jit \
  -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
  -e SGLANG_ENABLE_DEEP_GEMM=0 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_MIN_NCHANNELS=8 \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path 0xSero/GLM-5.1-555B-NVFP4 \
    --served-model-name glm-5.1-reap \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --tensor-parallel-size 8 \
    --quantization compressed-tensors \
    --kv-cache-dtype bf16 \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --chunked-prefill-size 16384 \
    --attention-backend flashinfer \
    --fp4-gemm-backend b12x \
    --moe-runner-backend b12x \
    --host 0.0.0.0 --port 5000

Critical flags:

  • --kv-cache-dtype bf16 — mandatory; fp8_e4m3 produces garbled output on sm120
  • --attention-backend flashinfer — sm120-compatible (trtllm_mha, flashmla are not)
  • SGLANG_ENABLE_DEEP_GEMM=0 — DeepGEMM needs WGMMA/TCGEN05 absent on sm120

Memory fit: 320 GB weights + KV cache fits on 8× 96GB (≈768 GB total VRAM). Minimum viable: 6× RTX PRO 6000 with --tp 2 --pp 3.

Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.

Quantization method

Produced via AutoRound 0.12.2 layerwise mode on 8× H100 80GB.

Settings

Setting Value Notes
--scheme NVFP4 4-bit weights + FP8 per-group scales
--iters 50 Halved from default 200 (loss trajectory confirms iters 100+ produce negligible improvement)
--nsamples 512 Calibration samples
--seqlen 2048 Default (seqlen=4096 tried; most samples too short after tokenization)
--batch_size 8 Default
--low_gpu_mem_usage true Required for 1.1TB source on 640GB VRAM
--format auto_round:llm_compressor Produces compressed-tensors (sglang/vLLM compatible)

Calibration dataset

Custom mix targeting realistic use cases (1,190 samples total → 505 valid after packing):

Source Samples Content
0xSero/structured-outputs-calibration-v1 430 JSON schemas, sharegpt-JSON, Mermaid diagrams
0xSero/reap-calibration-data-v1 560 100 long_context + 120 function_calling + 100 agentic + 60 coding + 40 cuda + 30 reasoning + 30 math + 40 terminal + 40 cybersecurity
NeelNanda/pile-10k 200 General web text (distribution anchor; provides long samples to compensate for short custom samples)

Multi-dataset loading used AutoRound's :concat=true option (patched during build; upstreamable) to pack short instruction samples into full-seqlen sequences.

Wall time

  • Model load + offload: ~55 min
  • Calibration + quant: 6h 34m
  • Save: 7 min
  • Total: ~7.5 hours on 8× H100 80GB (brev compute)

Quality characteristics

Layer-level loss (iter 0 → iter 49) trajectory:

Layer depth iter 0 loss iter 49 loss Behavior
0-2 0 0 Attention-only; MLP skipped
3-9 1e-6 to 1e-5 1e-6 to 1e-5 Iterative tuning minimal effect
10-30 1e-4 to 1e-2 30-50% reduction Sign-tuning active
31-55 1e-2 to 1e-1 20-30% reduction Accumulating
56-77 1e-1 to 8e-1 10-20% reduction Deep-layer drift

Expected quality impact: benchmarks on sm120 recommended to measure MMLU/GSM8K/IFEval gap vs BF16 source. Loss magnitudes alone suggest non-trivial degradation at deep layers; whether this matters in practice depends on task.

Provenance

License

MIT (inherits from base model).

Acknowledgements

  • Cerebras REAP team for the pruning recipe
  • voipmonitor for the sm120 sglang deployment guide
  • Intel AutoRound team for the quantization toolkit
  • Nebius for the H100 compute

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Downloads last month
260
Safetensors
Model size
3B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/GLM-5.1-555B-NVFP4

Base model

zai-org/GLM-5.1
Quantized
(2)
this model

Collection including 0xSero/GLM-5.1-555B-NVFP4

Paper for 0xSero/GLM-5.1-555B-NVFP4