- GLM-5.1-478B-NVFP4
- At a glance
- Which variant should I pick?
- Measured performance
- Quick start (4× 96 GB Blackwell)
- Reference rig
- Exact versions (pinned from the running venv)
- Required sglang patch (SM120 only)
- Launch
- IndexCache (enabled by default)
- Key flag decisions (why these specific values)
- MTP / NEXTN speculative decode (optional)
- Sampling recommendations
- Lineage & license
- License & citation
- Sponsors
- At a glance
Support this work → · X · GitHub · REAP paper · Cerebras REAP
GLM-5.1-478B-NVFP4
NVFP4 quantization of zai-org/GLM-5.1.
At a glance
| Base model | zai-org/GLM-5.1 |
| Format | NVFP4 |
| Total params | 478B |
| Active / token | 42B |
| Experts / layer | 160 |
| Layers | 78 |
| Hidden size | 6144 |
| Context | 202,752 |
| On-disk size | 306 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
GLM-5.1-444B |
BF16 | link |
GLM-5.1-444B-GGUF |
GGUF | link |
GLM-5.1-478B-NVFP4 (this) |
NVFP4 | link |
GLM-5.1-555B |
BF16 | link |
GLM-5.1-555B-GGUF |
GGUF | link |
GLM-5.1-555B-NVFP4 |
NVFP4 | link |
GLM-5.1-555B-W4A16 |
W4A16 | link |
NVFP4 quantization of zai-org/GLM-5.1, further REAP-pruned from 256 → 160 routed experts per MoE layer. Tuned to run at 200,000-token context on a 4× 96 GB Blackwell workstation.
| Total params | 478.4B |
| Activated / token | 42.7B |
| Routed experts / MoE layer | 160 (was 256 in base) |
| Active experts / token | 8 routed + 1 shared |
| Layers | 78 (3 dense + 75 MoE) + 1 MTP / NEXTN |
| Hidden size | 6144 |
| Attention | MLA-DSA, 64 heads |
| Max position | 202,752 |
| Quantization | NVFP4, group_size=16 (modelopt_fp4) |
| On-disk size | 285 GB (85 shards) |
| License | MIT (inherited from GLM-5.1) |
Measured performance
Single-user, batch size 1, decode tok/s at various prompt lengths on our reference rig (baseline dense-MLA path; see IndexCache below for substantially faster long-context numbers):
| Context | tok/s (baseline) | tok/s (+ IndexCache) |
|---|---|---|
| 256 | 46.5 | 46.4 |
| 4 k | 41.8 | 41.7 |
| 16 k | 26.4 – 38.6 | 40.9 |
| 32 k | — | 36.5 |
| 64 k | — | 29.5 |
| 128 k | — | 21.3 |
| 150 k – 165 k | 22.4 | 18.4 |
Under live mixed traffic (1,495 decode samples, baseline config):
| Context range | p50 tok/s |
|---|---|
| < 1 k | 42.7 |
| 1 – 8 k | 44.3 |
| 8 – 32 k | 36.3 |
| 32 – 100 k | 27.7 |
Per-rank VRAM at 202,752 ctx: weights 77.2 GB, KV pool 11.3 GB (270 k tokens), CUDA graphs 0.3 GB, ~5 GB free.
Quick start (4× 96 GB Blackwell)
# 1. Download the weights
hf download 0xSero/GLM-5.1-478B-NVFP4 --local-dir ./GLM-5.1-478B-A42B-REAP-NVFP4
# 2. Install the pinned inference stack (see "Exact versions" below)
python3.12 -m venv venv && source venv/bin/activate
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3
# 3. Apply the required NSA-disable patch (see "Required sglang patch" below)
# 4. Launch
./launch.sh # see full script below
Reference rig
- 4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96 GB, compute capability 12.0 (sm_120)
- NVIDIA driver 580.126.18, CUDA 12.9 userspace
- Ubuntu / Pop!_OS 22.04, Python 3.12
This is what the tuning targets. The same recipe works on 4× B200 (sm_100), 8× Hopper (sm_90) with fewer or more aggressive quantization, and other Blackwell configurations — see the hardware compatibility matrix at the bottom of this page.
Exact versions (pinned from the running venv)
Everything below is reproducible from:
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3
The resolver pulls in the whole stack at these versions:
sglang 0.5.10.post1
torch 2.9.1+cu129
triton 3.5.1
transformers 5.3.0
tokenizers 0.22.2
safetensors 0.8.0rc0
numpy 2.4.4
flashinfer-python 0.6.7.post3
flashinfer-cubin 0.6.7.post3
nvidia-cutlass-dsl 4.5.0.dev0
nvidia-cublas-cu12 12.9.1.4
nvidia-cudnn-cu12 9.10.2.21
nvidia-nccl-cu12 2.27.5
nvidia-cuda-nvrtc-cu12 12.9.86
nvidia-cuda-runtime-cu12 12.9.79
nvidia-nvjitlink-cu12 12.9.86
nvidia-nvshmem-cu12 3.3.20
Verify:
python -c "import sglang, torch, flashinfer; print(sglang.__version__, torch.__version__, flashinfer.__version__)"
# 0.5.10.post1 2.9.1+cu129 0.6.7.post3
Required sglang patch (SM120 only)
GLM-5.1's config advertises GlmMoeDsaForCausalLM, which sglang routes through DeepSeek Sparse Attention by default. Every NSA backend in sglang 0.5.10.post1 is built for sm_90a / sm_100f only and fails at launch on sm_120. Route GLM-5 through the stable dense-MLA path by excluding it from the NSA architecture list:
Edit <venv>/lib/python3.12/site-packages/sglang/srt/configs/model_config.py, function is_deepseek_nsa():
def is_deepseek_nsa(config) -> bool:
architectures = (
config.get("architectures") if isinstance(config, dict)
else getattr(config, "architectures", None)
)
index_topk = (
config.get("index_topk") if isinstance(config, dict)
else getattr(config, "index_topk", None)
)
# Keep GLM-5 on dense MLA until sm_120 NSA kernels ship.
return (
architectures is not None
and architectures[0] in [
"DeepseekV3ForCausalLM",
"DeepseekV32ForCausalLM",
"DeepseekV3ForCausalLMNextN",
"MistralLarge3ForCausalLM",
"PixtralForConditionalGeneration",
]
and index_topk is not None
)
(Only the architectures list changes — GlmMoeDsaForCausalLM is removed.)
After the patch, sglang auto-picks triton for attention on sm_120. Confirm in the startup log: attention_backend='triton'.
On sm_90 (Hopper) and sm_100 (B200) this patch is not needed — the native NSA kernels work. Skip to the launch section.
Launch
#!/usr/bin/env bash
set -euo pipefail
MODEL=/path/to/GLM-5.1-478B-A42B-REAP-NVFP4
VENV=/path/to/sglang-venv
# Route NCCL over PCIe (no NVLink on workstation Blackwell)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3 # four Blackwell GPUs
# DeepGEMM has no sm_120 kernels; keep it off.
export SGLANG_ENABLE_JIT_DEEPGEMM=0
export SGLANG_ENABLE_DEEP_GEMM=0
export SGLANG_DISABLE_DEEP_GEMM=1
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=True
export FLASHINFER_DISABLE_VERSION_CHECK=1
# NCCL tuning for PCIe-only (no IB, no NVLink)
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_P2P_LEVEL=PIX
export NCCL_SHM_DISABLE=0
export NCCL_BUFFSIZE=4194304
export NCCL_MIN_NCHANNELS=8
export NCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export NCCL_CUMEM_HOST_ENABLE=0
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1
exec "$VENV/bin/python" -m sglang.launch_server \
--model-path "$MODEL" \
--served-model-name GLM-5.1-478B-A42B-REAP-NVFP4 \
--host 0.0.0.0 --port 8000 \
--trust-remote-code \
--tensor-parallel-size 4 \
--pipeline-parallel-size 1 \
--context-length 202752 \
--max-running-requests 1 \
--mem-fraction-static 0.94 \
--chunked-prefill-size 4096 \
--page-size 128 \
--quantization modelopt_fp4 \
--kv-cache-dtype fp8_e4m3 \
--triton-attention-num-kv-splits 64 \
--moe-runner-backend cutlass \
--fp4-gemm-backend flashinfer_cudnn \
--cuda-graph-max-bs 4 \
--pre-warm-nccl \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--chat-template "$MODEL/chat_template.jinja" \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
--watchdog-timeout 1800 \
--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'
The final --json-model-override-args flag enables IndexCache, which reuses the DSA indexer output across layers according to a 78-character F/S pattern. This is now the default shipped recipe — for details, rationale, and the pattern breakdown see the IndexCache section below. If you want to disable it, remove that one flag.
On sm_90 / sm_100 you will want --attention-backend flashinfer and --fp4-gemm-backend b12x instead — see the 555B sibling card for that recipe.
IndexCache (enabled by default)
The launch script above already includes IndexCache via --json-model-override-args. This section documents what the flag does and how it was measured.
GLM-5.1's DeepSeek Sparse Attention (DSA) recomputes the top-k sparse index at every layer, on every prefill chunk and every decode step. Profiling in sglang#21663 measured the indexer alone at ~81% of prefill wall time at 200k context. IndexCache reuses the top-k indices across layers according to a 78-character F/S pattern — one character per transformer layer: F = that layer runs its own indexer, S = that layer reuses the nearest upstream F layer's indices. The pattern used here was greedy-searched upstream and is the one shipped in the SGLang GLM-5.1 cookbook.
With 23 F and 55 S layers, only ~30% of layers run the indexer — the other 70% of indexer cost is skipped. No extra VRAM (indices are transient, live on the MLA KV pool only for the duration of one step).
The flag in the launch script above:
--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'
To fall back to the pre-IndexCache baseline (e.g., for A/B testing), remove this single flag.
Measured on the reference rig (max_tokens=100, bs=1, streaming, 1 warmup + 2 measured reps):
| Prompt tokens | TTFT (warm) | Cold-prefill tok/s | Decode tok/s |
|---|---|---|---|
| 16,025 | 0.23 s | — | 40.9 |
| 32,025 | 0.39 s | — | 36.5 |
| 64,025 | 0.71 s | — | 29.5 |
| 128,025 | 1.34 s | 4,629 | 21.3 |
| 165,025 | 1.76 s | — | 18.4 |
Per-layer F/S pattern for reference (layer 0 leftmost):
layer 0000000001111111111222222222233333333334444444444555555555566666666667777777777
idx 0123456789012345678901234567890123456789012345678901234567890123456789012345678
mask FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS
Coherence spot-checked with needle-in-haystack at 32k and 100k, 11-turn tool-use chat, and GSM-style arithmetic — no quality regressions observed relative to baseline.
Mutual exclusion: IndexCache and MTP are not both enabled in this recipe. MTP requires --page-size 64 and caps context at 65,536 (bf16 KV); IndexCache keeps --page-size 128, fp8_e4m3 KV, and the full 202,752-token window. For the workstation 200k-ctx target, IndexCache is the right pick.
Key flag decisions (why these specific values)
These were measured on the reference rig; defaults were not.
--triton-attention-num-kv-splits 64 — biggest single win. Default is 8. At bs=1 decode on sm_120, raising kv-splits gave:
| Context | splits=8 | splits=64 |
|---|---|---|
| 4 k | 39.7 | 41.8 |
| 16 k | 26.4 | 38.6 |
| 150 k | 5.2 | 22.4 |
Coherence verified across arithmetic, factual recall, needle-in-haystack @ 32 k and @ 100 k, and 11-turn chat.
--mem-fraction-static 0.94 — decode is kernel-bound at bs=1, not memory-bound. 0.94 vs 0.97 gives identical tok/s and ~5 GB/rank of headroom for graph recapture and prefill scratch.
--kv-cache-dtype fp8_e4m3 — halves KV memory vs bf16. Required to fit 202 k context in budget.
--attention-backend is intentionally omitted — sglang auto-selects triton on sm_120 for this architecture after the NSA patch. Flashinfer attention is skipped because it requires PCIe P2P atomics not available on the workstation board.
--page-size 128 — the non-MTP default. Drop to 64 only if enabling speculative decode.
MTP / NEXTN speculative decode (optional)
The checkpoint includes an MTP head for layer 78, stitched from the original 256-expert source using the layer-77 REAP keep-map as a proxy.
| Without MTP (this page's default) | With MTP | |
|---|---|---|
| Decode tok/s (short) | ~46 | ~90 (1.93×) |
| Max context | 202,752 | ~65,536 |
| KV dtype | fp8_e4m3 | bf16 (required by NEXTN) |
| Page size | 128 | 64 (required by NEXTN) |
MTP is opt-in because the workstation target is long context, not peak short-prompt throughput. Enable with:
# Replace three lines in the launch script:
--context-length 65536 \
--page-size 64 \
--kv-cache-dtype auto \
# and add:
--speculative-algorithm NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-attention-mode decode \
--speculative-moe-runner-backend cutlass
Also drop --mem-fraction-static to 0.88 — the draft worker adds ~5 GB/rank.
Sampling recommendations
General chat / reasoning:
temperature=0.5 top_p=0.95 frequency_penalty=0.3 repetition_penalty=1.05
Strict-answer (MCQ, tool-use benchmarks):
temperature=0.0 repetition_penalty=1.05
Keep repetition_penalty=1.05 everywhere. Pure greedy with no penalty can loop on pathological low-entropy prompts (e.g., repeated filler tokens).
Lineage & license
zai-org/GLM-5.1 (official, 744B bf16, 256 experts, MIT)
│
├── community NVFP4 quantization via NVIDIA Model Optimizer
│ (e.g. lukealonso/GLM-5.1-NVFP4, ~434 GB, 256 experts)
│
├── Local REAP pass 1: 256 → 192 experts
│ 0xSero/GLM-5.1-555B-NVFP4
│
└── Local REAP pass 2: 192 → 160 experts
0xSero/GLM-5.1-478B-NVFP4 ← this model
Both REAP passes were done locally using pooled token-weighted observations from:
0xSero/glm51-layerwise-reap-observations— per-block metrics, full layer coverage.0xSero/glm-5-special— consolidated observer state, ~85 M tokens over ~7.6 k samples.
Prune scripts and MTP-stitch script are in the repo tree.
License: MIT, inherited from zai-org/GLM-5.1.
Citation (REAP method):
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025},
eprint = {2510.13999},
archivePrefix = {arXiv},
}
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
- Downloads last month
- 3,501
Model tree for 0xSero/GLM-5.1-478B-NVFP4
Base model
zai-org/GLM-5.1