GLM-5.2-NVFP4

NVFP4-quantized checkpoint of zai-org/GLM-5.2 (753B-param MoE with IndexShare sparse attention). Shrinks the BF16 checkpoint from ~1.37 TB to ~459 GB (≈3× smaller) so it fits on an 8-GPU Blackwell node (e.g. 8×96 GB) with room for long-context KV cache.

This is a community-built quantization of GLM-5.2 to NVIDIA's NVFP4 format (E2M1 + FP8 E4M3 scales, 16-element blocks), built using a per-shard streaming recipe derived from NVIDIA ModelOpt's NVFP4_EXPERTS_ONLY_CFG and TensorRT-LLM's DeepSeek-V3.2 precision strategy.

Note on footprint: this is a 753B-parameter model. Even at NVFP4 it is ~459 GB on disk and in VRAM (because attention, norms, embeddings, the router, MTP auxiliary heads, the indexer, and the first/last layers are deliberately kept in BF16/FP32 — see the precision table). It does not fit on a single GPU. Plan for a multi-GPU node with ≥ 6 GPUs for weights alone, and 8 GPUs in practice to leave headroom for KV cache at long context.

Format

Component Precision Notes
Embeddings, lm_head BF16 NVIDIA excludes
All *norm* / *layernorm* / *k_norm* / *q_norm* BF16 All norms stay BF16
Attention block (*.self_attn.*) BF16 Per DeepSeek-R1 recipe
Indexer weights_proj FP32 Per DeepSeek-V3.2 DSA recipe
Indexer low-rank (q_a, k_a) BF16 Per DeepSeek-V3.2 DSA recipe
Router / gate BF16 RouterGEMM uses BF16 inputs/weights
MTP auxiliary heads (eh_proj, enorm, hnorm, shared_head) BF16 GLM-5.2 IndexShare MTP module (in model.layers.78)
First 2 + last 2 layers (model.layers.{0,1,76,77}) BF16 Per DeepSeek-R1 boundary rule; layer 78+1 also captures the MTP head
Sparse experts (*.experts.{gate,up,down}_proj) NVFP4 Block-scaled FP4 — the bulk of the weights
Shared experts (*.shared_experts.*) BF16 Kept BF16 in this build

Everything else not listed: NVFP4 block-scaled FP4.

Architecture

  • Base model: GLM-5.2 (753B params, MoE, 78 transformer layers + 1 MTP layer at index 78, IndexShare sparse attention)
  • Quantization: NVFP4 (E2M1 + FP8 E4M3, 16-element block scales)
  • Block size: 16
  • Quant method: modelopt
  • Calibration: static per-block percentile-0.9999 scales (no forward-pass calibration — see Limitations)
  • On-disk size: ~459 GB (NVFP4 packed weights + FP8 scales + BF16/FP32 kept layers)
  • Compression: ~1.37 TB (BF16) → ~459 GB ≈ 3.0×

Hardware

  • Required: NVIDIA Blackwell GPUs (B200, GB200, or RTX PRO 6000 Blackwell). NVFP4 tensor cores are Blackwell-only.
  • VRAM for weights: ~459 GB → minimum 6× 96 GB GPUs just to hold weights; 8 GPUs recommended for KV cache headroom.
  • Tested config: single node, 8× RTX PRO 6000 Blackwell (96 GB each), tensor-parallel 8.
  • Does NOT fit on a single GPU.
  • Inference: TensorRT-LLM, vLLM, or SGLang with modelopt NVFP4 support.

Loading

vLLM (v0.23.0+)

from vllm import LLM, SamplingParams

llm = LLM(
    model="Lorbus/GLM-5.2-NVFP4",
    quantization="modelopt",
    kv_cache_dtype="fp8",
    tensor_parallel_size=8,   # needs the full 8-GPU node
    trust_remote_code=True,
    max_model_len=1_000_000,
)

SGLang (v0.5.13.post1+)

python3 -m sglang.launch_server \
    --model-path Lorbus/GLM-5.2-NVFP4 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8 \
    --tp 8 \
    --trust-remote-code \
    --port 8888

Transformers (v0.5.12+) / KTransformers (v0.5.12+)

Both frameworks now natively load modelopt NVFP4 checkpoints with trust_remote_code=True. See framework docs for details.

Methodology

This quantization was produced with a per-shard streaming pipeline that downloads GLM-5.2 shards one at a time from HuggingFace Hub, quantizes each tensor in isolation, and writes the result back. We do not load the full BF16 model into VRAM (1.37 TB BF16 wouldn't fit on a 768 GB GPU box), and we do not run forward-pass calibration for the same reason.

Quality techniques applied (vs NVIDIA's full ModelOpt recipe):

Technique NVIDIA full This build
E2M1 + FP8 block-scaled NVFP4 yes yes
Block size 16 yes yes
Mixed-precision routing (BF16 excludes) yes yes
FP32 indexer weights_proj yes yes
First/last N layers BF16 yes yes
Percentile (outlier-robust) scales yes yes
fp8_scale_sweep (search 128 FP8 scales) yes no (~0.5% est. loss)
local_hessian calibration yes no (~0.5% est. loss)
moe_calib_experts_ratio (all-expert forward) yes no (~1–2% est. loss for MoE)
Calibration forward passes on real data yes no (~1–3% est. loss)

Expected quality: estimated 92–96% of NVIDIA's full ModelOpt NVFP4 recipe. This is an estimate, not a measurement — see Limitations.

Limitations

  • No benchmark evaluations have been run. The 92–96% figure is an engineering estimate based on which calibration steps were skipped, not a measured score. Verify quality on your own downstream task before relying on it.
  • We cannot reproduce NVIDIA's full PTQ pipeline because GLM-5.2 BF16 (1.37 TB) does not fit in the 768 GB VRAM of the build box, and local_hessian / forward-pass calibration require loading the full model.
  • The IndexShare sparse-attention design is GLM-5.2-specific; to our knowledge this is the first published quantization applying the DSA-style precision recipe to it. The indexer handling is by name-pattern, not a verified arch-level analysis.
  • NVFP4 checkpoint support in serving frameworks is still marked experimental.

Reproducing

Build infrastructure:

  • 8× NVIDIA RTX PRO 6000 Blackwell (96 GB each), PCIe-only (no NVLink)
  • Streaming per-shard HF Hub download → per-tensor NVFP4 quant → write back
  • 4 quantization workers (one per GPU), ~5 hours wall time

Citation

If you use this quantization, please credit the original model and NVIDIA's NVFP4 work:

License

MIT (inherited from GLM-5.2).

Downloads last month
312
Safetensors
Model size
400B params
Tensor type
F32
·
BF16
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Lorbus/GLM-5.2-NVFP4

Base model

zai-org/GLM-5.2
Quantized
(24)
this model

Papers for Lorbus/GLM-5.2-NVFP4