GLM-5.2-NVFP4
NVFP4-quantized checkpoint of zai-org/GLM-5.2 (753B-param MoE with IndexShare sparse attention). Shrinks the BF16 checkpoint from ~1.37 TB to ~459 GB (≈3× smaller) so it fits on an 8-GPU Blackwell node (e.g. 8×96 GB) with room for long-context KV cache.
This is a community-built quantization of GLM-5.2 to NVIDIA's NVFP4 format
(E2M1 + FP8 E4M3 scales, 16-element blocks), built using a per-shard streaming
recipe derived from NVIDIA ModelOpt's NVFP4_EXPERTS_ONLY_CFG and
TensorRT-LLM's DeepSeek-V3.2 precision strategy.
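To make the format concrete, here is a minimal pure-Python sketch of block-scaled FP4 quantization: each 16-element block shares one scale, and every value is snapped to the nearest E2M1 magnitude. It is an illustration under simplifying assumptions, not the code used for this build: the scale is derived from the block's absolute maximum and kept as a plain float, whereas a real NVFP4 checkpoint stores the block scale rounded to FP8 E4M3 and packs two FP4 codes per byte. The function names are hypothetical.

```python
# Magnitudes representable by an E2M1 value (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # number of elements sharing one scale


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed E2M1 grid values."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude onto the top grid value (6.0).
    # Assumption: the scale stays a Python float instead of being rounded to FP8.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))  # nearest grid point
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its grid codes."""
    return [scale * c for c in codes]


def quantize_tensor(values):
    """Split a flat list into 16-element blocks and quantize each one."""
    assert len(values) % BLOCK_SIZE == 0, "pad to a multiple of the block size first"
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]
```

Because the scale is chosen per block, a single outlier only degrades the resolution of the 15 values that share its block rather than the whole tensor.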
Note on footprint: this is a 753B-parameter model. Even at NVFP4 it is ~459 GB on disk and in VRAM (because attention, norms, embeddings, the router, MTP auxiliary heads, the indexer, and the first/last layers are deliberately kept in BF16/FP32 — see the precision table). It does not fit on a single GPU. Plan for a multi-GPU node with ≥ 6 GPUs for weights alone, and 8 GPUs in practice to leave headroom for KV cache at long context.
Format
| Component | Precision | Notes |
|---|---|---|
| Embeddings, `lm_head` | BF16 | NVIDIA excludes |
| All `*norm*` / `*layernorm*` / `*k_norm*` / `*q_norm*` | BF16 | All norms stay BF16 |
| Attention block (`*.self_attn.*`) | BF16 | Per DeepSeek-R1 recipe |
| Indexer `weights_proj` | FP32 | Per DeepSeek-V3.2 DSA recipe |
| Indexer low-rank (`q_a`, `k_a`) | BF16 | Per DeepSeek-V3.2 DSA recipe |
| Router / gate | BF16 | RouterGEMM uses BF16 inputs/weights |
| MTP auxiliary heads (`eh_proj`, `enorm`, `hnorm`, `shared_head`) | BF16 | GLM-5.2 IndexShare MTP module (in `model.layers.78`) |
| First 2 + last 2 layers (`model.layers.{0,1,76,77}`) | BF16 | Per DeepSeek-R1 boundary rule; layer 78+1 also captures the MTP head |
| Sparse experts (`*.experts.{gate,up,down}_proj`) | NVFP4 | Block-scaled FP4; the bulk of the weights |
| Shared experts (`*.shared_experts.*`) | BF16 | Kept BF16 in this build |
Everything else not listed: NVFP4 block-scaled FP4.
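The table can be read as an ordered list of name-pattern rules. The sketch below shows one way such routing could be written with the standard-library `fnmatch` module. The rule order, the helper names, and the tensor names used for the embeddings and router (`*embed*`, `*.gate.weight`) are assumptions for illustration; the patterns actually used for this build may differ.

```python
from fnmatch import fnmatchcase

BOUNDARY_LAYERS = {0, 1, 76, 77}  # first 2 + last 2 transformer layers

# (glob pattern, precision), checked in order; the first match wins.
RULES = [
    ("*weights_proj*", "FP32"),      # indexer weights_proj
    ("*norm*", "BF16"),              # all norms, including enorm / hnorm
    ("*.self_attn.*", "BF16"),       # attention block, including indexer q_a / k_a
    ("*eh_proj*", "BF16"),           # MTP auxiliary heads
    ("*shared_head*", "BF16"),
    ("*.shared_experts.*", "BF16"),
    ("*embed*", "BF16"),             # hypothetical embedding tensor name
    ("lm_head*", "BF16"),
    ("*.gate.weight", "BF16"),       # hypothetical router tensor name
]


def layer_index(name):
    """Return the integer following 'model.layers.', or None if there is none."""
    parts = name.split(".")
    if len(parts) > 2 and parts[:2] == ["model", "layers"] and parts[2].isdigit():
        return int(parts[2])
    return None


def precision_for(name):
    """Pick a storage precision for a tensor name."""
    for pattern, precision in RULES:
        if fnmatchcase(name, pattern):
            return precision
    if layer_index(name) in BOUNDARY_LAYERS:
        return "BF16"                # boundary layers stay unquantized
    return "NVFP4"                   # everything else, e.g. sparse experts
```

For example, `precision_for("model.layers.5.mlp.experts.12.gate_proj.weight")` falls through every rule and returns `"NVFP4"`, while the same tensor in layer 0 returns `"BF16"`.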
Architecture
- Base model: GLM-5.2 (753B params, MoE, 78 transformer layers + 1 MTP layer at index 78, IndexShare sparse attention)
- Quantization: NVFP4 (E2M1 + FP8 E4M3, 16-element block scales)
- Block size: 16
- Quant method: `modelopt`
- Calibration: static per-block percentile-0.9999 scales (no forward-pass calibration; see Limitations)
- On-disk size: ~459 GB (NVFP4 packed weights + FP8 scales + BF16/FP32 kept layers)
- Compression: ~1.37 TB (BF16) → ~459 GB ≈ 3.0×
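The numbers above can be checked with a short calculation. Each quantized weight costs 4 bits plus its share of one 8-bit block scale, which gives 4.5 bits per weight and a ratio of about 3.56× against BF16 for the quantized tensors alone. The whole-checkpoint ratio is lower because the BF16/FP32 layers are not compressed. The variable names are illustrative.

```python
FP4_BITS = 4      # one E2M1 value
SCALE_BITS = 8    # one FP8 E4M3 block scale
BLOCK_SIZE = 16   # weights sharing that scale
BF16_BITS = 16

# Effective storage cost of one NVFP4 weight, scale overhead included.
bits_per_weight = FP4_BITS + SCALE_BITS / BLOCK_SIZE

# Compression of the quantized tensors alone.
quantized_ratio = BF16_BITS / bits_per_weight

# Compression of the whole checkpoint, from the sizes quoted in this card (GB).
checkpoint_ratio = 1370 / 459

print(f"{bits_per_weight} bits/weight, "
      f"{quantized_ratio:.2f}x on quantized tensors, "
      f"{checkpoint_ratio:.2f}x overall")
```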
Hardware
- Required: NVIDIA Blackwell GPUs (B200, GB200, or RTX PRO 6000 Blackwell). NVFP4 tensor cores are Blackwell-only.
- VRAM for weights: ~459 GB → minimum 6× 96 GB GPUs just to hold weights; 8 GPUs recommended for KV cache headroom.
- Tested config: single node, 8× RTX PRO 6000 Blackwell (96 GB each), tensor-parallel 8.
- Does NOT fit on a single GPU.
- Inference: TensorRT-LLM, vLLM, or SGLang with `modelopt` NVFP4 support.
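A rough sizing sketch for the figures above. The 0.9 usable fraction is an assumption (it mirrors vLLM's default `gpu_memory_utilization`), not a number from this card; with it, 459 GB of weights needs six 96 GB GPUs, whereas plain division without any reserve would give five. The function names are hypothetical.

```python
import math


def gpus_needed(weights_gb, gpu_gb, usable_fraction=0.9):
    """Smallest GPU count whose usable memory holds the weights."""
    return math.ceil(weights_gb / (gpu_gb * usable_fraction))


def headroom_gb(weights_gb, gpu_gb, num_gpus):
    """Raw memory left for KV cache and activations after loading weights."""
    return num_gpus * gpu_gb - weights_gb


print(gpus_needed(459, 96))      # GPUs for weights alone, with a 10% reserve
print(headroom_gb(459, 96, 8))   # GB left on the tested 8-GPU node
```

The count from `gpus_needed` is only a lower bound: tensor-parallel sizes usually have to divide the model's head counts evenly, which is another reason to run on all 8 GPUs.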
Loading
vLLM (v0.23.0+)
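```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Lorbus/GLM-5.2-NVFP4",
    quantization="modelopt",
    kv_cache_dtype="fp8",
    tensor_parallel_size=8,  # needs the full 8-GPU node
    trust_remote_code=True,
    max_model_len=1_000_000,
)

# Illustrative request; the sampling values are placeholders, not tuned settings.
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.chat([{"role": "user", "content": "Who are you?"}], params)
print(outputs[0].outputs[0].text)
```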
SGLang (v0.5.13.post1+)
```shell
python3 -m sglang.launch_server \
  --model-path Lorbus/GLM-5.2-NVFP4 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8 \
  --tp 8 \
  --trust-remote-code \
  --port 8888
```
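Once the server is up, it can be queried through SGLang's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server is reachable on `localhost` at the port given above; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, host="localhost", port=8888,
                       model="Lorbus/GLM-5.2-NVFP4"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the first completion's text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` requires a running server; `build_chat_request` can be inspected on its own without one.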
Transformers (v0.5.12+) / KTransformers (v0.5.12+)
Both frameworks now natively load modelopt NVFP4 checkpoints with trust_remote_code=True. See framework docs for details.
Methodology
This quantization was produced with a per-shard streaming pipeline that downloads GLM-5.2 shards one at a time from HuggingFace Hub, quantizes each tensor in isolation, and writes the result back. We do not load the full BF16 model into VRAM (1.37 TB BF16 wouldn't fit on a 768 GB GPU box), and we do not run forward-pass calibration for the same reason.
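The control flow of such a pipeline can be sketched as a loop that keeps only one shard resident at a time. This is an outline under stated assumptions, not the build script: `fetch`, `quantize_tensor`, and `store` are hypothetical callables standing in for the Hub download, the per-tensor NVFP4 quantizer, and the write-back step.

```python
def stream_quantize(shard_names, fetch, quantize_tensor, store):
    """Quantize shards one at a time so the full model is never in memory.

    fetch(shard)                -> dict mapping tensor name to tensor
    quantize_tensor(name, t)    -> quantized (or passed-through) tensor
    store(shard, tensors)       -> None, persists the processed shard
    All three callables are hypothetical placeholders.
    """
    done = []
    for shard in shard_names:
        tensors = fetch(shard)                    # download a single shard
        quantized = {
            name: quantize_tensor(name, tensor)   # each tensor in isolation
            for name, tensor in tensors.items()
        }
        store(shard, quantized)                   # write the result back
        del tensors, quantized                    # drop references before the next shard
        done.append(shard)
    return done
```

Since every tensor is processed in isolation, nothing in this loop can observe activations, which is why calibration methods that need a forward pass are out of reach for this approach.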
Quality techniques applied (vs NVIDIA's full ModelOpt recipe):
| Technique | NVIDIA full | This build |
|---|---|---|
| E2M1 + FP8 block-scaled NVFP4 | yes | yes |
| Block size 16 | yes | yes |
| Mixed-precision routing (BF16 excludes) | yes | yes |
| FP32 indexer `weights_proj` | yes | yes |
| First/last N layers BF16 | yes | yes |
| Percentile (outlier-robust) scales | yes | yes |
| `fp8_scale_sweep` (search 128 FP8 scales) | yes | no (~0.5% est. loss) |
| `local_hessian` calibration | yes | no (~0.5% est. loss) |
| `moe_calib_experts_ratio` (all-expert forward) | yes | no (~1–2% est. loss for MoE) |
| Calibration forward passes on real data | yes | no (~1–3% est. loss) |
Expected quality: estimated 92–96% of NVIDIA's full ModelOpt NVFP4 recipe. This is an estimate, not a measurement — see Limitations.
Limitations
- No benchmark evaluations have been run. The 92–96% figure is an engineering estimate based on which calibration steps were skipped, not a measured score. Verify quality on your own downstream task before relying on it.
- We cannot reproduce NVIDIA's full PTQ pipeline because GLM-5.2 BF16 (1.37 TB) does not fit in the 768 GB VRAM of the build box, and `local_hessian` / forward-pass calibration require loading the full model.
- The IndexShare sparse-attention design is GLM-5.2-specific; to our knowledge this is the first published quantization applying the DSA-style precision recipe to it. The indexer handling is by name-pattern, not a verified arch-level analysis.
- NVFP4 checkpoint support in serving frameworks is still marked experimental.
Reproducing
Build infrastructure:
- 8× NVIDIA RTX PRO 6000 Blackwell (96 GB each), PCIe-only (no NVLink)
- Streaming per-shard HF Hub download → per-tensor NVFP4 quant → write back
- 4 quantization workers (one per GPU), ~5 hours wall time
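The card does not say how shards were divided among the workers. One simple option, shown here purely as an assumption, is a round-robin split so that each worker receives a near-equal share; `assign_shards` and the shard names in the example are hypothetical.

```python
def assign_shards(shard_names, num_workers=4):
    """Round-robin split: worker w gets shards w, w + n, w + 2n, ..."""
    return [shard_names[w::num_workers] for w in range(num_workers)]


shards = [f"shard-{i}" for i in range(10)]  # placeholder shard names
for worker, batch in enumerate(assign_shards(shards)):
    print(worker, batch)
```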
Citation
If you use this quantization, please credit the original model and NVIDIA's NVFP4 work:
- zai-org/GLM-5.2 — GLM-5.2 by Z.ai
- GLM-5 technical report (arXiv 2602.15763)
- IndexShare paper (arXiv 2603.12201)
- TensorRT-LLM: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs
- NVIDIA TensorRT Model Optimizer
License
MIT (inherited from GLM-5.2).