
Libraries

How to use Lorbus/GLM-5.2-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Lorbus/GLM-5.2-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("Lorbus/GLM-5.2-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("Lorbus/GLM-5.2-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Lorbus/GLM-5.2-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server (the model does not fit on one GPU, so shard it across 8):
vllm serve "Lorbus/GLM-5.2-NVFP4" --tensor-parallel-size 8
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Lorbus/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Lorbus/GLM-5.2-NVFP4

SGLang

How to use Lorbus/GLM-5.2-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server (the model does not fit on one GPU, so shard it across 8):
python3 -m sglang.launch_server \
    --model-path "Lorbus/GLM-5.2-NVFP4" \
    --tp 8 \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Lorbus/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Lorbus/GLM-5.2-NVFP4" \
        --tp 8 \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Lorbus/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


GLM-5.2-NVFP4

NVFP4-quantized checkpoint of zai-org/GLM-5.2 (a 753B-parameter MoE model with IndexShare sparse attention). Quantization shrinks the BF16 checkpoint from ~1.37 TB to ~459 GB (≈3× smaller), so it fits on an 8-GPU Blackwell node (e.g. 8×96 GB) with room left for long-context KV cache.

This is a community-built quantization of GLM-5.2 to NVIDIA's NVFP4 format (E2M1 + FP8 E4M3 scales, 16-element blocks), built using a per-shard streaming recipe derived from NVIDIA ModelOpt's NVFP4_EXPERTS_ONLY_CFG and TensorRT-LLM's DeepSeek-V3.2 precision strategy.

Note on footprint: this is a 753B-parameter model. Even at NVFP4 it occupies ~459 GB on disk and in VRAM, because attention, norms, embeddings, the router, MTP auxiliary heads, the indexer, and the first/last layers are deliberately kept in BF16/FP32 (see the precision table). It does not fit on a single GPU. Plan for a multi-GPU node with ≥ 6 GPUs for weights alone, and 8 GPUs in practice to leave headroom for KV cache at long context.
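
The GPU counts in this note can be checked with simple arithmetic. The sketch below is a rough estimate, not a measurement from this build: it assumes 96 GB per GPU (as in the tested config) and that only about 90% of each GPU's memory is usable by the serving engine (0.9 is vLLM's default `gpu_memory_utilization`). The function names are illustrative.

```python
import math

WEIGHTS_GB = 459        # checkpoint size stated in this card
GPU_GB = 96             # per-GPU memory of the tested config
USABLE_FRACTION = 0.9   # ASSUMPTION: share of each GPU available to the engine


def min_gpus_for_weights(weights_gb, gpu_gb, usable_fraction):
    """Smallest GPU count whose usable memory covers the weights alone."""
    return math.ceil(weights_gb / (gpu_gb * usable_fraction))


def headroom_gb(num_gpus, weights_gb, gpu_gb, usable_fraction):
    """Usable memory left for KV cache and activations once weights are loaded."""
    return num_gpus * gpu_gb * usable_fraction - weights_gb


print(min_gpus_for_weights(WEIGHTS_GB, GPU_GB, USABLE_FRACTION))
print(round(headroom_gb(8, WEIGHTS_GB, GPU_GB, USABLE_FRACTION), 1))
```

Under these assumptions the weights alone need 6 GPUs, and an 8-GPU node leaves roughly 232 GB in total for KV cache and activations.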

Format

| Component | Precision | Notes |
| --- | --- | --- |
| Embeddings, `lm_head` | BF16 | Excluded from quantization in NVIDIA's recipe |
| All `norm` / `layernorm` / `k_norm` / `q_norm` | BF16 | All norms stay BF16 |
| Attention block (`.self_attn.`) | BF16 | Per DeepSeek-R1 recipe |
| Indexer `weights_proj` | FP32 | Per DeepSeek-V3.2 DSA recipe |
| Indexer low-rank (`q_a`, `k_a`) | BF16 | Per DeepSeek-V3.2 DSA recipe |
| Router / gate | BF16 | RouterGEMM uses BF16 inputs/weights |
| MTP auxiliary heads (`eh_proj`, `enorm`, `hnorm`, `shared_head`) | BF16 | GLM-5.2 IndexShare MTP module (in `model.layers.78`) |
| First 2 + last 2 layers (`model.layers.{0,1,76,77}`) | BF16 | Per DeepSeek-R1 boundary rule; the rule also captures the MTP head at layer index 78 (the 79th layer) |
| Sparse experts (`*.experts.{gate,up,down}_proj`) | NVFP4 | Block-scaled FP4; the bulk of the weights |
| Shared experts (`.shared_experts.`) | BF16 | Kept BF16 in this build |

Everything else not listed: NVFP4 block-scaled FP4.
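
Since precision is assigned by tensor name, the table can be read as an ordered list of name patterns. The sketch below is one possible encoding, not the build script itself. The rule order (first match wins) and several name fragments (`indexer`, `embed_tokens`, `.mlp.gate.`) are assumptions made for illustration and should be checked against the actual checkpoint keys.

```python
import re

# First matching rule wins; anything unmatched falls through to NVFP4,
# mirroring "Everything else not listed" in the table.
# ASSUMPTION: fragments such as "indexer", "embed_tokens" and ".mlp.gate."
# are hypothetical tensor-name components, not confirmed checkpoint keys.
PRECISION_RULES = [
    (r"indexer.*weights_proj", "FP32"),
    (r"indexer.*(q_a|k_a)", "BF16"),
    (r"embed_tokens|lm_head", "BF16"),
    (r"norm", "BF16"),                          # layernorm, k_norm, q_norm, enorm, hnorm
    (r"\.self_attn\.", "BF16"),
    (r"\.mlp\.gate\.", "BF16"),                 # router / gate (does not match gate_proj)
    (r"eh_proj|shared_head", "BF16"),           # MTP auxiliary heads
    (r"model\.layers\.(0|1|76|77)\.", "BF16"),  # first 2 + last 2 layers
    (r"\.shared_experts\.", "BF16"),
    (r"\.experts\..*(gate|up|down)_proj", "NVFP4"),
]


def precision_for(tensor_name: str) -> str:
    """Return the storage precision the table implies for a tensor name."""
    for pattern, precision in PRECISION_RULES:
        if re.search(pattern, tensor_name):
            return precision
    return "NVFP4"


# Hypothetical tensor names, used only to exercise the rules.
for name in (
    "model.layers.5.mlp.experts.12.up_proj.weight",
    "model.layers.5.self_attn.indexer.weights_proj.weight",
    "model.layers.0.mlp.experts.3.down_proj.weight",
):
    print(name, "->", precision_for(name))
```

Putting the FP32 indexer rule ahead of the generic `.self_attn.` rule matters if the indexer lives inside the attention block; otherwise the broader BF16 rule would win.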

Architecture

Base model: GLM-5.2 (753B params, MoE, 78 transformer layers + 1 MTP layer at index 78, IndexShare sparse attention)
Quantization: NVFP4 (E2M1 + FP8 E4M3, 16-element block scales)
Block size: 16
Quant method: modelopt
Calibration: static per-block percentile-0.9999 scales (no forward-pass calibration — see Limitations)
On-disk size: ~459 GB (NVFP4 packed weights + FP8 scales + BF16/FP32 kept layers)
Compression: ~1.37 TB (BF16) → ~459 GB ≈ 3.0×
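
To make the block-scaled format concrete, here is a minimal pure-Python sketch of quantizing one 16-element block to E2M1 values with a shared scale. It is a simplification under stated assumptions: the scale is derived from the block's maximum absolute value (this build uses percentile-based scales instead) and is kept as a Python float rather than rounded to FP8 E4M3. Function names are illustrative.

```python
import math

# All non-negative values representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16
# One 4-bit code per weight plus one 8-bit scale shared by 16 weights.
BITS_PER_WEIGHT = 4 + 8 / BLOCK_SIZE


def quantize_block(block):
    """Return (scale, codes) for one block of BLOCK_SIZE floats."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    # SIMPLIFICATION: max-abs scale, stored in full precision.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a scale and its E2M1 codes."""
    return [scale * c for c in codes]


block = [float(i) for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
print(scale, codes)
print(dequantize_block(scale, codes))
```

The E2M1 grid is coarse near its top end (4.0 jumps straight to 6.0), which is why the choice of scale, and outlier handling such as percentile clipping, has a visible effect on reconstruction error.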

Hardware

Required: NVIDIA Blackwell GPUs (B200, GB200, or RTX PRO 6000 Blackwell). NVFP4 tensor cores are Blackwell-only.
VRAM for weights: ~459 GB → minimum 6× 96 GB GPUs just to hold weights; 8 GPUs recommended for KV cache headroom.
Tested config: single node, 8× RTX PRO 6000 Blackwell (96 GB each), tensor-parallel 8.
Does NOT fit on a single GPU.
Inference: TensorRT-LLM, vLLM, or SGLang with modelopt NVFP4 support.

Loading

vLLM (v0.23.0+)

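from vllm import LLM, SamplingParams

llm = LLM(
    model="Lorbus/GLM-5.2-NVFP4",
    quantization="modelopt",
    kv_cache_dtype="fp8",
    tensor_parallel_size=8,   # needs the full 8-GPU node
    trust_remote_code=True,
    max_model_len=1_000_000,
)

# Run one chat turn to confirm the model loaded and generates text.
# The sampling values are arbitrary examples, not tuned recommendations.
sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.chat([{"role": "user", "content": "Who are you?"}], sampling)
print(outputs[0].outputs[0].text)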

SGLang (v0.5.13.post1+)

python3 -m sglang.launch_server \
    --model-path Lorbus/GLM-5.2-NVFP4 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8 \
    --tp 8 \
    --trust-remote-code \
    --port 8888
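
Once the server is up, it can be queried from Python as well as curl. The sketch below uses only the standard library and assumes the server launched above is reachable at `http://localhost:8888` and exposes the OpenAI-compatible `/v1/chat/completions` route; the helper names and the `max_tokens` default are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=64):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(base_url, model, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, assuming the server is running locally on port 8888:
# print(chat("http://localhost:8888", "Lorbus/GLM-5.2-NVFP4",
#            "What is the capital of France?"))
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.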

Transformers (v0.5.12+) / KTransformers (v0.5.12+)

Both frameworks now natively load modelopt NVFP4 checkpoints with trust_remote_code=True. See framework docs for details.

Methodology

This quantization was produced with a per-shard streaming pipeline that downloads GLM-5.2 shards one at a time from HuggingFace Hub, quantizes each tensor in isolation, and writes the result back. We do not load the full BF16 model into VRAM (1.37 TB BF16 wouldn't fit on a 768 GB GPU box), and we do not run forward-pass calibration for the same reason.
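
The control flow of such a per-shard pipeline can be sketched as follows. This is a schematic under stated assumptions, not the actual build script: all four arguments are hypothetical stand-ins for the download, quantization, and upload steps, and plain Python lists stand in for tensors.

```python
def stream_quantize(shard_names, fetch_shard, quantize_tensor, write_shard):
    """Convert shards one at a time so only one shard is held in memory.

    All callables are hypothetical stand-ins:
      fetch_shard(name)         -> dict of tensor name -> tensor
      quantize_tensor(key, t)   -> quantized (or untouched) tensor
      write_shard(name, dict)   -> persists the converted shard
    Returns the number of tensors processed.
    """
    processed = 0
    for name in shard_names:
        tensors = fetch_shard(name)
        # Each tensor is converted in isolation; no forward pass is involved.
        converted = {key: quantize_tensor(key, t) for key, t in tensors.items()}
        write_shard(name, converted)
        processed += len(converted)
        del tensors, converted  # release the shard before fetching the next one
    return processed


# Toy in-memory demo: "quantization" here just rounds values.
source = {"shard-1": {"a": [0.26, 1.74]}, "shard-2": {"b": [2.51]}}
written = {}
count = stream_quantize(
    shard_names=list(source),
    fetch_shard=lambda name: source[name],
    quantize_tensor=lambda key, t: [round(v) for v in t],
    write_shard=lambda name, d: written.update({name: d}),
)
print(count, written)
```

Because each tensor is handled independently, peak memory is bounded by the largest shard rather than the full model, which is also why calibration methods that need whole-model forward passes are out of reach for this approach.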

Quality techniques applied (vs NVIDIA's full ModelOpt recipe):

| Technique | NVIDIA full | This build |
| --- | --- | --- |
| E2M1 + FP8 block-scaled NVFP4 | yes | yes |
| Block size 16 | yes | yes |
| Mixed-precision routing (BF16 excludes) | yes | yes |
| FP32 indexer `weights_proj` | yes | yes |
| First/last N layers BF16 | yes | yes |
| Percentile (outlier-robust) scales | yes | yes |
| `fp8_scale_sweep` (search 128 FP8 scales) | yes | no (~0.5% est. loss) |
| `local_hessian` calibration | yes | no (~0.5% est. loss) |
| `moe_calib_experts_ratio` (all-expert forward) | yes | no (~1–2% est. loss for MoE) |
| Calibration forward passes on real data | yes | no (~1–3% est. loss) |

Expected quality: estimated 92–96% of NVIDIA's full ModelOpt NVFP4 recipe. This is an estimate, not a measurement — see Limitations.

Limitations

No benchmark evaluations have been run. The 92–96% figure is an engineering estimate based on which calibration steps were skipped, not a measured score. Verify quality on your own downstream task before relying on it.
We cannot reproduce NVIDIA's full PTQ pipeline because GLM-5.2 BF16 (1.37 TB) does not fit in the 768 GB VRAM of the build box, and local_hessian / forward-pass calibration require loading the full model.
The IndexShare sparse-attention design is GLM-5.2-specific; to our knowledge this is the first published quantization applying the DSA-style precision recipe to it. The indexer handling is by name-pattern, not a verified arch-level analysis.
NVFP4 checkpoint support in serving frameworks is still marked experimental.

Reproducing

Build infrastructure:

8× NVIDIA RTX PRO 6000 Blackwell (96 GB each), PCIe-only (no NVLink)
Streaming per-shard HF Hub download → per-tensor NVFP4 quant → write back
4 quantization workers (one per GPU), ~5 hours wall time

Citation

If you use this quantization, please credit the original model (zai-org/GLM-5.2) and NVIDIA's NVFP4 work.

License

MIT (inherited from GLM-5.2).

Papers for Lorbus/GLM-5.2-NVFP4

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse (paper 2603.12201, published Mar 12)

GLM-5: from Vibe Coding to Agentic Engineering (paper 2602.15763, published Feb 17)