GLM-4.7 NVFP4

NVFP4 (W4A4) quantization of zai-org/GLM-4.7, produced with NVIDIA TensorRT Model Optimizer 0.43.0.

  • Base model: GLM-4.7 (Glm4MoeForCausalLM), 92 layers, 160 routed experts, hidden size 5120, BF16 weights (~668 GB).
  • Quantization: NVFP4 — 4-bit weights + 4-bit activations, block size 16, per-block FP8 E4M3 scales, per-tensor FP32 global scales.
  • Excluded from quantization: lm_head, embeddings, router gates.
  • Size on disk: ~200 GB (4 safetensors shards).
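To make the quantization scheme above concrete, here is a minimal NumPy sketch of block-scaled FP4 rounding: each 16-value block gets a scale chosen so its largest magnitude maps to the FP4 E2M1 maximum (6.0), and every value is rounded to the nearest representable level. This is an illustration only — real NVFP4 stores the per-block scale in FP8 E4M3 and applies a per-tensor FP32 global scale, both omitted here.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (each also carries a sign bit).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block):
    """Fake-quantize one 16-value block: pick a per-block scale so the
    largest magnitude maps to the FP4 maximum (6.0), then round each
    value to the nearest representable level and scale back."""
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0
    scaled = block / scale
    # Nearest FP4 level for each |value| (8 candidate magnitudes).
    idx = np.abs(np.abs(scaled)[:, None] - FP4_LEVELS[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_LEVELS[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
w_hat = quantize_block_fp4(w)
# Worst-case error is half the widest FP4 gap (4 -> 6) times the scale,
# i.e. at most amax / 6 for any block.
max_err = np.abs(w - w_hat).max()
```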

Serve with vLLM

vllm serve <local_path_or_hub_id> \
  --quantization modelopt \
  --tensor-parallel-size 8 \
  --max-model-len 65536

Then query:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<local_path_or_hub_id>",
    "messages": [{"role":"user","content":"Hello!"}],
    "max_tokens": 128
  }'
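The same request can be issued from Python with only the standard library; this sketch builds the identical payload and assumes the server from the step above is running on localhost:8000.

```python
import json
import urllib.request

# Same request body as the curl example above.
payload = {
    "model": "<local_path_or_hub_id>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the vLLM server running, send it and print the reply:
# body = json.load(urllib.request.urlopen(req))
# print(body["choices"][0]["message"]["content"])
```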

Hardware

NVFP4 GEMM kernels require NVIDIA Blackwell (SM100, e.g. B200/GB200). Earlier architectures can run via vLLM's NVFP4 emulation path but will not realize the throughput benefits.
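A quick way to check where a given GPU falls is its CUDA compute capability — Blackwell data-center parts (B200/GB200) report 10.x. A small helper, assuming only that capability major version 10+ means native NVFP4 support:

```python
def nvfp4_native(major: int, minor: int = 0) -> bool:
    """True if a GPU with this CUDA compute capability has native NVFP4
    GEMM kernels. Blackwell (SM100, e.g. B200/GB200) reports 10.x;
    older architectures fall back to vLLM's emulation path."""
    return major >= 10

# With PyTorch installed, the local GPU can be checked like so:
#   import torch
#   print(nvfp4_native(*torch.cuda.get_device_capability()))
```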

License

Inherits the license of the base model (zai-org/GLM-4.7).
