GLM-5.2 — W4A16 (INT4) + BF16 MTP

An INT4 weight-only (W4A16) quantization of GLM-5.2 that preserves the BF16 multi-token-prediction (MTP) layer for speculative decoding. Quantized from zai-org/GLM-5.2 with llm-compressor (GPTQ).

Built for Hopper (H200). It matches FP8 quality on half the GPUs (4×H200 vs 8) and is the fastest H200-servable 4-bit GLM-5.2 in the interactive/agentic regime — because it ships a working MTP draft head and because NVFP4 has no FP4 tensor cores on Hopper (see below).

Why this model

  • Half the footprint, FP8 quality. ~405 GB of weights (down from ~1.49 TB BF16) serve one replica on 4×H200 instead of 8 — freeing half the fleet, or two replicas per node — and eval matches the FP8 baseline within noise across reasoning, instruction-following, long-context, and agentic coding.
  • Fastest 4-bit GLM-5.2 on Hopper for interactive/agentic workloads. At concurrency 1 it does 132 tok/s — +79% vs NVFP4, +69% vs AWQ-INT4, +48% vs FP8 — from MTP speculative decoding, which neither popular 4-bit competitor ships.
  • Why it beats NVFP4 on Hopper specifically. NVFP4 is a Blackwell-native FP4 format; FP4 tensor cores exist only on Blackwell (sm_100/sm_103). On an H200 NVFP4 loads and runs but falls back to no FP4 acceleration, so it has no speed edge here while also shipping no MTP. On Hopper, this model wins; we make no claim about Blackwell (NVFP4's native throughput was not measured there).
  • Honest trade-off. The MTP win is at low/medium concurrency; at full saturation (c32) the no-MTP quants edge ~13–15% ahead. Both directions are shown below.

Throughput at a glance (8×H200, vLLM bench, output tok/s, same harness)

concurrency This (W4A16+MTP) nvidia NVFP4 cyankiwi AWQ-INT4 FP8 baseline
1 (interactive) 132 74 78 89
8 (mid) 466 410 465 354
32 (saturated) 825 944 960 953

Leads at c1/c8 (where latency matters for chat and agents); the simpler no-MTP quants pull ahead only once the batch is fully saturated at c32. Full eval and methodology below.

Purpose

GLM-5.2 (744B-parameter MoE) in BF16 needs ~1.49 TB of weights — eight 141 GB H200s, fully occupied, to serve one replica. The goal of this artifact is a smaller-footprint variant that matches FP8 quality so the model runs on four H200s instead of eight (freeing half the fleet, or two replicas per node), while keeping the MTP draft head for speculative-decode speedups. It is a deployment-efficiency artifact, not a new model — all capability comes from the base GLM-5.2.

Details

Field Value
Base model zai-org/GLM-5.2 (BF16)
Architecture GlmMoeDsaForCausalLM — 744B MoE, ~40B active, MLA + DeepSeek Sparse Attention, 1M context
Weight quantization W4A16, INT4, asymmetric, group-size 128 (GPTQ, compressed-tensors), routed experts only
Kept in BF16 attention, dense layers (0–2), shared experts, router/gate, embeddings, lm_head, MTP layer 78
MTP layer 78 preserved at BF16 for spec-decode (num_speculative_tokens=5)
Calibration in-distribution chat/code set; calibrate_all_experts=True (visits every expert — see Method)
Size ~405 GB (from ~1488 GB BF16)
License MIT (inherited from the base model)

The "FP8" sometimes seen in the filename refers to the fp8 KV-cache used at serving time, not the weights — the weights are INT4 (W4A16) and the MTP layer is BF16.

Evaluation — vs the FP8 baseline (same harness, 8×H200)

Measured against zai-org/GLM-5.2-FP8 under an identical setup (generative tasks via chat-completions with a 16,384-token generation budget for the reasoning CoT; matched serve config with --reasoning-parser).

Task This (W4A16+MTP) FP8 baseline
GSM8K (strict) 0.960 0.955
IFEval (prompt-strict / inst-strict) 0.909 / 0.911 0.891 / 0.903
MATH-500 (math-verify) 0.954 0.958
RULER @ 32K 0.832 0.831
RULER @ 64K 0.841 0.813
SWE-bench Verified (mini-SWE-agent + official grading) 82.0% (410/500) 82.2% (411/500)

Quantization preserves quality: scores track the FP8 baseline within run-to-run noise on reasoning, instruction-following, long-context retrieval, and agentic coding. (MMLU-Pro: FP8 full-set = 0.820; the W4A16 subset run was not completed — the verdict was already conclusive from the six tasks above. RULER used 50 samples per sub-task, not the full 500.)

Long context: serves at max_model_len=1,048,576 on 8×H200 and correctly retrieved a needle from a ~936K-token prompt (MLA + DSA compress the KV cache enough to fit 1M in the memory free after weights). On 4×H200 it serves 128K validated (single-stream engine ceiling ~239K at gpu-memory-utilization=0.92; 256K overflows the post-weights KV budget) and retrieved a 64K needle at both mid- and end-placement.

MTP: speculative-decode acceptance 46–52% aggregate (95% at draft position 0) on 8×H200, confirming the injected BF16 MTP layer is healthy. On 4×H200 (TP=4, 128K) aggregate acceptance is ~38% (7,848/20,765 draft tokens, mean accept-length ~2.9) — mildly lower under the tighter memory split but still a net speedup.

Throughput (8×H200, vLLM bench, output tok/s):

concurrency This FP8
1 132 (+48%) 89
8 466 (+32%) 354
32 825 (−13%) 953

Faster than FP8 at low/medium concurrency (MTP speculative decoding helps most in the interactive regime) and slightly slower at full saturation — honest trade-off, both directions shown.

Throughput vs popular community 4-bit quants (8×H200, output tok/s, same harness):

concurrency This (W4A16+MTP) cyankiwi AWQ-INT4 nvidia NVFP4
1 132 78 74
8 466 465 410
32 825 960 944

Against the most-downloaded H200-servable GLM-5.2 4-bit quants: this model leads the interactive/agentic regime — +79% vs NVFP4 and +69% vs cyankiwi at concurrency 1, +14% vs NVFP4 at 8 — because of MTP speculative decoding (neither competitor ships a usable MTP head). At full saturation (c32) the no-MTP quants are ~15% faster (MTP's draft/verify overhead stops paying off once the batch is full). NVFP4 is a Blackwell-native FP4 format; on the H200 it runs without FP4 tensor cores (not measured on Blackwell). This is a throughput comparison only — all are ~4-bit quants of the same base model, so quality is close across the field.

Serving (vLLM ≥ 0.23, Hopper / H200)

The asymmetric W4A16 MoE requires expert parallelism (--enable-expert-parallel); plain tensor-parallel trips a Marlin scale-sharding bug. The DSA indexer needs an nvcc ≥ 12.8 toolchain (CUDA_HOME).

8×H200 (up to 1M context):

vllm serve <repo> \
  --tensor-parallel-size 8 --enable-expert-parallel \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
  --reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
  --max-model-len 1048576 --gpu-memory-utilization 0.90 --trust-remote-code

4×H200 (the footprint win, 128K validated / ~239K single-stream ceiling — 1M needs all 8):

vllm serve <repo> --tensor-parallel-size 4 --enable-expert-parallel \
  --kv-cache-dtype fp8 --speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
  --reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
  --max-model-len 32768 --gpu-memory-utilization 0.92 --trust-remote-code

Validated on Hopper (H200). On Blackwell (sm100) the serving kernels need extra flags and are not yet recommended for this artifact.

Method

  1. GPTQ W4A16 (group-128, asymmetric) on the routed experts only, with attention/dense/MTP/embeddings/ lm_head held at BF16. calibrate_all_experts=True is required — calibrating only routed experts starves rarely-activated experts and produces a coherent-looking but degenerate model.
  2. MTP preservation (Option-Y): GLM-5.2's MTP/nextn layer (index 78) isn't instantiated by from_pretrained, so quantization never sees it. It is injected back at BF16 from the source checkpoint after quantization and added to the ignore list so the serving stack treats it as unquantized.

The full recipe, evaluation methodology, and a log of the engineering walls hit and overcome are in the companion repository (calibration memory limits, MoE sequential-target OOMs, the MTP-loss-on-save issue, the asymmetric-MoE serving fix, and the Blackwell toolchain gaps).

Limitations

  • Throughput is ~13% below FP8 at very high concurrency (c32); the win is at low/medium concurrency.
  • 1M-context serving requires all 8 H200s; 4×H200 serves up to ~128K (single-stream engine ceiling ~239K), with MTP acceptance ~38% (vs ~46–52% on 8×H200).
  • Asymmetric weights require --enable-expert-parallel to serve correctly.
  • Recommended on Hopper; Blackwell serving needs additional kernel flags.

Acknowledgements

Built on zai-org/GLM-5.2 (MIT). Quantized with llm-compressor; served with vLLM.

Downloads last month
5
Safetensors
Model size
116B params
Tensor type
I64
·
F32
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for canada-quant/GLM-5.2-W4A16-MTP

Base model

zai-org/GLM-5.2
Quantized
(73)
this model