GLM-5.2 — W4A16 (INT4) + BF16 MTP
An INT4 weight-only (W4A16) quantization of GLM-5.2 that preserves the BF16 multi-token-prediction (MTP) layer for speculative decoding. Quantized from zai-org/GLM-5.2 with llm-compressor (GPTQ).
Built for Hopper (H200). It matches FP8 quality on half the GPUs (4×H200 vs 8) and is the fastest H200-servable 4-bit GLM-5.2 in the interactive/agentic regime — because it ships a working MTP draft head and because NVFP4 has no FP4 tensor cores on Hopper (see below).
Why this model
- Half the footprint, FP8 quality. ~405 GB of weights (down from ~1.49 TB BF16) serve one replica on 4×H200 instead of 8 — freeing half the fleet, or two replicas per node — and eval matches the FP8 baseline within noise across reasoning, instruction-following, long-context, and agentic coding.
- Fastest 4-bit GLM-5.2 on Hopper for interactive/agentic workloads. At concurrency 1 it does 132 tok/s — +79% vs NVFP4, +69% vs AWQ-INT4, +48% vs FP8 — from MTP speculative decoding, which neither popular 4-bit competitor ships.
- Why it beats NVFP4 on Hopper specifically. NVFP4 is a Blackwell-native FP4 format; FP4 tensor cores exist only on Blackwell (sm_100/sm_103). On an H200 NVFP4 loads and runs but falls back to no FP4 acceleration, so it has no speed edge here while also shipping no MTP. On Hopper, this model wins; we make no claim about Blackwell (NVFP4's native throughput was not measured there).
- Honest trade-off. The MTP win is at low/medium concurrency; at full saturation (c32) the no-MTP quants edge ~13–15% ahead. Both directions are shown below.
Throughput at a glance (8×H200, vLLM bench, output tok/s, same harness)
| concurrency | This (W4A16+MTP) | nvidia NVFP4 | cyankiwi AWQ-INT4 | FP8 baseline |
|---|---|---|---|---|
| 1 (interactive) | 132 | 74 | 78 | 89 |
| 8 (mid) | 466 | 410 | 465 | 354 |
| 32 (saturated) | 825 | 944 | 960 | 953 |
Leads at c1/c8 (where latency matters for chat and agents); the simpler no-MTP quants pull ahead only once the batch is fully saturated at c32. Full eval and methodology below.
Purpose
GLM-5.2 (744B-parameter MoE) in BF16 needs ~1.49 TB of weights — eight 141 GB H200s, fully occupied, to serve one replica. The goal of this artifact is a smaller-footprint variant that matches FP8 quality so the model runs on four H200s instead of eight (freeing half the fleet, or two replicas per node), while keeping the MTP draft head for speculative-decode speedups. It is a deployment-efficiency artifact, not a new model — all capability comes from the base GLM-5.2.
Details
| Field | Value |
|---|---|
| Base model | zai-org/GLM-5.2 (BF16) |
| Architecture | GlmMoeDsaForCausalLM — 744B MoE, ~40B active, MLA + DeepSeek Sparse Attention, 1M context |
| Weight quantization | W4A16, INT4, asymmetric, group-size 128 (GPTQ, compressed-tensors), routed experts only |
| Kept in BF16 | attention, dense layers (0–2), shared experts, router/gate, embeddings, lm_head, MTP layer 78 |
| MTP | layer 78 preserved at BF16 for spec-decode (num_speculative_tokens=5) |
| Calibration | in-distribution chat/code set; calibrate_all_experts=True (visits every expert — see Method) |
| Size | ~405 GB (from ~1488 GB BF16) |
| License | MIT (inherited from the base model) |
The "FP8" sometimes seen in the filename refers to the fp8 KV-cache used at serving time, not the weights — the weights are INT4 (W4A16) and the MTP layer is BF16.
Evaluation — vs the FP8 baseline (same harness, 8×H200)
Measured against zai-org/GLM-5.2-FP8 under an identical setup (generative tasks via chat-completions with a
16,384-token generation budget for the reasoning CoT; matched serve config with --reasoning-parser).
| Task | This (W4A16+MTP) | FP8 baseline |
|---|---|---|
| GSM8K (strict) | 0.960 | 0.955 |
| IFEval (prompt-strict / inst-strict) | 0.909 / 0.911 | 0.891 / 0.903 |
| MATH-500 (math-verify) | 0.954 | 0.958 |
| RULER @ 32K | 0.832 | 0.831 |
| RULER @ 64K | 0.841 | 0.813 |
| SWE-bench Verified (mini-SWE-agent + official grading) | 82.0% (410/500) | 82.2% (411/500) |
Quantization preserves quality: scores track the FP8 baseline within run-to-run noise on reasoning, instruction-following, long-context retrieval, and agentic coding. (MMLU-Pro: FP8 full-set = 0.820; the W4A16 subset run was not completed — the verdict was already conclusive from the six tasks above. RULER used 50 samples per sub-task, not the full 500.)
Long context: serves at max_model_len=1,048,576 on 8×H200 and correctly retrieved a needle from a
~936K-token prompt (MLA + DSA compress the KV cache enough to fit 1M in the memory free after weights).
On 4×H200 it serves 128K validated (single-stream engine ceiling ~239K at gpu-memory-utilization=0.92;
256K overflows the post-weights KV budget) and retrieved a 64K needle at both mid- and end-placement.
MTP: speculative-decode acceptance 46–52% aggregate (95% at draft position 0) on 8×H200, confirming the
injected BF16 MTP layer is healthy. On 4×H200 (TP=4, 128K) aggregate acceptance is ~38% (7,848/20,765 draft
tokens, mean accept-length ~2.9) — mildly lower under the tighter memory split but still a net speedup.
Throughput (8×H200, vLLM bench, output tok/s):
| concurrency | This | FP8 |
|---|---|---|
| 1 | 132 (+48%) | 89 |
| 8 | 466 (+32%) | 354 |
| 32 | 825 (−13%) | 953 |
Faster than FP8 at low/medium concurrency (MTP speculative decoding helps most in the interactive regime) and slightly slower at full saturation — honest trade-off, both directions shown.
Throughput vs popular community 4-bit quants (8×H200, output tok/s, same harness):
| concurrency | This (W4A16+MTP) | cyankiwi AWQ-INT4 | nvidia NVFP4 |
|---|---|---|---|
| 1 | 132 | 78 | 74 |
| 8 | 466 | 465 | 410 |
| 32 | 825 | 960 | 944 |
Against the most-downloaded H200-servable GLM-5.2 4-bit quants: this model leads the interactive/agentic regime — +79% vs NVFP4 and +69% vs cyankiwi at concurrency 1, +14% vs NVFP4 at 8 — because of MTP speculative decoding (neither competitor ships a usable MTP head). At full saturation (c32) the no-MTP quants are ~15% faster (MTP's draft/verify overhead stops paying off once the batch is full). NVFP4 is a Blackwell-native FP4 format; on the H200 it runs without FP4 tensor cores (not measured on Blackwell). This is a throughput comparison only — all are ~4-bit quants of the same base model, so quality is close across the field.
Serving (vLLM ≥ 0.23, Hopper / H200)
The asymmetric W4A16 MoE requires expert parallelism (--enable-expert-parallel); plain tensor-parallel
trips a Marlin scale-sharding bug. The DSA indexer needs an nvcc ≥ 12.8 toolchain (CUDA_HOME).
8×H200 (up to 1M context):
vllm serve <repo> \
--tensor-parallel-size 8 --enable-expert-parallel \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
--reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
--max-model-len 1048576 --gpu-memory-utilization 0.90 --trust-remote-code
4×H200 (the footprint win, 128K validated / ~239K single-stream ceiling — 1M needs all 8):
vllm serve <repo> --tensor-parallel-size 4 --enable-expert-parallel \
--kv-cache-dtype fp8 --speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
--reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
--max-model-len 32768 --gpu-memory-utilization 0.92 --trust-remote-code
Validated on Hopper (H200). On Blackwell (sm100) the serving kernels need extra flags and are not yet recommended for this artifact.
Method
- GPTQ W4A16 (group-128, asymmetric) on the routed experts only, with attention/dense/MTP/embeddings/
lm_head held at BF16.
calibrate_all_experts=Trueis required — calibrating only routed experts starves rarely-activated experts and produces a coherent-looking but degenerate model. - MTP preservation (Option-Y): GLM-5.2's MTP/nextn layer (index 78) isn't instantiated by
from_pretrained, so quantization never sees it. It is injected back at BF16 from the source checkpoint after quantization and added to theignorelist so the serving stack treats it as unquantized.
The full recipe, evaluation methodology, and a log of the engineering walls hit and overcome are in the companion repository (calibration memory limits, MoE sequential-target OOMs, the MTP-loss-on-save issue, the asymmetric-MoE serving fix, and the Blackwell toolchain gaps).
Limitations
- Throughput is ~13% below FP8 at very high concurrency (c32); the win is at low/medium concurrency.
- 1M-context serving requires all 8 H200s; 4×H200 serves up to ~128K (single-stream engine ceiling ~239K), with MTP acceptance ~38% (vs ~46–52% on 8×H200).
- Asymmetric weights require
--enable-expert-parallelto serve correctly. - Recommended on Hopper; Blackwell serving needs additional kernel flags.
Acknowledgements
Built on zai-org/GLM-5.2 (MIT). Quantized with llm-compressor; served with vLLM.
- Downloads last month
- 5
Model tree for canada-quant/GLM-5.2-W4A16-MTP
Base model
zai-org/GLM-5.2