GLM-5.2-Alis-MLX-Dynamic-3.5bpw

Apple Silicon (MLX) mixed-precision quantization of zai-org/GLM-5.2 — a 744B-parameter (~40B active) Mixture-of-Experts model with DeepSeek-V3.2-style MLA + DeepSeek Sparse Attention (DSA, glm_moe_dsa).

This build targets the "golden spot" for a 512 GB M3 Ultra: the best quality that still runs a full 1M-token context comfortably — ~3.5 bits/weight, served with int8 quantization of the MLA latent KV cache.

⚠️ Requires a patched mlx-lm with the glm_moe_dsa correctness fixes and int8 MLA-KV support. On stock mlx-lm it will not load/serve correctly, and --kv-bits is a silent no-op for this architecture (see Correctness and Long context).


Quality

A 2-bit (≤256 GB) build of this model is bit-starved; lifting the budget to 512 GB and spending 3.5 bpw is the single biggest quality lever.

Perplexity: 3.5 bpw vs the 2.56 bpw build — −32% wikitext, −14% code

| Corpus | this build (3.5 bpw) | 2.56 bpw build | Δ |
|---|---|---|---|
| wikitext-2 PPL (prose) | 2.946 | 4.340 | −32% |
| code PPL (mlx_lm src) | 1.893 | 2.203 | −14% |

Strided perplexity (ctx 2048 / stride 1024) from a fixed local harness — relative numbers for comparing builds, not directly comparable to perplexities other quantizers report on different corpora.
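
As a rough illustration of what "ctx 2048 / stride 1024" means, the sketch below enumerates the evaluation windows of a strided perplexity run and recomputes the relative deltas from the table. It is not the author's harness: the function names are hypothetical, and the scoring rule (each window scores only the tokens not already scored by the previous window) is the common sliding-window convention, assumed here.

```python
import math

def strided_windows(n_tokens, ctx=2048, stride=1024):
    """Yield (begin, end, n_scored) for each evaluation window.

    Every window spans at most `ctx` tokens, but only its last `n_scored`
    tokens contribute to the loss, so no token is scored twice.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood over all scored tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_change(new, old):
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old

windows = list(strided_windows(4096))
wikitext_delta = relative_change(2.946, 4.340)  # values from the table above
code_delta = relative_change(1.893, 2.203)
print(windows)
print(f"wikitext: {wikitext_delta:+.0%}, code: {code_delta:+.0%}")
```

After the first window, every scored token sees at least 1024 tokens of left context, which is why strided figures tend to be lower than non-overlapping-chunk figures on the same text.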

| Property | Value |
|---|---|
| Base model | zai-org/GLM-5.2 (744B total / ~40B active) |
| Bits/weight | ~3.535 (per-tensor mixed) |
| On-disk size | ~328 GB (306 GiB) |
| Peak memory (load + short gen) | ~331 GB |
| Peak memory (1M ctx, int8 KV) | ~376 GB |
| Format | MLX (Apple Silicon) |
| Context | up to 1M tokens (DSA sparse attention) |
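
The on-disk size follows almost directly from the parameter count and the average bits per weight. The sketch below is a back-of-the-envelope check only (the helper name is hypothetical, and file metadata is ignored); it lands within about 1 GB of the listed size.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate storage for quantized weights, ignoring file metadata."""
    return n_params * bits_per_weight / 8

size_bytes = weight_bytes(744e9, 3.535)  # figures from the table above
size_gb = size_bytes / 1e9               # decimal gigabytes
size_gib = size_bytes / 2**30            # binary gibibytes
print(f"{size_gb:.1f} GB = {size_gib:.1f} GiB")
```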

Recipe

Bits are allocated by sensitivity — cheap bits on the robust expert bulk, full precision on the discrete routing/sparse-attention control paths.

Mixed-precision recipe: experts 3-bit, attention/shared/dense 4-bit, embed/head 6-bit, router bf16, indexer fp16

| Component | Bits | Why |
|---|---|---|
| Routed experts (gate/up/down) | 3-bit g64 | ~96% of params (the bulk) |
| Shared experts · MLA attn (incl. kv_a/q_a) · dense MLP | 4-bit g64 | on every token's critical path |
| Token embedding · LM head | 6-bit g64 | distribution-sensitive |
| Router (mlp.gate) | bf16 | drives discrete top-8 routing; never quantized |
| DSA lightning indexer | fp16 | drives discrete top-k selection |
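
To see how a "3-bit" expert bulk ends up near 3.5 bits/weight overall, the sketch below accounts for per-group metadata. It assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias, which is how MLX affine quantization is usually stored; the 96% / 4% split is a simplification that lumps every non-expert tensor into the 4-bit tier.

```python
def effective_bpw(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Bits per weight including the per-group scale and bias (assumed 16-bit each)."""
    return bits + (scale_bits + bias_bits) / group_size

def blended_bpw(parts):
    """parts: iterable of (fraction_of_params, bits_per_weight) pairs."""
    parts = list(parts)
    total = sum(fraction for fraction, _ in parts)
    return sum(fraction * bpw for fraction, bpw in parts) / total

experts = effective_bpw(3)  # 3-bit g64
others = effective_bpw(4)   # 4-bit g64
# Hypothetical split: ~96% routed experts (from the table), the rest treated as 4-bit.
estimate = blended_bpw([(0.96, experts), (0.04, others)])
print(f"experts {experts} bpw, others {others} bpw, blended ~{estimate:.2f} bpw")
```

Under these assumptions the blend comes out close to the ~3.535 bpw reported for this build; the 6-bit embedding/head and the unquantized router and indexer are small enough that they barely move the average.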

Long context (1M) — int8 KV

GLM-5.2's MLA stores the compressed latent (kv_lora 512 + rope 64 per layer), so the 1M-token KV cache is small: ~95 GB at fp16, ~48 GB at int8. Quantizing the latent cache to int8 is what makes 1M fit comfortably at 3.5 bpw.

Memory budget: 3.5 bpw + int8 KV = 376 GB fits 512 GB with ~136 GB free; fp16 KV is tighter; 4.5 bpw overflows
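
The cache sizes above can be sanity-checked from the latent dimensions. In the sketch below the layer count (78) is a hypothetical value chosen because it roughly reproduces the ~95 GB figure; it is not stated in this card. "1M" is taken as 2^20 tokens, and the fp16 DSA indexer cache and per-group int8 scales are left out.

```python
def latent_kv_bytes(n_tokens, n_layers, kv_lora=512, rope_dim=64, bytes_per_elem=2):
    """Bytes for an MLA latent cache: one (kv_lora + rope_dim) vector per token per layer."""
    return n_tokens * n_layers * (kv_lora + rope_dim) * bytes_per_elem

N_TOKENS = 2**20  # "1M" taken as 1,048,576 tokens (assumption)
N_LAYERS = 78     # hypothetical layer count, not stated in this card

fp16_gb = latent_kv_bytes(N_TOKENS, N_LAYERS, bytes_per_elem=2) / 1e9
int8_gb = latent_kv_bytes(N_TOKENS, N_LAYERS, bytes_per_elem=1) / 1e9
print(f"fp16 latent cache: {fp16_gb:.1f} GB")
print(f"int8 latent cache: {int8_gb:.1f} GB (payload only, before per-group scales)")
```

With these inputs the estimate is roughly 94 GB at fp16 and 47 GB at int8, in line with the ~95 GB and ~48 GB quoted above.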

Serve with int8 KV:

mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --kv-bits 8 --kv-group-size 64 --quantized-kv-start 4096 \
  --prompt "…"

# OpenAI-compatible server
mlx_lm.server --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --kv-bits 8 --quantized-kv-start 4096

The patched runtime quantizes only the MLA latent cache to int8 (the DSA indexer cache stays fp16) and dequantizes the latent on read inside MLA attention. At ≤128K context, fp16 KV is fine (a few GB); int8 is only needed to keep 1M comfortable.
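
For readers unfamiliar with quantize-on-write / dequantize-on-read, here is a minimal pure-Python sketch of group-wise affine int8 quantization applied to one 576-element latent vector. It illustrates the general technique only and is not the patched runtime's code; the function names and the synthetic latent values are made up for the example.

```python
import math

def quantize_group(values, bits=8):
    """Affine-quantize one group: returns (integer codes, scale, bias)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # constant group: avoid dividing by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def quantize(vec, group_size=64, bits=8):
    """Split `vec` into groups and quantize each one independently."""
    return [quantize_group(vec[i:i + group_size], bits)
            for i in range(0, len(vec), group_size)]

def dequantize(groups):
    """Reconstruct floats from (codes, scale, bias) groups."""
    out = []
    for codes, scale, bias in groups:
        out.extend(code * scale + bias for code in codes)
    return out

latent = [math.sin(0.1 * i) for i in range(512 + 64)]  # synthetic stand-in for a latent
groups = quantize(latent)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(latent, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

The reconstruction error of each element is bounded by half a quantization step of its group, which is why an 8-bit latent cache usually stays close to the fp16 one.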


How it compares

Against other public GLM-5.2 MLX builds, this one is the smallest on disk and the only one that runs a full 1M-token context within 512 GB with comfortable headroom. It spends its budget on context headroom and an int8 MLA-KV cache rather than purely on weight bits.

1M-context footprint vs other GLM-5.2 MLX builds: ours 376 GB fits with 136 GB free; mixed-3_6 455 GB tight; Q4.8-INF 542 GB and DQ4plus-q8 560 GB exceed 512 GB

| | this build | mixed-3_6 | Q4.8-INF | DQ4plus-q8 |
|---|---|---|---|---|
| effective bpw | ~3.5 | ~3.6 | ~4.8 | ~5.0 |
| on-disk | 328 GB | ~360 GB | 447 GB | 465 GB |
| DSA indexer | fp16 | 6-bit | (custom fmt) | 8-bit |
| int8 MLA-KV | yes | no | no | no |
| 1M ctx in 512 GB | ✓ 136 GB free | tight | ✗ over | ✗ over |

Per-component bit allocation was parsed from each build's config.json. The higher-bit builds (4.8–5.0 bpw) likely carry higher raw weight fidelity, but their footprint cannot hold a 1M KV cache on 512 GB. This build also keeps the DSA indexer at fp16 (the others quantize it) and bakes the indexer RoPE/eps long-context fixes into its config.json.
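
The 1M-context footprints quoted in this section can be reproduced by adding each build's on-disk size to the KV cache size for its cache precision. The sketch below assumes footprint = weights + latent KV cache (activations and OS overhead ignored) and that builds without int8 MLA-KV use the ~95 GB fp16 cache; all numbers come from this card.

```python
BUDGET_GB = 512
KV_GB = {"fp16": 95, "int8": 48}  # 1M-token latent cache sizes quoted above

# (on-disk GB, KV cache precision) per build, from the comparison table.
BUILDS = {
    "this build": (328, "int8"),
    "mixed-3_6": (360, "fp16"),
    "Q4.8-INF": (447, "fp16"),
    "DQ4plus-q8": (465, "fp16"),
}

def footprint_gb(weights_gb, kv_precision):
    """Approximate 1M-context footprint: weights plus latent KV cache."""
    return weights_gb + KV_GB[kv_precision]

footprints = {name: footprint_gb(*spec) for name, spec in BUILDS.items()}
for name, total in footprints.items():
    free = BUDGET_GB - total
    status = f"{free} GB free" if free >= 0 else f"over by {-free} GB"
    print(f"{name:12s} {total} GB  ({status})")
```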


Benchmarks

Reproduced with mlx_lm.evaluate (0-shot) and mlx_lm.perplexity (seq 2048, 50 samples, seed 123), against the author's earlier GLM-5.1 quant under the same harness and settings:

| Metric | GLM-5.1 · 2.7 bpw | GLM-5.2 · 2.56 bpw | GLM-5.2 · 3.5 bpw (this) |
|---|---|---|---|
| Perplexity (lower) | 4.165 | 3.850 | 3.766 |
| HellaSwag (acc_norm) | 0.606 | 0.636 | 0.610 |
| PIQA (acc) | 0.796 | 0.796 | 0.828 |
| WinoGrande (acc) | 0.660 | 0.708 | 0.766 |
| Generation (tok/s) | 18.35 | 22.87 | 21.29 |

Perplexity here is on allenai/tulu-3-sft-mixture (the mlx_lm.perplexity default) — a different corpus and method from the wikitext strided figure in Quality above, so values are not comparable across the two sections. Task accuracies use a 500-sample limit (CI ±0.02–0.04). GLM-5.1 is a different (older) base model, so cross-generation gaps reflect both the newer model and quantization.
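
To put the 500-sample limit in perspective, the sketch below computes a normal-approximation 95% half-width for an accuracy estimate. The card does not say which interval it used, so this is only one plausible reading of the quoted CI; the function name is hypothetical.

```python
import math

def accuracy_ci_halfwidth(acc, n, z=1.96):
    """Normal-approximation confidence half-width for a binomial accuracy."""
    return z * math.sqrt(acc * (1.0 - acc) / n)

hellaswag_hw = accuracy_ci_halfwidth(0.610, 500)
piqa_hw = accuracy_ci_halfwidth(0.828, 500)
print(f"HellaSwag ±{hellaswag_hw:.3f}, PIQA ±{piqa_hw:.3f}")
```

Under this approximation, the 0.026 HellaSwag gap between the 2.56 bpw and 3.5 bpw builds is smaller than the interval itself, so it should not be read as a regression.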


Correctness (verified vs the HF reference)

glm_moe_dsa needs fixes beyond the stock mlx-lm port; this build was produced with a patched fork and validated:

  • IndexShare — the DSA indexer runs only on "full" layers; "shared" layers reuse its top-k (index_topk_freq=4). The stock port built an indexer on every layer → missing-weights / wrong >2048-token output.
  • Indexer RoPE / eps — non-interleaved (half-split) RoPE + LayerNorm eps 1e-6, distinct from the interleaved main attention. Baked into config.json (indexer_rope_traditional=false, indexer_norm_eps=1e-6); post-RoPE q matches the reference to ~1e-7.
  • int8 MLA-KV: CacheList.to_quantized/offset + MLA dequant-on-read, so --kv-bits 8 actually engages for the latent cache (silently ignored on stock mlx-lm).
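
The RoPE distinction in the second bullet is easy to get wrong, so here is a small pure-Python sketch of the two layouts. Interleaved RoPE rotates consecutive pairs (x[2i], x[2i+1]), which MLX calls `traditional=True`; half-split RoPE rotates (x[i], x[i + d/2]). The base of 10000 and the 4-element vector are illustrative assumptions, not the model's values.

```python
import math

def rope(x, pos, interleaved, base=10000.0):
    """Apply rotary position embedding to one head vector `x` (even length)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        # Interleaved: pair (2i, 2i+1). Half-split: pair (i, i + half).
        a, b = (2 * i, 2 * i + 1) if interleaved else (i, i + half)
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out

x = [1.0, 2.0, 3.0, 4.0]
half_split = rope(x, pos=1, interleaved=False)  # indexer layout per this card
interleaved = rope(x, pos=1, interleaved=True)  # main-attention layout per this card
print(half_split)
print(interleaved)
```

Both layouts apply the same rotations and differ only in which elements are paired, so loading weights with the wrong flag produces plausible-looking but incorrect queries rather than an outright failure.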

Validation: full-attention logits match the HF reference to float precision at ≤index_topk context; long-context needle retrieval succeeds in the sparse-DSA regime; prose and code generation are coherent; int8-KV output stays coherent past the quantization threshold; peak memory was measured.


Usage

# requires mlx-lm with the glm_moe_dsa fixes + int8 MLA-KV patch
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --prompt "Write a quicksort in Python."

Hardware

Built for 512 GB Apple Silicon (M3 Ultra). Weights ~328 GB; with int8 KV a 1M-token context runs in ~376 GB, leaving comfortable headroom for the OS and other apps. For ≤256 GB machines, use the 2.56 bpw build instead.

Credits

  • Base model: Zhipu / Z.ai — GLM-5.2 (MIT).
  • MLX & mlx-lm: Apple ml-explore.
  • Mixed-precision quantization, glm_moe_dsa correctness fixes, and int8 MLA-KV cache support: Alis (avlp12).

Citation

Alis (avlp12) (2026). GLM-5.2-Alis-MLX-Dynamic-3.5bpw — 3.5 bpw MLX quantization of GLM-5.2. https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw
