Instructions to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw
```
Run Hermes
hermes
- MLX LM
How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
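The same chat-completions call can be made from Python with only the standard library. This is a minimal sketch, assuming the server is already running locally on port 8080 (the port the Pi and Hermes snippets above point at); `build_chat_request` and `chat` are illustrative helper names, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally and listening on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID, max_tokens=256):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Hello"))` should print the model's reply. Building the request separately from sending it makes the payload easy to inspect before any network call.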
GLM-5.2-Alis-MLX-Dynamic-3.5bpw
Apple Silicon (MLX) mixed-precision quantization of zai-org/GLM-5.2 — a 744B-parameter (~40B active) Mixture-of-Experts model with DeepSeek-V3.2-style MLA + DeepSeek Sparse Attention (DSA, glm_moe_dsa).
This build targets the "golden spot" for a 512 GB M3 Ultra: the best quality that still runs a full 1M-token context comfortably — ~3.5 bits/weight, served with int8 quantization of the MLA latent KV cache.
⚠️ Requires a patched `mlx-lm` with the `glm_moe_dsa` correctness fixes and int8 MLA-KV support. On stock `mlx-lm` it will not load/serve correctly, and `--kv-bits` is a silent no-op for this architecture (see Correctness and Long context).
Quality
A 2-bit (≤256 GB) build of this model is bit-starved; lifting the budget to 512 GB and spending 3.5 bpw is the single biggest quality lever.
| | this build (3.5 bpw) | 2.56 bpw build | Δ |
|---|---|---|---|
| wikitext-2 PPL (prose) | 2.946 | 4.340 | −32% |
| code PPL (mlx_lm src) | 1.893 | 2.203 | −14% |
Strided perplexity (ctx 2048 / stride 1024) from a fixed local harness — relative numbers for comparing builds, not directly comparable to perplexities other quantizers report on different corpora.
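For readers unfamiliar with strided evaluation, the sketch below shows one common way to lay out the windows for ctx 2048 / stride 1024. The author's harness is not published in this card, so this is an assumption about the general technique rather than its actual code; `strided_windows` and `perplexity` are illustrative names, and the one-token shift between inputs and targets is ignored for simplicity.

```python
import math


def strided_windows(n_tokens, ctx=2048, stride=1024):
    """Yield (start, end, n_scored) for each evaluation window.

    Each window covers tokens [start, end). Only its last n_scored tokens
    count toward the loss; the earlier tokens serve as context only, so
    every token is scored exactly once.
    """
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + ctx, n_tokens)
        yield start, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


def perplexity(token_nlls):
    """Perplexity is exp of the mean negative log-likelihood per scored token."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

After the first window, each scored token sees at least 1024 tokens of preceding context, which is why strided figures tend to come out lower than non-overlapping ones and are only meaningful relative to the same harness.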
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5.2 (744B total / ~40B active) |
| Bits/weight | ~3.535 (per-tensor mixed) |
| On-disk size | ~328 GB (306 GiB) |
| Peak memory (load + short gen) | ~331 GB |
| Peak memory (1M ctx, int8 KV) | ~376 GB |
| Format | MLX (Apple Silicon) |
| Context | up to 1M tokens (DSA sparse attention) |
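As a sanity check on the table, the on-disk size follows directly from the parameter count and the average bits per weight. A small sketch of that arithmetic, using only the two figures quoted above (it ignores any non-weight files in the repository):

```python
TOTAL_PARAMS = 744e9      # from the table: 744B total parameters
BITS_PER_WEIGHT = 3.535   # from the table: average bits per weight


def weights_size_bytes(params, bits_per_weight):
    """Total weight storage in bytes for a given average bit width."""
    return params * bits_per_weight / 8


size_bytes = weights_size_bytes(TOTAL_PARAMS, BITS_PER_WEIGHT)
size_gb = size_bytes / 1e9       # decimal gigabytes
size_gib = size_bytes / 2**30    # binary gibibytes

print(f"{size_gb:.1f} GB / {size_gib:.1f} GiB")
```

The result lands at roughly 328.8 GB, or about 306 GiB, in line with the on-disk size reported in the table.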
Recipe
Bits are allocated by sensitivity — cheap bits on the robust expert bulk, full precision on the discrete routing/sparse-attention control paths.
| Component | Bits | Why |
|---|---|---|
| Routed experts (gate/up/down) | 3-bit g64 | ~96% of params (the bulk) |
| Shared experts · MLA attn (incl. `kv_a`/`q_a`) · dense MLP | 4-bit g64 | on every token's critical path |
| Token embedding · LM head | 6-bit g64 | distribution-sensitive |
| Router (`mlp.gate`) | bf16 | drives discrete top-8 routing; never quantized |
| DSA lightning indexer | fp16 | drives discrete top-k selection |
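The "g64" suffix matters for the storage budget: group-wise affine quantization stores a scale and a bias per group on top of the integer codes. The sketch below estimates the effective bits per weight for each quantized tier in the table, assuming 16-bit scales and biases (an assumption; the card does not state their dtype). `effective_bpw` is an illustrative helper.

```python
def effective_bpw(bits, group_size, scale_bits=16, bias_bits=16):
    """Storage cost per weight for group-wise affine quantization:
    the integer code plus one scale and one bias shared by each group."""
    return bits + (scale_bits + bias_bits) / group_size


# Quantized tiers from the recipe table: (bits, group size).
TIERS = {
    "routed experts": (3, 64),
    "shared experts / MLA attn / dense MLP": (4, 64),
    "token embedding / LM head": (6, 64),
}

for name, (bits, group) in TIERS.items():
    print(f"{name}: {effective_bpw(bits, group):.2f} bits/weight")
```

Under that assumption a 3-bit g64 tensor costs 3.5 bits per weight, which is why a build dominated by 3-bit experts averages slightly above 3.5 bpw once the higher-precision components are included.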
Long context (1M) — int8 KV
GLM-5.2's MLA stores the compressed latent (kv_lora 512 + rope 64 per layer), so the 1M-token KV cache is small: ~95 GB at fp16, ~48 GB at int8. Quantizing the latent cache to int8 is what makes 1M fit comfortably at 3.5 bpw.
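The cache size can be estimated from the latent width alone. In the sketch below the 512 + 64 latent dimensions come from the paragraph above, while the layer count (78) and the reading of "1M" as 2^20 tokens are assumptions made for illustration; the card states neither. The int8 figure also ignores the per-group scales and biases, so the real int8 cache would be slightly larger.

```python
KV_LORA_RANK = 512   # compressed latent width (from the text above)
ROPE_DIM = 64        # decoupled RoPE key width (from the text above)

ASSUMED_LAYERS = 78      # assumption: not stated in this card
ASSUMED_TOKENS = 2**20   # assumption: "1M" read as 1,048,576 tokens


def latent_kv_bytes(tokens, layers, bytes_per_value):
    """Bytes needed to cache the MLA latent for every token and layer."""
    return tokens * layers * (KV_LORA_RANK + ROPE_DIM) * bytes_per_value


fp16_gb = latent_kv_bytes(ASSUMED_TOKENS, ASSUMED_LAYERS, 2) / 1e9
int8_gb = latent_kv_bytes(ASSUMED_TOKENS, ASSUMED_LAYERS, 1) / 1e9

print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```

With these assumed values the estimate comes out near 94 GB at fp16 and 47 GB at int8, close to the ~95 GB and ~48 GB quoted above. Each token costs only 576 values per layer, which is what keeps a 1M-token cache within reach at all.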
Serve with int8 KV:
```bash
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --kv-bits 8 --kv-group-size 64 --quantized-kv-start 4096 \
  --prompt "…"

# OpenAI-compatible server
mlx_lm.server --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --kv-bits 8 --quantized-kv-start 4096
```
The patched runtime quantizes only the MLA latent cache to int8 (the DSA indexer cache stays fp16) and dequantizes the latent on read inside MLA attention. At ≤128K context, fp16 KV is fine (a few GB); int8 is only needed to keep 1M comfortable.
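To make "quantize to int8, dequantize on read" concrete, here is a simplified pure-Python sketch of group-wise affine quantization over one 576-value latent vector with groups of 64. It illustrates the general technique only; it is not the patched runtime's code, and all function names are hypothetical.

```python
import math


def quantize_group(values, bits=8):
    """Map one group of floats to integer codes plus a (scale, bias) pair."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def quantize(values, group_size=64, bits=8):
    """Quantize a vector group by group (done once, when the cache is written)."""
    return [
        quantize_group(values[i:i + group_size], bits)
        for i in range(0, len(values), group_size)
    ]


def dequantize(groups):
    """Rebuild approximate floats (done on every read inside attention)."""
    out = []
    for codes, scale, bias in groups:
        out.extend(c * scale + bias for c in codes)
    return out


# One synthetic latent vector: 512 latent + 64 RoPE values.
latent = [math.sin(0.37 * i) for i in range(576)]
restored = dequantize(quantize(latent))
max_err = max(abs(a - b) for a, b in zip(latent, restored))
```

The rounding error per value is bounded by half of that group's scale, so smaller groups with a narrow value range lose less precision. Keeping the indexer cache at fp16, as described above, avoids perturbing the discrete top-k selection with this kind of error.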
How it compares
Against other public GLM-5.2 MLX builds, this one is the smallest on disk and the only one with comfortable headroom for a full 1M-token context within 512 GB: it spends its budget on context headroom and an int8 MLA-KV cache instead of purely on weight bits.
| | this build | mixed-3_6 | Q4.8-INF | DQ4plus-q8 |
|---|---|---|---|---|
| effective bpw | ~3.5 | ~3.6 | ~4.8 | ~5.0 |
| on-disk | 328 GB | ~360 GB | 447 GB | 465 GB |
| DSA indexer | fp16 | 6-bit | (custom fmt) | 8-bit |
| int8 MLA-KV | yes | no | no | no |
| 1M ctx in 512 GB | ✓ 136 GB free | tight | ✗ over | ✗ over |
Per-component bit allocation was parsed from each build's config.json. The higher-bit builds (4.8–5.0 bpw) likely carry higher raw weight fidelity, but their footprint cannot hold a 1M KV cache on 512 GB. This build also keeps the DSA indexer at fp16 (the others quantize it) and bakes the indexer RoPE/eps long-context fixes.
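The last table row can be roughly reproduced by adding each build's on-disk size to the 1M-token KV cache size from the Long context section. This sketch assumes the other builds run an fp16 latent cache (~95 GB), since the table lists them without int8 MLA-KV, and it ignores activations and the indexer cache, so the real margins are somewhat tighter.

```python
RAM_GB = 512
KV_FP16_GB = 95   # 1M-token latent cache at fp16 (from "Long context")
KV_INT8_GB = 48   # same cache at int8

# (on-disk weights in GB, KV cache in GB) per build, from the table above.
BUILDS = {
    "this build": (328, KV_INT8_GB),
    "mixed-3_6": (360, KV_FP16_GB),
    "Q4.8-INF": (447, KV_FP16_GB),
    "DQ4plus-q8": (465, KV_FP16_GB),
}


def headroom_gb(weights_gb, kv_gb, ram_gb=RAM_GB):
    """Memory left after weights and KV cache; negative means it does not fit."""
    return ram_gb - (weights_gb + kv_gb)


for name, (weights, kv) in BUILDS.items():
    print(f"{name}: {headroom_gb(weights, kv):+d} GB")
```

This gives +136 GB for this build, +57 GB for mixed-3_6, and negative values for the two larger builds, matching the "136 GB free / tight / over / over" row.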
Benchmarks
Reproduced with mlx_lm.evaluate (0-shot) and mlx_lm.perplexity (seq 2048, 50 samples, seed 123), against the author's earlier GLM-5.1 quant under the same harness and settings:
| | GLM-5.1 · 2.7 bpw | GLM-5.2 · 2.56 bpw | GLM-5.2 · 3.5 bpw (this) |
|---|---|---|---|
| Perplexity (lower) | 4.165 | 3.850 | 3.766 |
| HellaSwag (acc_norm) | 0.606 | 0.636 | 0.610 |
| PIQA (acc) | 0.796 | 0.796 | 0.828 |
| WinoGrande (acc) | 0.660 | 0.708 | 0.766 |
| Generation (tok/s) | 18.35 | 22.87 | 21.29 |
Perplexity here is on allenai/tulu-3-sft-mixture (the mlx_lm.perplexity default) — a different corpus and method from the wikitext strided figure in Quality above, so values are not comparable across the two sections. Task accuracies use a 500-sample limit (CI ±0.02–0.04). GLM-5.1 is a different (older) base model, so cross-generation gaps reflect both the newer model and quantization.
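The quoted ±0.02–0.04 range is consistent with ordinary binomial sampling error at 500 samples. A short sketch of that calculation, using the normal approximation (an assumption about how the range was derived; the card does not say):

```python
import math


def binomial_stderr(acc, n):
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(acc * (1 - acc) / n)


def ci95_halfwidth(acc, n):
    """Half-width of the 95% confidence interval (normal approximation)."""
    return 1.96 * binomial_stderr(acc, n)


N_SAMPLES = 500
for task, acc in [("HellaSwag", 0.610), ("PIQA", 0.828), ("WinoGrande", 0.766)]:
    print(f"{task}: {acc:.3f} ± {ci95_halfwidth(acc, N_SAMPLES):.3f}")
```

One standard error is about 0.02 at these accuracies and the 95% half-width is about 0.03–0.04. The HellaSwag gap between the two GLM-5.2 builds (0.636 vs 0.610) is smaller than that half-width, so it should not be read as a clear regression.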
Correctness (verified vs the HF reference)
glm_moe_dsa needs fixes beyond the stock mlx-lm port; this build was produced with a patched fork and validated:
- IndexShare: the DSA indexer runs only on "full" layers; "shared" layers reuse its top-k (`index_topk_freq=4`). The stock port built an indexer on every layer → missing-weights / wrong >2048-token output.
- Indexer RoPE / eps: non-interleaved (half-split) RoPE + LayerNorm eps 1e-6, distinct from the interleaved main attention. Baked into `config.json` (`indexer_rope_traditional=false`, `indexer_norm_eps=1e-6`); post-RoPE `q` matches the reference to ~1e-7.
- int8 MLA-KV: `CacheList.to_quantized`/`offset` + MLA dequant-on-read, so `--kv-bits 8` actually engages for the latent cache (silently ignored on stock `mlx-lm`).
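The indexer RoPE fix comes down to which dimensions are paired for rotation. The sketch below contrasts the two layouts on a single vector: interleaved pairs `(x[2i], x[2i+1])`, and half-split pairs `(x[i], x[i + d/2])`. It is a minimal illustration, not the fork's implementation; the base frequency of 10000 is an assumed default and `rope` is an illustrative name.

```python
import math


def rope(x, pos, base=10000.0, interleaved=False):
    """Apply rotary position embedding to one vector at position pos.

    interleaved=True rotates consecutive pairs (x[2i], x[2i+1]);
    interleaved=False rotates half-split pairs (x[i], x[i + d/2]).
    """
    d = len(x)
    half = d // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        a, b = (2 * i, 2 * i + 1) if interleaved else (i, i + half)
        c, s = math.cos(theta), math.sin(theta)
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out


q = [1.0, 2.0, 3.0, 4.0]
half_split = rope(q, pos=3, interleaved=False)
interleaved = rope(q, pos=3, interleaved=True)
```

Both variants preserve the vector's norm and both are the identity at position 0, so a mismatch is invisible on trivial inputs and only shows up as wrong scores at real positions. That is consistent with the symptom described above, where errors surface in the sparse long-context regime.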
Validation: full-attention logits match the HF reference to float precision at ≤index_topk context; long-context needle retrieval in the sparse-DSA regime; coherent prose + code generation; int8-KV verified coherent past the quantization threshold; peak memory measured.
Usage
```bash
# requires mlx-lm with the glm_moe_dsa fixes + int8 MLA-KV patch
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --prompt "Write a quicksort in Python."
```
Hardware
Built for 512 GB Apple Silicon (M3 Ultra). Weights are ~328 GB; with int8 KV, a 1M-token context runs in ~376 GB, leaving ~136 GB for the OS and other apps. For ≤256 GB machines, use the 2.56 bpw build instead.
Credits
- Base model: Zhipu / Z.ai — GLM-5.2 (MIT).
- MLX & mlx-lm: Apple `ml-explore`.
- Mixed-precision quantization, `glm_moe_dsa` correctness fixes, and int8 MLA-KV cache support: Alis (avlp12).
Citation
Alis (avlp12) (2026). GLM-5.2-Alis-MLX-Dynamic-3.5bpw — 3.5 bpw MLX quantization of GLM-5.2. https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw


