Instructions for using avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with libraries and local apps.

Libraries

How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"
        }
      ]
    }
  }
}
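If ~/.pi/agent/models.json already exists, pasting the block above over it would drop any providers configured earlier. A minimal Python sketch of merging the entry instead, assuming the file uses the `providers` layout shown above; `add_mlx_provider` is a hypothetical helper written for this example, not part of Pi:

```python
import json
from pathlib import Path

MODEL_ID = "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"


def add_mlx_provider(path: Path) -> dict:
    """Merge the mlx-lm provider entry into models.json, keeping other providers."""
    # Start from the existing config if there is one, otherwise from an empty dict.
    config = json.loads(path.read_text()) if path.exists() else {}
    providers = config.setdefault("providers", {})
    # Same values as the JSON shown above; only the "mlx-lm" key is replaced.
    providers["mlx-lm"] = {
        "baseUrl": "http://localhost:8080/v1",
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": MODEL_ID}],
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config

# Usage (writes to the path named above):
# add_mlx_provider(Path.home() / ".pi" / "agent" / "models.json")
```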

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw

Run Hermes

hermes

MLX LM

How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
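The same request can be sent from Python with only the standard library. This is a sketch under two assumptions: the server listens on port 8080 (the port used in the Pi and Hermes configurations above), and the reply follows the usual OpenAI chat-completions shape. `build_chat_request` and `chat` are names introduced here for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: default mlx_lm.server port
MODEL_ID = "avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat-completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # Assumed response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]

# Usage, with the server running:
# print(chat("Hello"))
```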

GLM-5.2-Alis-MLX-Dynamic-3.5bpw

Apple Silicon (MLX) mixed-precision quantization of zai-org/GLM-5.2 — a 744B-parameter (~40B active) Mixture-of-Experts model with DeepSeek-V3.2-style MLA + DeepSeek Sparse Attention (DSA, glm_moe_dsa).

This build targets the "golden spot" for a 512 GB M3 Ultra: the best quality that still runs a full 1M-token context comfortably — ~3.5 bits/weight, served with int8 quantization of the MLA latent KV cache.

⚠️ Requires a patched mlx-lm with the glm_moe_dsa correctness fixes and int8 MLA-KV support. On stock mlx-lm it will not load/serve correctly, and --kv-bits is a silent no-op for this architecture (see Correctness and Long context).

Quality

A 2-bit (≤256 GB) build of this model is bit-starved; lifting the budget to 512 GB and spending 3.5 bpw is the single biggest quality lever.

	this build (3.5 bpw)	2.56 bpw build	Δ
wikitext-2 PPL (prose)	2.946	4.340	−32%
code PPL (mlx_lm src)	1.893	2.203	−14%

Strided perplexity (ctx 2048 / stride 1024) from a fixed local harness — relative numbers for comparing builds, not directly comparable to perplexities other quantizers report on different corpora.
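The author's harness is not shown, so the following is only a sketch of the common sliding-window scheme that "ctx 2048 / stride 1024" usually refers to: each window sees up to 2048 tokens of context, but only the tokens not already scored by the previous window contribute to the loss. `strided_windows` and `perplexity` are hypothetical names.

```python
import math


def strided_windows(n_tokens, ctx=2048, stride=1024):
    """Return (begin, end, score_from) triples.

    The model is run on tokens[begin:end]; only tokens[score_from:end]
    are scored, so every token is counted exactly once. (In practice the
    very first token has no context and is usually skipped.)
    """
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows


def perplexity(total_nll, n_scored):
    """Perplexity from the summed negative log-likelihood (in nats)."""
    return math.exp(total_nll / n_scored)


for begin, end, score_from in strided_windows(5000):
    print(f"run [{begin}, {end}), score [{score_from}, {end})")
```

After the first window, every scored token has at least 1024 tokens of preceding context, which is why strided figures tend to come out lower than those from non-overlapping chunks.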


Base model	zai-org/GLM-5.2 (744B total / ~40B active)
Bits/weight	~3.535 (per-tensor mixed)
On-disk size	~328 GB (306 GiB)
Peak memory (load + short gen)	~331 GB
Peak memory (1M ctx, int8 KV)	~376 GB
Format	MLX (Apple Silicon)
Context	up to 1M tokens (DSA sparse attention)
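As a sanity check, the on-disk size follows from the parameter count and the average bits per weight. The sketch below uses the 744B and ~3.535 bpw figures from the table; small differences from the listed size are expected, since both inputs are rounded.

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Approximate weight size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def gb_to_gib(gb):
    """Convert decimal GB (10^9 bytes) to binary GiB (2^30 bytes)."""
    return gb * 1e9 / 2**30


size_gb = quantized_size_gb(744e9, 3.535)
print(f"{size_gb:.1f} GB = {gb_to_gib(size_gb):.1f} GiB")
```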

Recipe

Bits are allocated by sensitivity — cheap bits on the robust expert bulk, full precision on the discrete routing/sparse-attention control paths.

Component	Bits	Why
Routed experts (gate/up/down)	3-bit g64	~96% of params — the bulk
Shared experts · MLA attn (incl. `kv_a`/`q_a`) · dense MLP	4-bit g64	on every token's critical path
Token embedding · LM head	6-bit g64	distribution-sensitive
Router (`mlp.gate`)	bf16	drives discrete top-8 routing — never quantized
DSA lightning indexer	fp16	drives discrete top-k selection
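The ~3.5 bpw average can be roughly reproduced from the table. The sketch assumes group-wise affine quantization stores a 16-bit scale and a 16-bit bias per group of 64 weights, adding 0.5 bits per weight. The 96% / 4% split lumps every non-expert tensor at 4-bit and is an illustrative simplification, not the exact per-tensor breakdown.

```python
def effective_bits(bits, group_size=64, meta_bits=32):
    """Stored bits per weight: payload bits plus per-group scale and bias.

    meta_bits=32 assumes one 16-bit scale and one 16-bit bias per group.
    """
    return bits + meta_bits / group_size


def mixed_bpw(mix):
    """Weighted average over (fraction_of_params, stored_bits_per_weight) pairs."""
    total = sum(fraction for fraction, _ in mix)
    return sum(fraction * bits for fraction, bits in mix) / total


mix = [
    (0.96, effective_bits(3)),  # routed experts, ~96% of params (from the table)
    (0.04, effective_bits(4)),  # assumption: everything else lumped at 4-bit g64
]
print(f"approx. {mixed_bpw(mix):.2f} bits/weight")
```

The estimate lands close to the reported ~3.535 because the 3-bit experts dominate; the 6-bit embeddings and 16-bit router/indexer are too small a share to move the average much.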

Long context (1M) — int8 KV

GLM-5.2's MLA stores the compressed latent (kv_lora 512 + rope 64 per layer), so the 1M-token KV cache is small: ~95 GB at fp16, ~48 GB at int8. Quantizing the latent cache to int8 is what makes 1M fit comfortably at 3.5 bpw.

Memory budget: 3.5 bpw + int8 KV = 376 GB, which fits in 512 GB with ~136 GB free; fp16 KV is tighter; 4.5 bpw overflows.
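The cache sizes quoted above can be approximated from the latent width alone. In this sketch the layer count (78) and the reading of "1M" as 2^20 tokens are assumptions made for illustration, not values taken from config.json. It also ignores the fp16 DSA indexer cache and the per-group scales that int8 storage adds.

```python
def mla_kv_cache_gb(n_tokens, n_layers, latent_dim=512 + 64, bytes_per_value=2.0):
    """Size of the MLA latent cache in decimal GB.

    latent_dim is kv_lora (512) + rope (64) values per token per layer.
    """
    return n_tokens * n_layers * latent_dim * bytes_per_value / 1e9


N_TOKENS = 2**20  # assumption: "1M" read as 1,048,576 tokens
N_LAYERS = 78     # assumption: illustrative layer count, not read from config.json

fp16_gb = mla_kv_cache_gb(N_TOKENS, N_LAYERS, bytes_per_value=2.0)
int8_gb = mla_kv_cache_gb(N_TOKENS, N_LAYERS, bytes_per_value=1.0)
print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```

Under these assumptions the results fall slightly below the ~95 GB and ~48 GB figures, which is consistent with the omitted overheads.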

Serve with int8 KV:

mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --kv-bits 8 --kv-group-size 64 --quantized-kv-start 4096 \
  --prompt "…"

# OpenAI-compatible server
mlx_lm.server --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --kv-bits 8 --quantized-kv-start 4096

The patched runtime quantizes only the MLA latent cache to int8 (the DSA indexer cache stays fp16) and dequantizes the latent on read inside MLA attention. At ≤128K context, fp16 KV is fine (a few GB); int8 is only needed to keep 1M comfortable.
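To make "quantize to int8, dequantize on read" concrete, here is a generic pure-Python sketch of group-wise affine 8-bit quantization with a group size of 64. It illustrates the general technique only; it is not the patched runtime's code, and all function names are hypothetical.

```python
import math


def quantize_group(values):
    """Affine-quantize one group of floats to integer codes in 0..255."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def quantize(vec, group_size=64):
    """Split a vector into groups and quantize each one independently."""
    return [quantize_group(vec[i:i + group_size])
            for i in range(0, len(vec), group_size)]


def dequantize(groups):
    """Reconstruct floats from (codes, scale, bias) triples."""
    out = []
    for codes, scale, bias in groups:
        out.extend(c * scale + bias for c in codes)
    return out


# One token's latent: 512 + 64 = 576 values, i.e. 9 groups of 64.
latent = [math.sin(i) for i in range(576)]
groups = quantize(latent)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(latent, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

The reconstruction error per value is bounded by half the group's scale, so groups with a narrow value range lose very little precision.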

How it compares

Against other public GLM-5.2 MLX builds, this one is the smallest on disk and the only one that runs a full 1M-token context comfortably within 512 GB (the mixed-3_6 build is tight; the others go over): it spends its budget on context headroom and an int8 MLA-KV cache instead of purely on weight bits.

	this build	mixed-3_6	Q4.8-INF	DQ4plus-q8
effective bpw	~3.5	~3.6	~4.8	~5.0
on-disk	328 GB	~360 GB	447 GB	465 GB
DSA indexer	fp16	6-bit	(custom fmt)	8-bit
int8 MLA-KV	yes	no	no	no
1M ctx in 512 GB	✓ 136 GB free	tight	✗ over	✗ over

Per-component bit allocation was parsed from each build's config.json. The higher-bit builds (4.8–5.0 bpw) likely carry higher raw weight fidelity, but their footprint cannot hold a 1M-token KV cache on 512 GB. This build also keeps the DSA indexer at fp16 (the others quantize it) and bakes in the indexer RoPE/eps long-context fixes.
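The "1M ctx in 512 GB" row can be checked with simple arithmetic. The sketch assumes builds without int8 MLA-KV hold the 1M-token latent cache at fp16 (~95 GB, versus ~48 GB at int8, as quoted in the long-context section). It counts only weights plus KV cache, ignoring activations and the OS.

```python
RAM_GB = 512
KV_1M_GB = {"fp16": 95, "int8": 48}  # 1M-token latent cache sizes quoted in this card

# On-disk sizes from the table; KV format is an assumption for the other builds.
builds = {
    "this build": (328, "int8"),
    "mixed-3_6": (360, "fp16"),
    "Q4.8-INF": (447, "fp16"),
    "DQ4plus-q8": (465, "fp16"),
}


def headroom_gb(weights_gb, kv_format):
    """Free memory left after weights and a 1M-token KV cache (negative = over)."""
    return RAM_GB - (weights_gb + KV_1M_GB[kv_format])


for name, (weights_gb, kv_format) in builds.items():
    print(f"{name:12s} {headroom_gb(weights_gb, kv_format):+5d} GB")
```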

Benchmarks

Reproduced with mlx_lm.evaluate (0-shot) and mlx_lm.perplexity (seq 2048, 50 samples, seed 123), against the author's earlier GLM-5.1 quant under the same harness and settings:

	GLM-5.1 · 2.7 bpw	GLM-5.2 · 2.56 bpw	GLM-5.2 · 3.5 bpw (this)
Perplexity (lower)	4.165	3.850	3.766
HellaSwag (acc_norm)	0.606	0.636	0.610
PIQA (acc)	0.796	0.796	0.828
WinoGrande (acc)	0.660	0.708	0.766
Generation (tok/s)	18.35	22.87	21.29

Perplexity here is on allenai/tulu-3-sft-mixture (the mlx_lm.perplexity default) — a different corpus and method from the wikitext strided figure in Quality above, so values are not comparable across the two sections. Task accuracies use a 500-sample limit (CI ±0.02–0.04). GLM-5.1 is a different (older) base model, so cross-generation gaps reflect both the newer model and quantization.
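The size of the quoted uncertainty can be reproduced with a simple binomial model of accuracy over 500 samples. This is a sketch under that assumption; the card does not say which interval the harness reports.

```python
import math


def accuracy_std_error(acc, n):
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(acc * (1.0 - acc) / n)


def ci95_half_width(acc, n):
    """Half-width of a normal-approximation 95% confidence interval."""
    return 1.96 * accuracy_std_error(acc, n)


# Accuracies from the "3.5 bpw (this)" column.
for task, acc in [("HellaSwag", 0.610), ("PIQA", 0.828), ("WinoGrande", 0.766)]:
    se = accuracy_std_error(acc, 500)
    print(f"{task:10s} SE {se:.3f}, 95% CI +/- {ci95_half_width(acc, 500):.3f}")
```

With errors of this size, gaps of two or three points between columns (for example on HellaSwag) may be within noise.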

Correctness (verified vs the HF reference)

glm_moe_dsa needs fixes beyond the stock mlx-lm port; this build was produced with a patched fork and validated:

IndexShare — the DSA indexer runs only on "full" layers; "shared" layers reuse its top-k (index_topk_freq=4). The stock port built an indexer on every layer → missing-weights / wrong >2048-token output.
Indexer RoPE / eps — non-interleaved (half-split) RoPE + LayerNorm eps 1e-6, distinct from the interleaved main attention. Baked into config.json (indexer_rope_traditional=false, indexer_norm_eps=1e-6); post-RoPE q matches the reference to ~1e-7.
int8 MLA-KV — CacheList.to_quantized/offset + MLA dequant-on-read, so --kv-bits 8 actually engages for the latent cache (silently ignored on stock mlx-lm).
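The RoPE fix concerns which elements of a vector are rotated together. The pure-Python sketch below contrasts the interleaved ("traditional") pairing with the half-split pairing. The base of 10000 is an assumed default, and the functions are illustrative rather than taken from mlx-lm.

```python
import math


def rope_pairs(dim, traditional):
    """Index pairs that are rotated together.

    traditional=True  -> interleaved: (0, 1), (2, 3), ...
    traditional=False -> half-split:  (0, dim/2), (1, dim/2 + 1), ...
    """
    half = dim // 2
    if traditional:
        return [(2 * i, 2 * i + 1) for i in range(half)]
    return [(i, i + half) for i in range(half)]


def apply_rope(x, pos, traditional, base=10000.0):
    """Rotate each pair by a position-dependent angle (base is an assumption)."""
    dim = len(x)
    out = list(x)
    for i, (a, b) in enumerate(rope_pairs(dim, traditional)):
        theta = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
interleaved = apply_rope(x, pos=1, traditional=True)
half_split = apply_rope(x, pos=1, traditional=False)
print(rope_pairs(8, True), rope_pairs(8, False))
```

Both conventions preserve the vector's norm but give different outputs for the same input, so using the wrong one changes the indexer's scores without producing any error.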

Validation: full-attention logits match the HF reference to float precision at ≤index_topk context; long-context needle retrieval in the sparse-DSA regime; coherent prose + code generation; int8-KV verified coherent past the quantization threshold; peak memory measured.

Usage

# requires mlx-lm with the glm_moe_dsa fixes + int8 MLA-KV patch
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw \
  --prompt "Write a quicksort in Python."

Hardware

Built for 512 GB Apple Silicon (M3 Ultra). Weights ~328 GB; with int8 KV a 1M-token context runs in ~376 GB, leaving comfortable headroom for the OS and other apps. For ≤256 GB machines, use the 2.56 bpw build instead.

Credits

Base model: Zhipu / Z.ai — GLM-5.2 (MIT).
MLX & mlx-lm: Apple ml-explore.
Mixed-precision quantization, glm_moe_dsa correctness fixes, and int8 MLA-KV cache support: Alis (avlp12).

Citation

Alis (avlp12) (2026). GLM-5.2-Alis-MLX-Dynamic-3.5bpw — 3.5 bpw MLX quantization of GLM-5.2. https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw

Safetensors

Model size	743B params
Tensor type	BF16 · U32 · F32
Base model	zai-org/GLM-5.2