GLM-5.1 GGUF: Quantized by BatiAI

IQ3_XXS / IQ4_XS quantization of zai-org/GLM-5.1 (744B total / 40B active MoE). Quantized directly from official Z.AI weights by BatiAI.

Why GLM-5.1?

  • 744B parameters (40B active): frontier MoE with Deep Sparse Attention (DSA)
  • #1 open-source on SWE-Bench Pro: leads the open-weight pack on agentic coding
  • 256 experts per layer (top-8 routing + DSA indexer): extreme sparsity
  • 79 transformer blocks with hybrid attention/FFN routing
  • MIT license: fully permissive for commercial use, fine-tuning, redistribution
  • Released by Z.AI / Zhipu AI: same lineage as ChatGLM / GLM-4

Quick Start

# IQ3_XXS (smaller, 273GB; needs 320GB+ unified RAM)
hf download batiai/GLM-5.1-GGUF --include "*IQ3_XXS*"

# IQ4_XS (recommended balance, 376GB; needs 448GB+ unified RAM)
hf download batiai/GLM-5.1-GGUF --include "*IQ4_XS*"

Available Quantizations

Quant   | Total Size | Shards     | Min RAM | Target Hardware
IQ3_XXS | 273 GB     | 7 × ~40 GB | ~320 GB | M3 Ultra 512GB / H100 node
IQ4_XS  | 376 GB     | 9 × ~42 GB | ~448 GB | M3 Ultra 512GB / 8× A100 80GB
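
As a rough sanity check on the sizes above, a file size can be converted to an average bits-per-weight figure. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and the 744B total parameter count quoted in this card. The helper name bits_per_weight is illustrative and not part of any BatiAI tooling.

```shell
# Hypothetical helper: average bits per weight from a file size in GB and a
# parameter count in billions. Assumes 1 GB = 10^9 bytes.
bits_per_weight() {
  awk -v size_gb="$1" -v params_b="$2" \
    'BEGIN { printf "%.2f\n", size_gb * 8 / params_b }'
}

bits_per_weight 273 744   # IQ3_XXS
bits_per_weight 376 744   # IQ4_XS
```

The result is an average over the whole file, so it also includes the higher-precision fallback tensors and metadata. Treat it as an estimate, not the nominal bit width of the quant type.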

โš ๏ธ Not for consumer Mac โ€” workstation / server class. 16โ€“128GB Macs should use batiai/qwen3.6-35b or batiai/minimax-m2.7. Mac Studio M2 Ultra 192GB users should use batiai/kimi-k2.6:iq3 (394GB but lighter active MoE) โ€” GLM-5.1 is denser at 40B active.

Hardware Reality Check

Your System           | IQ3_XXS (273GB)          | IQ4_XS (376GB)
Mac 128GB             | ❌ Won't fit             | ❌
Mac 192GB             | ⚠️ Heavy swap (unusable) | ❌
Mac 256GB             | ⚠️ Tight (~50GB swap)    | ❌
Mac 384GB             | ✅ Usable                | ⚠️ Tight
Mac M3 Ultra 512GB    | ✅ Comfortable           | ✅ Usable
2× M3 Ultra (cluster) | ✅ Fast                  | ✅ Fast
8× A100 80GB (640GB)  | ✅ Fast                  | ✅ Fast
H100 node             | ✅ Fast                  | ✅ Fast

These numbers are based on the MoE activation pattern: 40B active params × 2 bytes (Q4 active) ≈ 80GB runtime, plus shard buffers and KV cache (32K ctx ≈ 8-12GB). Going below the minimum RAM forces SSD paging, which destroys throughput.
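
The minimum-RAM figures from the Available Quantizations section can be turned into a simple selection rule. This is a minimal sketch that assumes only those two thresholds (~448 GB for IQ4_XS, ~320 GB for IQ3_XXS); pick_quant is a hypothetical helper name, and the rule ignores context length and other processes competing for memory.

```shell
# Hypothetical helper: choose a quant from the amount of unified RAM / VRAM
# in GB, using the "Min RAM" figures listed in this card.
pick_quant() {
  ram_gb="$1"
  if [ "$ram_gb" -ge 448 ]; then
    echo "IQ4_XS"
  elif [ "$ram_gb" -ge 320 ]; then
    echo "IQ3_XXS"
  else
    echo "none"     # below the minimum for either quant
  fi
}

# Example (not executed here):
#   hf download batiai/GLM-5.1-GGUF --include "*$(pick_quant 512)*"
```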

Special Engineering Notes

GLM-5.1 uses Deep Sparse Attention (DSA): a per-layer "indexer" tensor selects the top-K key positions for sparse attention. This required two fixes during quantization:

  1. DSA indexer tensors are not covered by the imatrix: fixed with a --tensor-type indexer=q5_k override (~600 MB overhead total)
  2. Last block (blk.78) imatrix gap: bati.cpp llama-imatrix does not record the final block, so a --tensor-type blk.78=q5_k workaround is applied

Both flags are baked into our quantization pipeline (scripts/runtime/glm-pipeline.sh). The fallback Q5_K layer adds < 0.2% to file size but prevents low-bit IQ-quants from bailing on missing imatrix data.

79 of 1809 tensors used fallback quantization: these are the indexer and last-block weights, kept at higher precision.
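
The pipeline script itself is not reproduced in this card. The following is a minimal dry-run sketch of how the two overrides could be passed, assuming bati.cpp keeps mainline llama.cpp's llama-quantize argument order (options, input, output, type). All file names are placeholders and glm_quantize_cmd is a hypothetical helper. If the input is an already-quantized intermediate such as Q8_0, mainline llama-quantize typically also needs --allow-requantize; whether the BatiAI pipeline passes it is not stated here.

```shell
# Dry run: print the llama-quantize invocation instead of executing it.
# Arguments: input GGUF, imatrix file, output GGUF, target quant type.
# The two --tensor-type overrides are the ones described in this section.
glm_quantize_cmd() {
  src_gguf="$1"; imatrix="$2"; out_gguf="$3"; qtype="$4"
  echo "llama-quantize --imatrix $imatrix" \
       "--tensor-type indexer=q5_k --tensor-type blk.78=q5_k" \
       "$src_gguf $out_gguf $qtype"
}

glm_quantize_cmd glm51-source.gguf imatrix.dat glm51-IQ4_XS.gguf IQ4_XS
```

Printing the command first makes it easy to confirm that both overrides are present before starting a long quantization run.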

What BatiAI's Quantization Delivers

Aspect                 | BatiAI                                                                       | Typical 3rd-party
Source                 | Direct from official Z.AI weights                                            | Often re-quantized from other GGUFs
Quantization flow      | safetensors → Q8_0 → IQ3_XXS/IQ4_XS with imatrix (wikitext-2, 200 chunks)    | Varies
imatrix                | ✅ 200 chunks (quality saturation)                                           | Often skipped or fewer chunks
DSA indexer handling   | ✅ Q5_K override documented                                                  | Often unaddressed → garbage low-bit
Last-block imatrix gap | ✅ Workaround applied                                                        | Often causes bail-out or quality loss
BatiAI signature       | ✅ general.author=BatiAI, general.url=https://flow.bati.ai                   | ✗

Model Comparison โ€” BatiAI Lineup

Your Hardware        | Best BatiAI Model           | Size
16GB Mac             | batiai/gemma4-e4b:q4        | 5GB
24GB Mac             | batiai/gemma4-26b:iq4       | 15GB
48GB Mac             | batiai/qwen3.6-35b:iq4      | 22GB
96GB Mac             | batiai/qwen3.6-35b:q6       | 29GB
128GB Mac            | batiai/minimax-m2.7:iq3     | 82GB
192GB Mac Studio     | batiai/kimi-k2.6:iq3        | 394GB (paged)
M3 Ultra 512GB       | batiai/GLM-5.1:iq4 ⬅ here   | 376GB
M3 Ultra 512GB (alt) | batiai/kimi-k2.6:iq4        | 546GB (heavy swap)

GLM-5.1 IQ4_XS at 376 GB is the largest model in this lineup that runs on a single M3 Ultra 512GB without SSD swap. Kimi K2.6 IQ4 (546GB) would page heavily on the same machine.

Benchmarks (source model)

Benchmark     | GLM-5.1                   | Notes
SWE-Bench Pro | #1 open-source            | Beats Kimi K2.6 (58.6) on coding tasks
HumanEval     | High                      | Strong code generation
MMLU          | Strong                    | General reasoning
Context       | 32K (extendable via YaRN) |
Tool use      | ✅ Native                 | Function calling supported

Numbers are from Z.AI's official report. Validation that the quantized models preserve these results on a Mac M3 Ultra is pending (bench.sh on target hardware).

Technical Details

  • Original Model: zai-org/GLM-5.1
  • Architecture: GlmMoeDsaForCausalLM, 744B total / 40B active, 79 blocks (3 dense + 76 MoE), 256 experts (top-8), DSA hybrid attention
  • Original storage: BF16/FP8 mix (~1.4 TB safetensors)
  • License: MIT
  • Quantized with: bati.cpp v0.1.2 (BatiAI's llama.cpp fork, needed for the DSA architecture)
  • Calibration: wikitext-2-raw, 200 chunks (quality saturation)
  • imatrix overrides: --tensor-type indexer=q5_k --tensor-type blk.78=q5_k
  • Quantized by: BatiAI

Usage

llama.cpp / bati.cpp

GLM-5.1 currently requires bati.cpp (BatiAI's llama.cpp fork): mainline ggml-org/llama.cpp does not yet support the glm-dsa architecture. We will switch to mainline once support lands.

git clone https://github.com/batiai/bati.cpp.git
cd bati.cpp
cmake -B build -DGGML_METAL=ON   # macOS
# or: cmake -B build -DGGML_CUDA=ON   # Linux
cmake --build build -j --target llama-cli

hf download batiai/GLM-5.1-GGUF --include "*IQ4_XS*" --local-dir ./glm51

build/bin/llama-cli -m ./glm51/zai-org-GLM-5.1-IQ4_XS-00001-of-00009.gguf \
    -p "Your prompt" \
    --ctx-size 32768 \
    --n-gpu-layers 99
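
Because llama-cli is pointed at the first shard only, it can be worth confirming that every shard downloaded intact before attempting a load. This is a minimal sketch, assuming the shard naming shown above and relying on the fact that GGUF files begin with the 4-byte magic GGUF. check_shards is a hypothetical helper, not part of bati.cpp.

```shell
# Hypothetical helper: check_shards DIR QUANT EXPECTED_COUNT
# Verifies that DIR holds EXPECTED_COUNT *.gguf files for QUANT and that each
# one starts with the GGUF magic bytes.
check_shards() {
  cs_dir="$1"; cs_quant="$2"; cs_expected="$3"; cs_count=0
  for cs_file in "$cs_dir"/*"$cs_quant"*.gguf; do
    [ -f "$cs_file" ] || continue        # glob matched nothing
    if [ "$(head -c 4 "$cs_file")" != "GGUF" ]; then
      echo "bad magic: $cs_file" >&2
      return 1
    fi
    cs_count=$((cs_count + 1))
  done
  if [ "$cs_count" -ne "$cs_expected" ]; then
    echo "found $cs_count of $cs_expected shards" >&2
    return 1
  fi
  echo "ok: $cs_count shards"
}

# Example (not executed here): IQ4_XS ships as 9 shards, IQ3_XXS as 7.
#   check_shards ./glm51 IQ4_XS 9
```

This only catches missing or obviously truncated files; it does not verify checksums.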

Ollama

Ollama support is pending; it first requires glm-dsa architecture support upstream in ggml-org/llama.cpp.

vLLM / TGI

Not directly compatible: these servers load FP8/BF16 safetensors. Use the original zai-org/GLM-5.1 for vLLM.

About bati.cpp

batiai/bati.cpp is BatiAI's llama.cpp-based fork focused on:

  • Apple Silicon (Metal) optimization
  • Frontier-model early access (V4-Flash, GLM-5.1 DSA, etc.) before mainline merges
  • BatiAI quantization standard (signature, imatrix workflow)

Built on top of ggml-org/llama.cpp and antirez/llama.cpp-deepseek-v4-flash (all MIT). See bati.cpp/ATTRIBUTION.md for full credits.

License

Inherits the source model license: MIT.

About BatiFlow

BatiFlow: free on-device AI automation for Mac. 5MB native app, 60+ tools (KakaoTalk, iMessage, Slack, Calendar, Notes, Chrome, file system). Works with all batiai/* models.
