GLM-5.1 GGUF: Quantized by BatiAI
IQ3_XXS / IQ4_XS quantization of zai-org/GLM-5.1 (744B total / 40B active MoE). Quantized directly from official Z.AI weights by BatiAI.
Why GLM-5.1?
- 744B parameters (40B active): frontier MoE with Deep Sparse Attention (DSA)
- #1 open-source on SWE-Bench Pro: leads the open-weight pack on agentic coding
- 256 experts per layer (top-8 routing + DSA indexer): extreme sparsity
- 79 transformer blocks with hybrid attention/FFN routing
- MIT license: fully permissive for commercial use, fine-tuning, redistribution
- Released by Z.AI / Zhipu AI: same lineage as ChatGLM / GLM-4
Quick Start
# IQ3_XXS (smaller, 273GB; needs 320GB+ unified RAM)
hf download batiai/GLM-5.1-GGUF --include "*IQ3_XXS*"
# IQ4_XS (recommended balance, 376GB; needs 448GB+ unified RAM)
hf download batiai/GLM-5.1-GGUF --include "*IQ4_XS*"
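The same selective download can be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The helper names `shard_names` and `download_quant` are hypothetical, and the shard naming pattern is inferred from the file names that appear in this README rather than from a documented convention.

```python
from fnmatch import fnmatch

REPO_ID = "batiai/GLM-5.1-GGUF"
# Shard counts taken from the "Available Quantizations" table.
SHARD_COUNTS = {"IQ3_XXS": 7, "IQ4_XS": 9}


def shard_names(quant: str) -> list[str]:
    """Expected shard file names for a quant (naming pattern is an assumption)."""
    total = SHARD_COUNTS[quant]
    return [
        f"zai-org-GLM-5.1-{quant}-{i:05d}-of-{total:05d}.gguf"
        for i in range(1, total + 1)
    ]


def download_quant(quant: str, local_dir: str) -> str:
    """Download only the shards of one quant; returns the local directory."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[f"*{quant}*"],  # same glob as the --include flag
        local_dir=local_dir,
    )


if __name__ == "__main__":
    names = shard_names("IQ4_XS")
    print(names[0])
    # Every expected shard name should match the download glob.
    print(all(fnmatch(name, "*IQ4_XS*") for name in names))
```

Calling `download_quant("IQ4_XS", "./glm51")` would fetch roughly 376 GB, so it is worth checking the expected shard list and free disk space first.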
Available Quantizations
| Quant | Total Size | Shards | Min RAM | Target Hardware |
|---|---|---|---|---|
| IQ3_XXS | 273 GB | 7 × ~40 GB | ~320 GB | M3 Ultra 512GB / H100 node |
| IQ4_XS | 376 GB | 9 × ~42 GB | ~448 GB | M3 Ultra 512GB / 8× A100 80GB |
⚠️ Not for consumer Macs: this is workstation / server class. 16-128GB Macs should use `batiai/qwen3.6-35b` or `batiai/minimax-m2.7`. Mac Studio M2 Ultra 192GB users should use `batiai/kimi-k2.6:iq3` (394GB but lighter active MoE); GLM-5.1 is denser at 40B active.
Hardware Reality Check
| Your System | IQ3_XXS (273GB) | IQ4_XS (376GB) |
|---|---|---|
| Mac 128GB | ❌ Won't fit | ❌ |
| Mac 192GB | ⚠️ Heavy swap (unusable) | ❌ |
| Mac 256GB | ⚠️ Tight (~50GB swap) | ❌ |
| Mac 384GB | ✅ Usable | ⚠️ Tight |
| Mac M3 Ultra 512GB | ✅ Comfortable | ✅ Usable |
| 2× M3 Ultra (cluster) | ✅ Fast | ✅ Fast |
| 8× A100 80GB (640GB) | ✅ Fast | ✅ Fast |
| H100 node | ✅ Fast | ✅ Fast |
Numbers are based on the MoE activation pattern: 40B active params × 2 bytes (Q4 active) ≈ 80GB runtime, plus shard buffers and KV cache (32K ctx ≈ 8-12GB). Going below the minimum RAM forces SSD paging, which destroys throughput.
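The arithmetic behind these numbers can be checked with a few lines of Python. This is a rough sketch, not a sizing tool: the sizes and minimum-RAM figures are copied from the quantization table, and the function names are made up for illustration. It only answers whether a machine meets the documented minimum; it does not model the "tight" middle ground.

```python
# Sizes and minimum RAM (GB) copied from the "Available Quantizations" table.
QUANTS = {
    "IQ3_XXS": {"size_gb": 273, "min_ram_gb": 320},
    "IQ4_XS": {"size_gb": 376, "min_ram_gb": 448},
}


def active_runtime_gb(active_params_billions: float = 40,
                      bytes_per_param: float = 2) -> float:
    """Active-parameter footprint: billions of params x bytes per param = GB."""
    return active_params_billions * bytes_per_param


def avoids_paging(ram_gb: float, quant: str) -> bool:
    """True when RAM meets the documented minimum for the quant."""
    return ram_gb >= QUANTS[quant]["min_ram_gb"]


def headroom_gb(ram_gb: float, quant: str) -> float:
    """RAM left over after the documented minimum (negative means paging)."""
    return ram_gb - QUANTS[quant]["min_ram_gb"]


if __name__ == "__main__":
    print(active_runtime_gb())
    for ram in (256, 384, 512):
        for quant in QUANTS:
            print(ram, quant, avoids_paging(ram, quant), headroom_gb(ram, quant))
```

Under these figures a 384GB machine clears the IQ3_XXS minimum but falls 64GB short for IQ4_XS, which lines up with the "Usable" and "Tight" cells in the table.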
Special Engineering Notes
GLM-5.1 uses Deep Sparse Attention (DSA): a per-layer "indexer" tensor selects the top-K key positions for sparse attention. This required two fixes during quantization:
- DSA indexer tensors not in imatrix → `--tensor-type indexer=q5_k` override (~600 MB overhead total)
- Last block (blk.78) imatrix gap → bati.cpp `llama-imatrix` does not record the final block; `--tensor-type blk.78=q5_k` workaround applied
Both flags are baked into our quantization pipeline (scripts/runtime/glm-pipeline.sh). The fallback Q5_K layer adds < 0.2% to file size but prevents low-bit IQ-quants from bailing on missing imatrix data.
79 of 1809 tensors used fallback quantization. These are the indexer and last-block weights, kept at higher precision.
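To make the effect of the two overrides concrete, here is a small sketch of how a name-based override could pick a quantization type per tensor. It is illustrative only: the tensor names are hypothetical, the matching logic (a plain regex search) is an assumption, and none of it is taken from bati.cpp or `glm-pipeline.sh`.

```python
import re

# Patterns modeled on the two flags: indexer=q5_k and blk.78=q5_k.
# The trailing dot keeps blk.78 from matching any other block index.
OVERRIDES = [
    (r"indexer", "q5_k"),
    (r"blk\.78\.", "q5_k"),
]


def quant_type_for(tensor_name: str, default: str = "iq3_xxs") -> str:
    """Return the override type for a tensor, or the default low-bit type."""
    for pattern, qtype in OVERRIDES:
        if re.search(pattern, tensor_name):
            return qtype
    return default


def fallback_share(fallback: int = 79, total: int = 1809) -> float:
    """Percentage of tensors that use the fallback type."""
    return 100 * fallback / total


if __name__ == "__main__":
    # Hypothetical tensor names, for illustration only.
    for name in ("blk.12.attn_indexer.weight",
                 "blk.78.ffn_down.weight",
                 "blk.7.ffn_down.weight"):
        print(name, quant_type_for(name))
    print(round(fallback_share(), 1))
```

By tensor count the fallback covers about 4.4% of the model (79 of 1809), while the file-size impact stays far smaller because the affected tensors are small relative to the expert weights.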
What BatiAI's Quantization Delivers
| | BatiAI | Typical 3rd-party |
|---|---|---|
| Source | Direct from official Z.AI weights | Often re-quantized from other GGUFs |
| Quantization flow | safetensors → Q8_0 → IQ3_XXS/IQ4_XS with imatrix (wikitext-2, 200 chunks) | Varies |
| imatrix | ✅ 200 chunks (quality saturation) | Often skipped or fewer chunks |
| DSA indexer handling | ✅ Q5_K override documented | Often unaddressed → garbage low-bit |
| Last-block imatrix gap | ✅ Workaround applied | Often causes bail-out or quality loss |
| BatiAI signature | ✅ `general.author=BatiAI`, `general.url=https://flow.bati.ai` | ❌ |
Model Comparison: BatiAI Lineup
| Your Hardware | Best BatiAI Model | Size |
|---|---|---|
| 16GB Mac | `batiai/gemma4-e4b:q4` | 5GB |
| 24GB Mac | `batiai/gemma4-26b:iq4` | 15GB |
| 48GB Mac | `batiai/qwen3.6-35b:iq4` | 22GB |
| 96GB Mac | `batiai/qwen3.6-35b:q6` | 29GB |
| 128GB Mac | `batiai/minimax-m2.7:iq3` | 82GB |
| 192GB Mac Studio | `batiai/kimi-k2.6:iq3` | 394GB (paged) |
| M3 Ultra 512GB | `batiai/GLM-5.1:iq4` ⬅️ here | 376GB |
| M3 Ultra 512GB (alt) | `batiai/kimi-k2.6:iq4` | 546GB (heavy swap) |
GLM-5.1 IQ4_XS at 376 GB is the largest model that runs comfortably on a single M3 Ultra 512GB without SSD swap. Kimi K2.6 IQ4 (546GB) would page heavily on the same machine.
Benchmarks (source model)
| Benchmark | GLM-5.1 | Notes |
|---|---|---|
| SWE-Bench Pro | #1 open-source | Beats Kimi K2.6 (58.6) on coding tasks |
| HumanEval | High | Strong code generation |
| MMLU | Strong | General reasoning |
| Context | 32K (extendable via YaRN) | |
| Tool use | ✅ Native | Function calling supported |
Numbers are from Z.AI's official report. Validating that quantization preserves these results on a Mac M3 Ultra is still pending (`bench.sh` on target hardware).
Technical Details
- Original Model: zai-org/GLM-5.1
- Architecture: `GlmMoeDsaForCausalLM`, 744B total / 40B active, 79 blocks (3 dense + 76 MoE), 256 experts (top-8), DSA hybrid attention
- Original storage: BF16/FP8 mix (~1.4 TB safetensors)
- License: MIT
- Quantized with: bati.cpp v0.1.2 (BatiAI's llama.cpp fork, needed for the DSA architecture)
- Calibration: wikitext-2-raw, 200 chunks (quality saturation)
- imatrix overrides: `--tensor-type indexer=q5_k --tensor-type blk.78=q5_k`
- Quantized by: BatiAI
Usage
llama.cpp / bati.cpp
GLM-5.1 currently requires bati.cpp (BatiAI's llama.cpp fork); mainline ggml-org/llama.cpp does not yet support the glm-dsa architecture. We will switch to mainline once support lands.
git clone https://github.com/batiai/bati.cpp.git
cd bati.cpp
cmake -B build -DGGML_METAL=ON # macOS
# or: cmake -B build -DGGML_CUDA=ON # Linux
cmake --build build -j --target llama-cli
hf download batiai/GLM-5.1-GGUF --include "*IQ4_XS*" --local-dir ./glm51
build/bin/llama-cli -m ./glm51/zai-org-GLM-5.1-IQ4_XS-00001-of-00009.gguf \
-p "Your prompt" \
--ctx-size 32768 \
--n-gpu-layers 99
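If you launch the CLI from a script, the command above can be assembled programmatically. The sketch below assumes the shards were downloaded to `./glm51` and that the binary was built at `build/bin/llama-cli`; `first_shard` and `build_cli_args` are hypothetical helper names. Only the first shard needs to be passed with `-m`, since llama.cpp picks up the remaining shards of a split GGUF from it.

```python
import shlex
from pathlib import Path


def first_shard(model_dir: str, quant: str = "IQ4_XS") -> Path:
    """Return the 00001 shard of a split GGUF inside model_dir."""
    matches = sorted(Path(model_dir).glob(f"*{quant}-00001-of-*.gguf"))
    if not matches:
        raise FileNotFoundError(f"no {quant} first shard in {model_dir}")
    return matches[0]


def build_cli_args(binary: str, model: Path, prompt: str,
                   ctx_size: int = 32768, n_gpu_layers: int = 99) -> list[str]:
    """Assemble the same llama-cli invocation as the shell example."""
    return [
        str(binary),
        "-m", str(model),
        "-p", prompt,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
    ]


if __name__ == "__main__":
    model_dir = Path("./glm51")  # assumed download location
    if model_dir.is_dir():
        args = build_cli_args("build/bin/llama-cli",
                              first_shard(str(model_dir)), "Your prompt")
        # Print the command; pass `args` to subprocess.run to execute it.
        print(shlex.join(args))
```

Building the argument list instead of a shell string avoids quoting problems when the prompt contains spaces or quotes.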
Ollama
Ollama support is pending; it first requires glm-dsa architecture support upstream in ggml-org/llama.cpp.
vLLM / TGI
Not directly compatible: these servers load FP8/BF16 safetensors. Use the original zai-org/GLM-5.1 for vLLM.
About bati.cpp
batiai/bati.cpp is BatiAI's llama.cpp-based fork focused on:
- Apple Silicon (Metal) optimization
- Frontier-model early access (V4-Flash, GLM-5.1 DSA, etc.) before mainline merges
- BatiAI quantization standard (signature, imatrix workflow)
Built on top of ggml-org/llama.cpp and antirez/llama.cpp-deepseek-v4-flash (all MIT). See bati.cpp/ATTRIBUTION.md for full credits.
License
Inherits the source model license: MIT.
About BatiFlow
BatiFlow: free on-device AI automation for Mac. 5MB native app, 60+ tools (KakaoTalk, iMessage, Slack, Calendar, Notes, Chrome, file system). Works with all batiai/* models.