GLM 5.1 optimized to run on a Mac Studio M3 512. This is the quality-first version. Alternatives: speed-first, balanced

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 4-bit baseline with important layers at higher precision.
  • Fits into ~420 GB memory, leaving plenty of room to run a smaller parallel model (ex: Qwen 3.6 35B).

Usage

# Start server at http://localhost:8080/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/GLM-5.1-MLX-4.5bit

Benchmarks

metric baa-ai/GLM-5.1-RAM-270GB-MLX 2.9 bit 3.6 bit 4.5 bit (this model)
bpw 3.110 2.906 3.645 4.538
base memory 269.303 251.702 315.648 392.992
peak memory (1024/512) 291.257 272.358 341.020 424.067
prompt tok/s (1024) 194.958 卤 0.075 194.216 卤 0.167 190.508 卤 0.880 193.563 卤 0.094
gen tok/s (512) 21.381 卤 0.050 19.527 卤 0.035 17.873 卤 0.156 17.259 卤 0.032
kl mean* 0.686 卤 0.054 0.268 卤 0.009 0.117 卤 0.004 0.048 卤 0.002
kl p95* 1.478 卤 0.054 0.537 卤 0.009 0.236 卤 0.004 0.097 卤 0.002
perplexity 4.780 卤 0.020 4.118 卤 0.016 3.945 卤 0.016 3.920 卤 0.016
piqa 0.776 卤 0.010 0.794 卤 0.009 0.820 卤 0.017 0.814 卤 0.017

* GLM 5.1 KL divergence calculated against the largest quant I could run locally (~495 GB), so real KL is higher.

Tested on a Mac Studio M3 Ultra with:

mlx_lm.kld --baseline-model path/to/mlx-full-precision
mlx_lm.perplexity --sequence-length 2048 --seed 123
mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 500

mlx_lm.kld is approximate, based on top_k not full logits. Here's the code.

Methodology

Quantized with a mlx-lm fork, drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs. MLX quantization options differ from llama.cpp, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
Downloads last month
774
Safetensors
Model size
744B params
Tensor type
BF16
U32
F32
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for spicyneuron/GLM-5.1-MLX-4.5bit

Base model

zai-org/GLM-5.1
Quantized
(40)
this model