Macaron-V1-Preview-749B-MLX-4bit

MLX (Apple Silicon) conversion of mindlab-research/Macaron-V1-Preview-749B — a 749B-parameter glm_moe_dsa (DeepSeek-V3.2-style sparse-attention MoE, 256 experts) — quantized to 4-bit. First MLX build of this model.

Quantizations

Part of the Macaron-V1-Preview-749B MLX collection.

Variant Notes
4-bit (this repo) 4-bit · ~430GB · tight on 512GB
mixed mixed · experts@3-bit, non-expert@6-bit · ~360GB · comfortable 512GB fit

The mixed build keeps the routed experts at 3-bit and the precision-sensitive non-expert layers (attention, shared experts, dense layers, embeddings, lm_head) at 6-bit, sized to run comfortably on a 512 GB Mac.

Use with mlx-lm

pip install mlx-lm
python -m mlx_lm generate --model pipenetwork/Macaron-V1-Preview-749B-MLX-4bit --prompt "Hello" -m 128

Validation

Smoke-tested locally (loads + generates coherent text).

License

MIT (inherited from base). Quantization config (excerpt): {"group_size": 64, "bits": 4, "mode": "affine"}.

Downloads last month
-
Safetensors
Model size
744B params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for pipenetwork/Macaron-V1-Preview-749B-MLX-4bit

Base model

zai-org/GLM-5.1
Quantized
(2)
this model

Collection including pipenetwork/Macaron-V1-Preview-749B-MLX-4bit