MiniMax-M3 · REAP-32 · JANG_2L

⚠️ Requires vMLX ≥ v1.5.62

Earlier vMLX builds contain a runtime cache bug that causes repetition loops on long outputs. This is an engine issue, not a weights issue: update vMLX to v1.5.62 or later before running this model. On v1.5.62 and later, generation is clean.
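The version requirement can be checked before loading. The following is a minimal sketch, assuming the installed vMLX version is available as a string such as "v1.5.62"; how that string is obtained is not covered here, and parse_version and is_supported are hypothetical helper names, not part of vMLX.

```python
import re

# Minimum vMLX version stated in this model card.
MIN_VMLX = (1, 5, 62)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v1.5.62' or '1.5.62' into the tuple (1, 5, 62)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_supported(installed: str) -> bool:
    """True if the given version string is at least MIN_VMLX."""
    # Tuples of ints compare numerically, so 1.5.9 sorts below 1.5.62.
    return parse_version(installed) >= MIN_VMLX
```

Comparing integer tuples rather than raw strings avoids the lexical trap where "1.5.9" would sort above "1.5.62".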

A space-efficient MiniMax-M3 bundle for Apple Silicon: 32 % REAP expert pruning + JANG_2L mixed-precision quantization, ~105 GB, runs on a single 128 GB Mac via vMLX / MLX.

What this is

  • Base: MiniMax-M3 (model_type=minimax_m3_vl) — MoE, GQA-4, MSA Lightning Indexer, vision tower.
  • Pruning: REAP saliency pruning, 32 % of routed experts removed (87 of 128 kept per MoE layer), highest-saliency experts retained.
  • Quantization (JANG_2L, affine, group size 64):
    tensor                                  bits
    routed experts gate_proj / up_proj      2
    routed experts down_proj                3
    shared experts                          6
    dense MLP (layers 0–2)                  6
    attention q/k/v/o                       8
    embeddings                              6
    lm_head                                 8
    vision tower + projectors               8
    norms, router gate, MSA indexer         fp16
    The routed experts' down_proj is kept at 3-bit, while their gate_proj and up_proj are 2-bit, to keep long-form output coherent. The full per-module bit map is stored under the quantization key in config.json and applied automatically by the loader.
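The pruning and quantization figures above can be checked with a little arithmetic. This is a rough sketch, not the packaging script. It assumes each 64-weight group stores one fp16 scale and one fp16 bias (32 bits of overhead per group), which is how affine group quantization is commonly laid out; the bundle's actual storage may differ.

```python
TOTAL_EXPERTS = 128  # routed experts per MoE layer before pruning
KEPT_EXPERTS = 87    # routed experts per MoE layer after REAP pruning
GROUP_SIZE = 64      # quantization group size from the table above

removed = TOTAL_EXPERTS - KEPT_EXPERTS
removed_fraction = removed / TOTAL_EXPERTS


def effective_bits(bits: int, group_size: int = GROUP_SIZE,
                   overhead_bits: int = 32) -> float:
    """Bits per weight including per-group overhead.

    Assumption: one fp16 scale plus one fp16 bias per group (32 bits).
    """
    return bits + overhead_bits / group_size


print(f"removed {removed} of {TOTAL_EXPERTS} experts ({removed_fraction:.1%})")
# prints: removed 41 of 128 experts (32.0%)

for name, bits in [("gate_proj / up_proj", 2), ("down_proj", 3)]:
    print(f"{name}: {effective_bits(bits)} bits per weight")
# prints: gate_proj / up_proj: 2.5 bits per weight
# prints: down_proj: 3.5 bits per weight
```

Under that assumption, the nominal 2-bit and 3-bit routed-expert tensors cost about 2.5 and 3.5 bits per weight once the per-group metadata is counted.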

Usage

Load in vMLX (v1.5.62+); the engine autodetects minimax_m3_vl and applies the correct settings (native MSA cache, paged cache off, per-module quant map). Sampling defaults ship in generation_config.json (temperature=1.0, top_p=0.95).
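To inspect the shipped sampling defaults outside the engine, generation_config.json can be read directly. This is a minimal sketch using only the Python standard library; the model directory path is whatever local folder holds the bundle, and load_sampling_defaults is a hypothetical helper, not a vMLX or MLX function.

```python
import json
from pathlib import Path

# Values quoted in this model card; used only when a key is missing.
CARD_DEFAULTS = {"temperature": 1.0, "top_p": 0.95}


def load_sampling_defaults(model_dir: str) -> dict:
    """Return temperature and top_p from generation_config.json.

    Falls back to the card's stated defaults if the file or a key is absent.
    """
    path = Path(model_dir) / "generation_config.json"
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    return {key: config.get(key, fallback) for key, fallback in CARD_DEFAULTS.items()}
```

Calling load_sampling_defaults on the local bundle directory returns a two-key dictionary that can be passed to whatever sampler is in use.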

Attribution

  • Quantization & packaging: Jinho Jang · eric@jangq.ai
  • Base model © MiniMax, used under the MiniMax-M3 license.