MiniMax-M3-oQ4 (oMLX, 4-bit)

4-bit oQ4 quantization of MiniMaxAI/MiniMax-M3 (428B-parameter MoE / ~23B active, minimax_m3_vl vision-language, MiniMax Sparse Attention) for oMLX on Apple Silicon.

Size: ~228 GiB · group-size 64 · 4.59 bpw effective (mixed precision over 426.85 B weights: ~97.6% 4-bit, with sensitivity-boosted 8-bit on the most sensitive tensors — lm_head, embeddings and a few attention layers — plus a small 5-bit fraction; norms unquantized) · vision tower preserved
Quantized from: the bf16 source (796 GB) via oMLX streaming quant + a position-heuristic sensitivity map (no full model load), then fused into the packed switch_mlp.gate_up_proj (129-row) layout required by the current mlx-vlm M3 code.
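As a sanity check, the quoted size follows from the weight count and the effective bits per weight. This is a minimal sketch, assuming the 426.85 B figure covers every stored weight and that 4.59 bpw already includes the per-group scales and biases; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the checkpoint size from the figures above.
# Assumption: 4.59 bpw is the all-in average (scales/biases included).
WEIGHTS = 426.85e9      # weights covered by the quantization
BPW = 4.59              # effective bits per weight
BF16_BITS = 16          # bits per weight in the bf16 source

size_bytes = WEIGHTS * BPW / 8
size_gib = size_bytes / 2**30          # binary gibibytes
compression = BF16_BITS / BPW          # size reduction relative to bf16

print(f"estimated size: {size_gib:.1f} GiB")
print(f"compression vs bf16: {compression:.2f}x")
```

The estimate lands within a fraction of a GiB of the ~228 GiB stated above.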

⚠️ Requirements — read before downloading

This checkpoint uses the fused gate_up_proj layout and will not load on stock mlx-vlm. It requires all of the following:

mlx-vlm PR #1374 ("Minimax m3 support") at the fused-layout revision: commit c0b3518 or later (verified on head 8fd6fe7, 2026-06-15). Earlier commits use the unfused layout and fail with the error "Received 855 parameters not in model". PR #1374 is also required to run M3 at all, because the minimax_m3_vl architecture is not in any released mlx-vlm or mlx-lm.
trust_remote_code: true — M3 ships a custom HF processor via auto_map.
torch + torchvision installed in the serving env — M3's image/video processor imports torch (the MLX env does not include it by default).

Hardware: long, sustained generations need a large GPU working set (~500 GB on a 512 GB Mac Studio M3 Ultra). Short requests run comfortably; very long generations approach Metal's recommendedMaxWorkingSetSize ceiling. The fused layout in this checkpoint is what keeps long generations under that ceiling; the unfused layout runs out of memory.

Serving on oMLX

Place the checkpoint under your oMLX models directory and add a model_settings.json entry:

{
  "MiniMax-M3-oQ4": {
    "trust_remote_code": true,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "force_sampling": true
  }
}
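If model_settings.json already holds entries for other models, the new entry should be merged in rather than overwriting the file. The sketch below shows one way to do that in Python; the helper name `add_model_entry` and the path you pass to it are hypothetical, since the location of the settings file depends on your oMLX installation.

```python
import json
from pathlib import Path

# Settings copied from the entry above.
M3_SETTINGS = {
    "trust_remote_code": True,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "force_sampling": True,
}


def add_model_entry(settings_path, model_name, entry):
    """Insert or replace one model entry, keeping all other entries intact."""
    path = Path(settings_path)
    # Start from the existing file if there is one, otherwise from an empty map.
    data = json.loads(path.read_text()) if path.exists() else {}
    data[model_name] = entry
    path.write_text(json.dumps(data, indent=2) + "\n")
    return data
```

Calling `add_model_entry("/path/to/model_settings.json", "MiniMax-M3-oQ4", M3_SETTINGS)` would write the entry shown above; the path here is a placeholder.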

⚠️ oMLX integration — patches NOT yet in oMLX main

This checkpoint loads and generates on stock oMLX-main + mlx-vlm #1374, but three oMLX-side behaviours need small patches that are not yet upstream (oMLX main has no minimax_m3_vl handling). Without them you'll see the failure modes below. We use them in production and intend to upstream them; ping us if you want the diffs.

| Area | oMLX file | What it does | Without it |
| --- | --- | --- | --- |
| Scheduling | `scheduler.py` | Serialize `minimax_m3_vl` (like Llama-4) + handle the MiniMax-Sparse-Attention KV cache (`MiniMaxM3KVCache` ↔ batch variant; #1374 `263a4e0` adds the model-side cache-merge) | `MiniMaxM3KVCache … does not support batching with history` under concurrency |
| Reasoning | `api/utils.py` | Map `<mm:think>`/`</mm:think>` → `<think>`/`</think>` before thinking extraction | CoT leaks into `content` instead of `reasoning_content` |
| Tool calls | `api/tool_calling.py` + `server.py` | Parse `<invoke name=…>` + bare `<key>value</key>` params and strip the `]<]minimax[>[` token (200058) | Raw tool-call markup leaks into `content`, no structured `tool_calls` |

The tool-call parser is a natural candidate to land in mlx-vlm's tool_parsers, where it would be selectable without an oMLX patch; the scheduler and reasoning changes are oMLX-side.

Reasoning format

M3 wraps chain-of-thought in <mm:think>…</mm:think> (vs the usual <think>). The api/utils.py mapping above turns it into a clean reasoning_content field.
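The mapping can be illustrated with a few lines of Python. This is a sketch of the idea, not the actual oMLX patch: the function names are hypothetical, and it assumes at most one thinking block per response.

```python
import re

# Matches a single <think>...</think> block, including newlines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def normalize_think_tags(text):
    """Rewrite M3's <mm:think> tags to the conventional <think> tags."""
    return text.replace("<mm:think>", "<think>").replace("</mm:think>", "</think>")


def split_reasoning(text):
    """Return (reasoning_content, content); reasoning is None if absent."""
    text = normalize_think_tags(text)
    match = _THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the thinking block is the user-visible answer.
    content = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, content
```

For example, `split_reasoning("<mm:think>2+2 is 4</mm:think>The answer is 4.")` separates the chain-of-thought from the final answer.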

Tool-call format

M3 emits (note ]<]minimax[>[ is special token 200058, the namespace marker):

]<]minimax[>[<tool_call>]<]minimax[>[<invoke name="FUNC">]<]minimax[>[<param>value]<]minimax[>[</param>]<]minimax[>[</invoke>]<]minimax[>[</tool_call>

i.e. <invoke name="..."> with bare <key>value</key> parameter tags (not <parameter name="key">). The api/tool_calling.py parser above converts this to structured tool_calls.
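A parser for this format can be sketched in Python as follows. It is an illustration under stated assumptions, not the oMLX parser itself: the function name `parse_tool_calls` is hypothetical, the output shape follows the common OpenAI-style `tool_calls` structure, parameter values are kept as plain strings, and nested tags inside a value are not handled.

```python
import json
import re

NS_TOKEN = "]<]minimax[>["  # special token 200058, the namespace marker

# One <invoke name="..."> ... </invoke> block per tool call.
_INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
# Bare <key>value</key> parameter tags; \1 requires the closing tag to match.
_PARAM_RE = re.compile(r"<([A-Za-z_][\w.-]*)>(.*?)</\1>", re.DOTALL)


def parse_tool_calls(text):
    """Extract structured tool calls from raw M3 output."""
    cleaned = text.replace(NS_TOKEN, "")  # strip the namespace marker first
    calls = []
    for name, body in _INVOKE_RE.findall(cleaned):
        args = {key: value.strip() for key, value in _PARAM_RE.findall(body)}
        calls.append({
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    return calls
```

Stripping the marker token before matching keeps the two regular expressions simple, at the cost of assuming the marker never appears inside a legitimate parameter value.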

Benchmark (oMLX v3, role-mapped suite)

Warmup 93.9 s · decode ~21.7 tok/s · prefill ~214 tok/s · concurrent aggregate ~40.9 tok/s. Quality (letter grades A+ to F mapped to a 4.3-point scale): Overall 3.72 / Medical 3.80. Results are strong across coding/QA/legal/ops and clinical/pharma/psych; tool-calling (support) requires the parser above.
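The throughput figures give a rough way to estimate single-request latency. The sketch below assumes prefill and decode rates stay constant regardless of context length and ignores warmup and caching, so treat the result as an order-of-magnitude guide; the function name is hypothetical.

```python
# Rates taken from the benchmark line above (single request).
PREFILL_TPS = 214.0   # prompt tokens processed per second
DECODE_TPS = 21.7     # output tokens generated per second


def estimate_seconds(prompt_tokens, output_tokens):
    """Estimate wall-clock time as prefill time plus decode time."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS


# Example: a 2,000-token prompt with a 500-token reply.
print(f"{estimate_seconds(2000, 500):.1f} s")
```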

License

Inherits the MiniMax-M3 license. This is a quantized derivative for local inference.
