MiniMax-M3-oQ4 (oMLX, 4-bit)
4-bit oQ4 quantization of MiniMaxAI/MiniMax-M3
(428B-parameter MoE / ~23B active, minimax_m3_vl vision-language, MiniMax Sparse Attention) for
oMLX on Apple Silicon.
- Size: ~228 GiB · group-size 64 · 4.59 bpw effective (mixed precision over 426.85 B weights: ~97.6% 4-bit, with sensitivity-boosted 8-bit on the most sensitive tensors — lm_head, embeddings and a few attention layers — plus a small 5-bit fraction; norms unquantized) · vision tower preserved
- Quantized from: the bf16 source (796 GB) via oMLX streaming quant + a position-heuristic
sensitivity map (no full model load), then fused into the packed
switch_mlp.gate_up_proj (129-row) layout required by the current mlx-vlm M3 code.
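As a sanity check on the size figure, the quoted weight count and effective bits per weight can be turned into a checkpoint size. This is a minimal sketch that uses only the numbers stated in this card; the helper name is illustrative, and the estimate ignores anything stored outside the 426.85 B quantized weights.

```python
# Figures quoted in this card.
N_WEIGHTS = 426.85e9      # weights covered by the mixed-precision quantization
BITS_PER_WEIGHT = 4.59    # effective bits per weight, as quoted


def quantized_size_gib(n_weights: float, bpw: float) -> float:
    """Convert a weight count and effective bits-per-weight into GiB."""
    total_bytes = n_weights * bpw / 8   # bits -> bytes
    return total_bytes / 2**30          # bytes -> GiB


size = quantized_size_gib(N_WEIGHTS, BITS_PER_WEIGHT)
print(f"{size:.1f} GiB")  # prints "228.1 GiB"
```

The result agrees with the ~228 GiB listed above.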
⚠️ Requirements — read before downloading
This checkpoint is in the fused gate_up_proj layout. It will not load on stock mlx-vlm.
- mlx-vlm PR #1374 ("Minimax m3 support"), at the fused-layout revision: commit c0b3518 or later (verified on head 8fd6fe7, 2026-06-15). Earlier commits use the unfused layout and will report "Received 855 parameters not in model". PR #1374 is also what's needed to run M3 at all (the minimax_m3_vl architecture is not in released mlx-vlm/mlx-lm).
- trust_remote_code: true, because M3 ships a custom HF processor via auto_map.
- torch + torchvision installed in the serving env, because M3's image/video processor imports torch (the MLX env does not include it by default).
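A quick way to confirm the Python-side requirements before loading is to check that the needed modules are importable in the serving environment. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and the check says nothing about which mlx-vlm commit is installed, so the PR #1374 revision still has to be verified separately.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Top-level modules the requirements above depend on.
REQUIRED = ["mlx_vlm", "torch", "torchvision"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing from the serving env:", ", ".join(missing))
else:
    print("All required modules are importable.")
```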
Hardware: sustained/long generations need a large GPU working set (~500 GB on a 512 GB Mac Studio
M3 Ultra). Short requests run comfortably; very long generations approach Apple's
recommendedMaxWorkingSet ceiling. The fused layout in this checkpoint is what keeps long generations
under that ceiling (the unfused layout OOMs).
Serving on oMLX
Place the checkpoint under your oMLX models directory and add a model_settings.json entry:
{
  "MiniMax-M3-oQ4": {
    "trust_remote_code": true,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "force_sampling": true
  }
}
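If model_settings.json already holds entries for other models, the new entry can be merged in rather than pasted by hand. This is a minimal sketch, assuming the file is a single JSON object keyed by model name as in the entry above; the helper name and the file location are assumptions, so adjust the path to wherever your oMLX install keeps it.

```python
import json
from pathlib import Path

# The entry from this card, as a Python dict.
ENTRY = {
    "MiniMax-M3-oQ4": {
        "trust_remote_code": True,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,
        "force_sampling": True,
    }
}


def merge_model_settings(settings_path: Path, entry: dict) -> dict:
    """Merge `entry` into the JSON file at `settings_path`, creating it if absent."""
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    else:
        settings = {}
    settings.update(entry)  # replaces an existing entry with the same key
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


# Example call (the path is a placeholder, not a documented oMLX location):
# merge_model_settings(Path("model_settings.json"), ENTRY)
```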
⚠️ oMLX integration — patches NOT yet in oMLX main
This checkpoint loads and generates on stock oMLX-main + mlx-vlm #1374, but three oMLX-side
behaviours need small patches that are not yet upstream (oMLX main has no minimax_m3_vl handling).
Without them you'll see the failure modes below. We use them in production and intend to upstream them;
ping us if you want the diffs.
| Area | oMLX file | What it does | Without it |
|---|---|---|---|
| Scheduling | scheduler.py | Serialize minimax_m3_vl (like Llama-4) + handle the MiniMax-Sparse-Attention KV cache (MiniMaxM3KVCache ↔ batch variant; #1374 263a4e0 adds the model-side cache-merge) | MiniMaxM3KVCache … does not support batching with history under concurrency |
| Reasoning | api/utils.py | Map <mm:think>/</mm:think> → <think>/</think> before thinking extraction | CoT leaks into content instead of reasoning_content |
| Tool calls | api/tool_calling.py + server.py | Parse <invoke name=…> + bare <key>value</key> params and strip the ]<]minimax[>[ token (200058) | raw tool-call markup leaks into content, no structured tool_calls |
The tool-call parser is the right candidate to land in mlx-vlm's tool_parsers (then selectable
without an oMLX patch); the scheduler + reasoning bits are oMLX-side.
Reasoning format
M3 wraps chain-of-thought in <mm:think>…</mm:think> (vs the usual <think>). The api/utils.py
mapping above turns it into a clean reasoning_content field.
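The tag mapping can be illustrated in a few lines. This is a sketch of the idea only, not the actual oMLX patch: the function names are hypothetical, and it assumes the thinking block is well formed (opened and closed) in the decoded text.

```python
import re


def normalize_think_tags(text: str) -> str:
    """Rewrite M3's <mm:think> tags to the conventional <think> tags."""
    return text.replace("<mm:think>", "<think>").replace("</mm:think>", "</think>")


_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning_content, content) from a raw completion."""
    text = normalize_think_tags(text)
    reasoning = "\n".join(m.strip() for m in _THINK_RE.findall(text))
    content = _THINK_RE.sub("", text).strip()
    return reasoning, content


raw = "<mm:think>User wants a greeting.</mm:think>Hello!"
reasoning, content = split_reasoning(raw)
print(repr(reasoning))  # 'User wants a greeting.'
print(repr(content))    # 'Hello!'
```

A streaming server would need to handle tags split across chunks, which this sketch does not attempt.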
Tool-call format
M3 emits (note ]<]minimax[>[ is special token 200058, the namespace marker):
]<]minimax[>[<tool_call>]<]minimax[>[<invoke name="FUNC">]<]minimax[>[<param>value]<]minimax[>[</param>]<]minimax[>[</invoke>]<]minimax[>[</tool_call>
i.e. <invoke name="..."> with bare <key>value</key> parameter tags (not
<parameter name="key">). The api/tool_calling.py parser above converts this to structured
tool_calls.
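To make the format concrete, here is a minimal parsing sketch. It is not the api/tool_calling.py patch: the function name, the sample tool `get_weather`, and the output shape (an OpenAI-style list without call ids) are assumptions for illustration. Parameter values are kept as strings, since coercing types would need the tool schema, and nested tags inside a value are not handled.

```python
import json
import re

MARKER = "]<]minimax[>["  # special token 200058 as it appears in decoded text

_INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
# Bare <key>value</key> tags; \1 requires the closing tag to match the opening one.
_PARAM_RE = re.compile(r"<([A-Za-z_][\w.-]*)>(.*?)</\1>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Strip the namespace marker and extract structured tool calls."""
    cleaned = text.replace(MARKER, "")
    calls = []
    for name, body in _INVOKE_RE.findall(cleaned):
        args = {key: value.strip() for key, value in _PARAM_RE.findall(body)}
        calls.append({
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    return calls


sample = (
    ']<]minimax[>[<tool_call>]<]minimax[>[<invoke name="get_weather">'
    ']<]minimax[>[<city>Paris]<]minimax[>[</city>'
    ']<]minimax[>[</invoke>]<]minimax[>[</tool_call>'
)
print(parse_tool_calls(sample))
```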
Benchmark (oMLX v3, role-mapped suite)
Warmup 93.9 s · decode ~21.7 tok/s · prefill ~214 tok/s · concurrent aggregate ~40.9 tok/s. Quality (letter grades A+ to F mapped onto a 4.3-point scale): Overall 3.72 / Medical 3.80. Strong across coding/QA/legal/ops and clinical/pharma/psych; tool-calling (support) requires the tool-call parser described above.
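The throughput figures translate into a rough per-request latency. The sketch below is a back-of-envelope estimate only: it assumes a single request on an already warmed-up model and treats the quoted prefill and decode rates as constant, which they will not be at long context lengths.

```python
# Rates quoted in the benchmark line above.
PREFILL_TOK_S = 214.0
DECODE_TOK_S = 21.7


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate wall-clock seconds as prefill time plus decode time."""
    return prompt_tokens / PREFILL_TOK_S + output_tokens / DECODE_TOK_S


# Example: a 2000-token prompt with a 500-token reply.
print(f"{estimate_seconds(2000, 500):.1f} s")  # prints "32.4 s"
```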
License
Inherits the MiniMax-M3 license. This is a quantized derivative for local inference.