MiniMax-M3 · AutoRound 3.2-bit · long-context vLLM port

Mixed 2/3/4/8-bit (AutoRound) quantization of MiniMax-M3 (~428B-param MoE, 60 layers, 128 experts top-4 + 1 shared expert), served with the official native MiniMax-M3 model in vLLM 0.23.1. Unlike the out-of-tree port (≤2048 ctx), this one uses the model's MSA "lightning indexer", so it handles long context out of the box: validated to 30K tokens (the KV cache holds ~46K) on 2× RTX PRO 6000 Blackwell.

This card documents the integration (checkpoint + loader + serving). The quantization itself is OneComp's mixed-bit GPTQ pipeline.


Status / results (2026-06-17)

| metric | result |
| --- | --- |
| Correctness | ✅ "The capital of France is" → " Paris. …London…Madrid"; "日本の首都は" ("The capital of Japan is") → "東京です。" ("It is Tokyo.") |
| Long context | ✅ needle-in-haystack retrieved at 4K / 7K / 14K / 30K tokens (with the needle in the middle of the context, not just the start) |
| Accuracy (PPL) | wikitext-2 ≈ 5.2–6.1 (BF16 baseline ≈ 5.3) ⇒ ~+14% from 3.2-bit; numerically faithful |
| Footprint | 87.6 GiB per GPU (×2), KV 46,464 tokens |
| Throughput | ~13 tok/s on 2× RTX PRO 6000 (Blackwell) |
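
The needle-in-haystack check in the table can be pictured with a small prompt builder. This is only an illustrative sketch: the actual M3_LONGTEST code is not reproduced in this card, and the function names, filler sentence, and passcode below are made up for the example.

```python
def build_needle_prompt(needle: str, filler: str, n_filler: int,
                        position: float = 0.5) -> str:
    """Return a long prompt with `needle` inserted at a relative `position`.

    position=0.5 puts the needle in the middle of the filler lines,
    which is the harder case mentioned in the results table.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    lines = [filler] * n_filler
    lines.insert(int(n_filler * position), needle)
    question = "What is the secret passcode mentioned above?"
    return "\n".join(lines) + "\n\n" + question


def retrieved(answer: str, secret: str) -> bool:
    """Pass criterion for the sketch: the secret appears in the model's answer."""
    return secret in answer


# Hypothetical needle and filler; scale n_filler up to reach the target token count.
NEEDLE = "The secret passcode is 7421."
prompt = build_needle_prompt(NEEDLE, "The sky was clear and the grass was green.", 1000)
```

The number of filler lines is only a proxy for context length; to hit a specific token count (4K, 30K, ...) you would measure the prompt with the model's tokenizer.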

Hardware / software

  • GPUs: 2× RTX PRO 6000 Blackwell, sm_120 (cap 12.0), ~95 GiB each.
  • vLLM: 0.23.1 (native MiniMaxM3SparseForCausalLM).
  • OneComp gptq plugin: https://github.com/mmzz164/OneCompression (vllm_plugins/gptq, MIT) — pin tag m3-serving-v1 (this port needs the vLLM ≥ 0.22 / auto_gptq + swigluoai revision; the default main is the older vLLM < 0.22 serving used by the Nex-N2-Pro card).
  • Checkpoint: AutoRound 3.2-bit (MiniMax-M3-w16g128, ~176 GiB).
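
As a rough sanity check on these numbers, the checkpoint size can be converted into an average bit width per parameter. The sketch below assumes the ~428B parameter count and ~176 GiB size quoted in this card and an even split across the two GPUs; real sharding may replicate some tensors, so treat the per-GPU figure as approximate.

```python
GIB = 2**30  # bytes per GiB


def bits_per_param(checkpoint_gib: float, n_params: float) -> float:
    """Average stored bits per parameter, including any unquantized tensors and metadata."""
    return checkpoint_gib * GIB * 8 / n_params


bpp = bits_per_param(176, 428e9)   # sizes taken from this card (both approximate)
per_gpu = 176 / 2                  # assumes an even split over 2 GPUs
print(f"{bpp:.2f} bits/param, {per_gpu:.1f} GiB of weights per GPU")
# prints: 3.53 bits/param, 88.0 GiB of weights per GPU
```

The result sits above the nominal 3.2 bits; that gap is plausibly due to tensors kept in bf16 and per-group quantization metadata, but the card does not break it down.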

Files (this port)

| file | role |
| --- | --- |
| m3_quant.py | registers the autoround_mixed quant config; parses AutoRound per-module bits; shared-expert block_sparse_moe.* bits alias; keeps qkv_proj bf16 only |
| m3_official_loader.py | key-translating loader: our transformers-VL names → official names; de-quants the fused-indexer qkv to bf16; splits the fused GPTQ gate_up into quantized gate/up shards |
| serve_m3_official.py | one-shot generate + _override_quant_method (the config FORCE dict incl. n_shared_experts=1, swigluoai, rope θ=5e6, TRITON_ATTN); M3_LONGTEST=1 = needle test |
| m3_official_api_server.py | OpenAI-compatible API server (:8003, model minimax-m3-long) |
| ppl_official_offline.py | wikitext-2 PPL via prompt_logprobs |
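
For readers unfamiliar with computing perplexity from prompt log-probabilities, here is a minimal sketch of the arithmetic. It is not the contents of ppl_official_offline.py; it assumes you already have one natural-log probability per prompt token, with None for positions that have no score (typically the first token).

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the scored tokens.

    `token_logprobs` holds natural-log probabilities; None entries
    (tokens with no preceding context to score against) are skipped.
    """
    scored = [lp for lp in token_logprobs if lp is not None]
    if not scored:
        raise ValueError("no scored tokens")
    return math.exp(-sum(scored) / len(scored))


# Toy example: two scored tokens with probabilities 0.5 and 0.25.
example = perplexity([None, math.log(0.5), math.log(0.25)])
```

For a full dataset, the log-probabilities from all chunks are pooled before taking the mean, so that long and short chunks are weighted by their token counts.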

Configuration

The scripts auto-detect their own directory (__file__), so just keep the code and the weight files together (as in this repo). Configure via env vars:

| env var | purpose | default |
| --- | --- | --- |
| ONECOMP_PATH | path to your clone of https://github.com/mmzz164/OneCompression at tag m3-serving-v1 | OneCompression (cwd) |
| M3_CKPT | dir holding the quantized weights (*.safetensors, config.json, …) | the script's own dir (this repo) |

Other knobs: M3_MAXLEN (default 40960), M3_ATTN_BACKEND (default TRITON_ATTN), M3_PORT (default 8003), M3_EAGER (default 1 = eager mode; 0 enables CUDA graphs, which crash on sm_120, so keep 1).
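
The environment variables and defaults listed above could be read along the following lines. This is a hedged sketch, not the scripts' actual code: load_settings and its dictionary keys are hypothetical, and the fallback to the current directory stands in for the scripts' own __file__-based detection.

```python
import os
from pathlib import Path


def load_settings(env=None, script_dir=None):
    """Collect the documented knobs into a dict, applying the documented defaults."""
    env = os.environ if env is None else env
    # The real scripts use their own directory; cwd is a stand-in for this sketch.
    script_dir = Path(script_dir) if script_dir else Path.cwd()
    return {
        "onecomp_path": Path(env.get("ONECOMP_PATH", "OneCompression")),
        "ckpt": Path(env.get("M3_CKPT", str(script_dir))),
        "max_len": int(env.get("M3_MAXLEN", "40960")),
        "attn_backend": env.get("M3_ATTN_BACKEND", "TRITON_ATTN"),
        "port": int(env.get("M3_PORT", "8003")),
        # M3_EAGER=1 (default) means eager mode; only "0" turns it off.
        "enforce_eager": env.get("M3_EAGER", "1") != "0",
    }


defaults = load_settings(env={}, script_dir="/tmp/m3")
```

Passing a plain dict as env, as in the last line, makes the defaults easy to inspect without touching the real environment.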

Run

pip install "vllm==0.23.1"            # native MiniMax-M3 (vLLM >= 0.22 required)
# clone OneComp and check out the M3 serving tag (NOT main — main is the older Nex serving):
git clone https://github.com/mmzz164/OneCompression && git -C OneCompression checkout m3-serving-v1
export ONECOMP_PATH=$PWD/OneCompression
# run from the dir holding these files + the weights (M3_OFFICIAL_PORT is set internally):

# OpenAI API server (long context, :8003)
python m3_official_api_server.py
curl http://localhost:8003/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"minimax-m3-long","messages":[{"role":"user","content":"日本の首都は?"}],"max_tokens":30}'

# long-context needle test
M3_LONGTEST=1 python serve_m3_official.py
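
The curl request above can also be sent from Python using only the standard library. The sketch below assumes the server is running locally on port 8003 and returns the usual OpenAI chat-completions response shape; build_chat_request and ask are names invented for this example.

```python
import json
import urllib.request


def build_chat_request(prompt, max_tokens=30,
                       base_url="http://localhost:8003",
                       model="minimax-m3-long"):
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, once the server is up:
#   print(ask("日本の首都は?"))
```

Separating request construction from sending keeps the payload easy to inspect or log before anything goes over the network.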

Key engine args (see serve_m3_official.py / m3_official_api_server.py): tensor_parallel_size=2, enable_expert_parallel=True, block_size=128, attention_backend="TRITON_ATTN", gpu_memory_utilization=0.97, disable_custom_all_reduce=True, dtype="bfloat16".


Why the loader is needed (the fixes)

Our checkpoint was exported from the transformers MiniMax-M3-VL model, so its key names and quant-config naming differ from what the official vLLM model expects. A naive load silently drops the shared expert in every MoE layer, which yields garbage output. There are two root causes, both fixed here:

  1. Shared-expert module not built — the official MiniMaxM3MoE builds the shared expert only if config.n_shared_experts is truthy. The checkpoint nests n_shared_experts=1 under text_config; forcing the text-only architecture makes the model read the top-level config (no such key). → forced in the hf_overrides FORCE dict.
  2. Shared-expert weights not loaded — the per-module quant bits are keyed by the checkpoint suffix mlp.shared_experts.*, but the official model queries block_sparse_moe.shared_experts.* (startswith match) → miss → treated as unquantized → our quantized weight does not fit the param → loads as zero. → bits alias added in m3_quant.py; the fused gate_up is split into quantized gate/up shards and down_proj stays quantized in the loader.
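
Both failure modes can be reproduced in miniature. The sketch below is not the code from m3_quant.py or serve_m3_official.py: the helper names, the full module paths, and the bit width of 4 are hypothetical, chosen only to show how a prefix lookup misses without an alias and how a nested config key can be surfaced as a top-level override.

```python
def lookup_bits(per_module_bits, module_name, aliases=()):
    """Return the bit width whose key is a prefix of `module_name`, else None.

    `aliases` is a sequence of (official_fragment, checkpoint_fragment) pairs;
    each match adds a renamed candidate so checkpoint-keyed entries are found.
    """
    candidates = [module_name]
    for official, checkpoint in aliases:
        if official in module_name:
            candidates.append(module_name.replace(official, checkpoint))
    for name in candidates:
        for prefix, bits in per_module_bits.items():
            if name.startswith(prefix):
                return bits
    return None  # caller would treat the module as unquantized


def hoist_keys(config, nested="text_config", keys=("n_shared_experts",)):
    """Build top-level overrides for keys that exist only under `nested`."""
    sub = config.get(nested, {})
    return {k: sub[k] for k in keys if k in sub and k not in config}


# Hypothetical names: the table is keyed by checkpoint naming,
# while the model asks using the official naming.
bits_table = {"model.layers.3.mlp.shared_experts.": 4}
official_name = "model.layers.3.block_sparse_moe.shared_experts.down_proj"
ALIASES = (("block_sparse_moe.shared_experts", "mlp.shared_experts"),)

missed = lookup_bits(bits_table, official_name)            # None: the silent miss
found = lookup_bits(bits_table, official_name, ALIASES)    # 4: alias resolves it
overrides = hoist_keys({"text_config": {"n_shared_experts": 1}})
```

The important property is that the miss returns None rather than raising, which is why the original failure was silent.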

Other notes:

  • attention_backend="TRITON_ATTN" is required: the MSA sparse layers need block_size=128, and FLASHINFER for the dense layers has no common block size with it (No common block size for 128). TRITON_ATTN is numerically correct on sm_120 (verified cos=1.0 vs a full-attention reference, dense + sparse).
  • On sm_120 the main sparse impl is the Triton kernel, not fmha_sm100 (that is SM100/datacenter-Blackwell only).
  • disable_custom_all_reduce=True (custom all-reduce is unreliable on sm_120).
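
The "cos=1.0" figure in the first note refers to cosine similarity between a backend's output and a reference output. A minimal pure-Python version of that metric is sketched below; how the outputs were actually captured and compared is not described in this card, so the function is illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Cosine similarity ignores overall scale, so a value of 1.0 shows the outputs point in the same direction; a separate magnitude check (for example, maximum absolute difference) is needed to confirm they also match in size.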

License & attribution

This is a quantized derivative of MiniMaxAI/MiniMax-M3, redistributed under the MiniMax Community License (license_name: minimax-community). The full license text is included as LICENSE (copied verbatim from the base model) and must be shipped with these weights.

  • Non-commercial use: the base license permits use/copy/modify/publish/distribute for non-commercial purposes, so this quantized checkpoint may be redistributed freely for research/personal use with attribution + the LICENSE file included.
  • Commercial use (downstream users): per the MiniMax Community License, if you use this model (a derivative of MiniMax-M3) commercially you must (1) prominently display “Built with MiniMax M3”, and (2) send a one-time notice to api@minimax.io (subject “M3 licensing — notice”), or obtain prior written authorization if your product earns > US$20M/yr. See LICENSE for the exact terms and the Prohibited-Uses appendix.
  • This port's glue code (m3_quant.py, m3_official_loader.py, serve_m3_official.py, m3_official_api_server.py) and the OneComp gptq plugin are MIT.

Built with MiniMax M3.
