
MiniMax-M3 · AutoRound 3.2-bit · long-context vLLM port

Mixed 2/3/4/8-bit (AutoRound) quantization of MiniMax-M3 (~428B-param MoE, 60 layers, 128 experts top-4 + 1 shared expert) served on the official native vLLM 0.23.1 MiniMax-M3 model. Unlike the out-of-tree port (≤2048 ctx), this uses the model's MSA "lightning indexer" so it does long context for free — validated to 30K tokens (KV holds ~46K) on 2× RTX PRO 6000 Blackwell.

This card documents the integration (checkpoint + loader + serving). The quantization itself is OneComp's mixed-bit GPTQ pipeline.

Status / results (2026-06-17)


check	result
Correctness	✅ "The capital of France is" → " Paris. …London…Madrid"; "日本の首都は" → "東京です。"
Long context	✅ needle-in-haystack retrieved at 4K / 7K / 14K / 30K tokens (with the needle in the middle of the context, not just the start)
Accuracy (PPL)	wikitext-2 ≈ 5.2–6.1 (BF16 baseline ≈ 5.3) ⇒ ~+14% from 3.2-bit; numerically faithful
Footprint	87.6 GiB per GPU (×2), KV 46,464 tokens
Throughput	~13 tok/s on 2× RTX PRO 6000 (Blackwell)
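The needle-in-haystack check in the table can be pictured with a small prompt builder. This is a minimal sketch, not the code behind `M3_LONGTEST=1`: the filler sentences, the passcode wording, and the names `build_needle_prompt` and `passed` are all hypothetical. It only shows the idea of placing one fact at a chosen depth (the middle by default) and checking that the completion repeats it.

```python
import random


def build_needle_prompt(n_filler, depth=0.5, secret="amber-falcon", seed=0):
    """Return (prompt, secret) with one needle sentence placed at `depth`.

    depth=0.0 puts the needle first, 1.0 puts it last, 0.5 in the middle.
    All wording here is illustrative, not taken from the actual test.
    """
    rng = random.Random(seed)
    filler = [
        f"Note {i}: archive shelf {rng.randint(100, 999)} was checked."
        for i in range(n_filler)
    ]
    needle = f"The secret passcode is {secret}."
    pos = int(len(filler) * depth)
    lines = filler[:pos] + [needle] + filler[pos:]
    question = "What is the secret passcode?"
    return "\n".join(lines) + "\n\n" + question, secret


def passed(completion, secret):
    """The test passes if the model's answer contains the hidden fact."""
    return secret in completion
```

The number of filler lines only approximates the context length; to hit a target such as 30K tokens, the prompt would need to be measured with the model's own tokenizer.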

Hardware / software

GPUs: 2× RTX PRO 6000 Blackwell, sm_120 (cap 12.0), ~95 GiB each.
vLLM: 0.23.1 (native MiniMaxM3SparseForCausalLM).
OneComp gptq plugin: https://github.com/mmzz164/OneCompression (vllm_plugins/gptq, MIT) — pin tag m3-serving-v1 (this port needs the vLLM ≥ 0.22 / auto_gptq + swigluoai revision; the default main is the older vLLM < 0.22 serving used by the Nex-N2-Pro card).
Checkpoint: AutoRound 3.2-bit (MiniMax-M3-w16g128, ~176 GiB).

Files (this port)

file	role
`m3_quant.py`	registers the `autoround_mixed` quant config; parses AutoRound per-module bits; adds the shared-expert `block_sparse_moe.` bits alias; keeps only `qkv_proj` in bf16
`m3_official_loader.py`	key-translating loader: our transformers-VL names → official names; de-quants the fused-indexer qkv to bf16; splits the fused GPTQ gate_up into quantized gate/up shards
`serve_m3_official.py`	one-shot generate + `_override_quant_method` (the config FORCE dict incl. `n_shared_experts=1`, swigluoai, rope θ=5e6, TRITON_ATTN); `M3_LONGTEST=1` = needle test
`m3_official_api_server.py`	OpenAI-compatible API server (`:8003`, model `minimax-m3-long`)
`ppl_official_offline.py`	wikitext-2 PPL via prompt_logprobs
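For the last row, perplexity follows directly from per-token log-probabilities. The sketch below assumes natural-log probabilities for each scored token have already been collected; how `ppl_official_offline.py` extracts them from vLLM's `prompt_logprobs` output is not shown in this card, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per scored token."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def corpus_perplexity(windows):
    """Pool several windows by token count.

    Summing log-probs over all windows and dividing by the total token
    count is not the same as averaging the per-window perplexities.
    """
    total = sum(sum(w) for w in windows)
    count = sum(len(w) for w in windows)
    if count == 0:
        raise ValueError("need at least one scored token")
    return math.exp(-total / count)
```

A model that assigns probability 0.5 to every token has perplexity 2, which is a quick sanity check for the sign and the log base.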

Configuration

The scripts auto-detect their own directory (__file__), so just keep the code and the weight files together (as in this repo). Configure via env vars:

env var	purpose	default
`ONECOMP_PATH`	path to your clone of https://github.com/mmzz164/OneCompression at tag `m3-serving-v1`	`OneCompression` (cwd)
`M3_CKPT`	dir holding the quantized weights (`*.safetensors`, `config.json`, …)	the script's own dir (this repo)

Other knobs: M3_MAXLEN (default 40960), M3_ATTN_BACKEND (default TRITON_ATTN), M3_PORT (default 8003), M3_EAGER (default 1 = eager mode; 0 enables CUDA graphs, which crash on sm_120, so keep it at 1).
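As an illustration of how these variables and defaults fit together, here is a minimal sketch of reading them in one place. The `M3Settings` class and `load_settings` function are hypothetical and not part of the shipped scripts; only the variable names and default values come from this section.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class M3Settings:
    onecomp_path: str
    ckpt_dir: str
    max_len: int
    attn_backend: str
    port: int
    eager: bool


def load_settings(env=None, script_dir="."):
    """Read the documented env vars, falling back to the documented defaults.

    `script_dir` stands in for the directory the real scripts detect
    from __file__.
    """
    env = os.environ if env is None else env
    return M3Settings(
        onecomp_path=env.get("ONECOMP_PATH", "OneCompression"),
        ckpt_dir=env.get("M3_CKPT", script_dir),
        max_len=int(env.get("M3_MAXLEN", "40960")),
        attn_backend=env.get("M3_ATTN_BACKEND", "TRITON_ATTN"),
        port=int(env.get("M3_PORT", "8003")),
        # Anything other than "0" keeps eager mode on (the safe choice on sm_120).
        eager=env.get("M3_EAGER", "1") != "0",
    )
```

Passing a plain dict as `env` makes the defaults easy to inspect without touching the real environment.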

Run

```bash
pip install "vllm==0.23.1"            # native MiniMax-M3 (vLLM >= 0.22 required)
# clone OneComp and check out the M3 serving tag (NOT main; main is the older Nex serving):
git clone https://github.com/mmzz164/OneCompression && git -C OneCompression checkout m3-serving-v1
export ONECOMP_PATH=$PWD/OneCompression
# run from the dir holding these files + the weights (M3_OFFICIAL_PORT is set internally):

# OpenAI API server (long context, :8003)
python m3_official_api_server.py
curl http://localhost:8003/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"minimax-m3-long","messages":[{"role":"user","content":"日本の首都は?"}],"max_tokens":30}'

# long-context needle test
M3_LONGTEST=1 python serve_m3_official.py
```

Key engine args (see serve_m3_official.py / m3_official_api_server.py): tensor_parallel_size=2, enable_expert_parallel=True, block_size=128, attention_backend="TRITON_ATTN", gpu_memory_utilization=0.97, disable_custom_all_reduce=True, dtype="bfloat16".
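The curl call from the Run section can also be issued from Python with only the standard library. This is a sketch under the assumption that the server follows the usual OpenAI chat-completions response shape (`choices[0].message.content`); the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, host="localhost", port=8003,
                       model="minimax-m3-long", max_tokens=30):
    """Build the same POST request as the curl example, without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat("日本の首都は?")` sends the same question as the curl example.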

Why the loader is needed (the fixes)

Our checkpoint was exported from the transformers MiniMax-M3-VL model, so its key names and quant-config naming differ from what the official vLLM model expects. Loading it naively silently drops the shared expert in every MoE layer → garbage. Two root causes, both fixed here:

Shared-expert module not built — the official MiniMaxM3MoE builds the shared expert only if config.n_shared_experts is truthy. The checkpoint nests n_shared_experts=1 under text_config; forcing the text-only architecture makes the model read the top-level config (no such key). → forced in the hf_overrides FORCE dict.
Shared-expert weights not loaded — the per-module quant bits are keyed by the checkpoint suffix mlp.shared_experts.*, but the official model queries block_sparse_moe.shared_experts.* (startswith match) → miss → treated as unquantized → our quantized weight does not fit the param → loads as zero. → bits alias added in m3_quant.py; the fused gate_up is split into quantized gate/up shards and down_proj stays quantized in the loader.
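Both failure modes can be reproduced on toy data. The sketch below is illustrative only: the module names, the shape of the bits table, and the functions `hoist_text_config` and `lookup_bits` are assumptions, not the contents of `m3_quant.py` or the loader. Note also that the port forces `n_shared_experts=1` through the `hf_overrides` FORCE dict, whereas this sketch copies the nested value up, which is a different way of getting the same effect.

```python
def hoist_text_config(config, keys=("n_shared_experts",)):
    """Copy keys nested under text_config to the top level if missing there."""
    out = dict(config)
    nested = config.get("text_config", {})
    for key in keys:
        if key not in out and key in nested:
            out[key] = nested[key]
    return out


# official module-name fragment -> checkpoint fragment (assumed shapes)
ALIASES = {"block_sparse_moe.shared_experts.": "mlp.shared_experts."}


def lookup_bits(module_name, bits_by_prefix, aliases=ALIASES):
    """Return the quant bits for module_name, or None if nothing matches.

    None corresponds to the failure described above: the layer is treated
    as unquantized and the quantized weight no longer fits the parameter.
    """
    candidates = [module_name]
    for official, ckpt in aliases.items():
        if official in module_name:
            candidates.append(module_name.replace(official, ckpt))
    for name in candidates:
        for prefix, bits in bits_by_prefix.items():
            if name.startswith(prefix):
                return bits
    return None
```

Calling `lookup_bits` with `aliases={}` reproduces the miss, and the default alias table resolves it; the real checkpoint's key layout may differ from the toy names used here.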

Other notes:

attention_backend="TRITON_ATTN" is required: the MSA sparse layers need block_size=128, and FLASHINFER for the dense layers has no common block size with it (No common block size for 128). TRITON_ATTN is numerically correct on sm_120 (verified cos=1.0 vs a full-attention reference, dense + sparse).
On sm_120 the main sparse impl is the Triton kernel, not fmha_sm100 (that is SM100/datacenter-Blackwell only).
disable_custom_all_reduce=True (custom all-reduce is unreliable on sm_120).

License & attribution

This is a quantized derivative of MiniMaxAI/MiniMax-M3, redistributed under the MiniMax Community License (license_name: minimax-community). The full license text is included as LICENSE (copied verbatim from the base model) and must be shipped with these weights.

Non-commercial use: the base license permits use/copy/modify/publish/distribute for non-commercial purposes, so this quantized checkpoint may be redistributed freely for research/personal use with attribution + the LICENSE file included.
Commercial use (downstream users): per the MiniMax Community License, if you use this model (a derivative of MiniMax-M3) commercially you must (1) prominently display “Built with MiniMax M3”, and (2) send a one-time notice to api@minimax.io (subject “M3 licensing — notice”), or obtain prior written authorization if your product earns > US$20M/yr. See LICENSE for the exact terms and the Prohibited-Uses appendix.
This port's glue code (m3_quant.py, m3_official_loader.py, serve_m3_official.py, m3_official_api_server.py) and the OneComp gptq plugin are MIT.

Built with MiniMax M3.
