MiniMax-M3 · AutoRound 3.2-bit · long-context vLLM port
Mixed 2/3/4/8-bit (AutoRound) quantization of MiniMax-M3 (~428B-param MoE, 60 layers, 128 experts top-4 + 1 shared expert) served on the official native vLLM 0.23.1 MiniMax-M3 model. Unlike the out-of-tree port (≤2048 ctx), this uses the model's MSA "lightning indexer" so it does long context for free — validated to 30K tokens (KV holds ~46K) on 2× RTX PRO 6000 Blackwell.
This card documents the integration (checkpoint + loader + serving). The quantization itself is OneComp's mixed-bit GPTQ pipeline.
Status / results (2026-06-17)
| check | result |
|---|---|
| Correctness | ✅ "The capital of France is" → " Paris. …London…Madrid"; "日本の首都は" → "東京です。" |
| Long context | ✅ needle-in-haystack retrieved at 4K / 7K / 14K / 30K tokens (with the needle in the middle of the context, not just the start) |
| Accuracy (PPL) | wikitext-2 ≈ 5.2–6.1 (BF16 baseline ≈ 5.3) ⇒ ~+14% from 3.2-bit; numerically faithful |
| Footprint | 87.6 GiB per GPU (×2), KV 46,464 tokens |
| Throughput | ~13 tok/s on 2× RTX PRO 6000 (Blackwell) |
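A needle-in-haystack check like the one in the table places a short fact inside a long run of filler text and then asks for it back. The sketch below only illustrates how such a prompt could be assembled with the needle at a chosen depth. The function name, its parameters, and the filler strategy are hypothetical; this is not the code used by `serve_m3_official.py`.

```python
def build_needle_prompt(filler: str, needle: str, question: str,
                        n_fillers: int, depth: float = 0.5) -> str:
    """Assemble a long prompt with `needle` inserted at a fractional depth.

    Hypothetical helper: depth=0.0 puts the needle first, depth=1.0 puts it
    after all filler chunks, depth=0.5 puts it in the middle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    chunks = [filler] * n_fillers
    # Insert the needle between filler chunks at the requested depth.
    chunks.insert(int(n_fillers * depth), needle)
    # The retrieval question always goes last.
    return "\n".join(chunks + [question])
```

Raising `n_fillers` until the tokenized prompt reaches the target length (4K, 7K, 14K, 30K) would reproduce the shape of the test, assuming the tokenizer is used to measure length.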
Hardware / software
- GPUs: 2× RTX PRO 6000 Blackwell, sm_120 (cap 12.0), ~95 GiB each.
- vLLM: 0.23.1 (native `MiniMaxM3SparseForCausalLM`).
- OneComp gptq plugin: https://github.com/mmzz164/OneCompression (`vllm_plugins/gptq`, MIT); pin tag `m3-serving-v1` (this port needs the vLLM ≥ 0.22 / `auto_gptq` + `swigluoai` revision; the default `main` is the older vLLM < 0.22 serving used by the Nex-N2-Pro card).
- Checkpoint: AutoRound 3.2-bit (`MiniMax-M3-w16g128`, ~176 GiB).
Files (this port)
| file | role |
|---|---|
| `m3_quant.py` | registers the `autoround_mixed` quant config; parses AutoRound per-module bits; shared-expert `block_sparse_moe.*` bits alias; keeps `qkv_proj` bf16 only |
| `m3_official_loader.py` | key-translating loader: our transformers-VL names → official names; de-quants the fused-indexer qkv to bf16; splits the fused GPTQ `gate_up` into quantized gate/up shards |
| `serve_m3_official.py` | one-shot generate + `_override_quant_method` (the config FORCE dict incl. `n_shared_experts=1`, `swigluoai`, rope θ=5e6, `TRITON_ATTN`); `M3_LONGTEST=1` = needle test |
| `m3_official_api_server.py` | OpenAI-compatible API server (:8003, model `minimax-m3-long`) |
| `ppl_official_offline.py` | wikitext-2 PPL via `prompt_logprobs` |
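For readers unfamiliar with the PPL measurement, perplexity is the exponential of the negative mean per-token log-probability. The sketch below shows that reduction on a plain list of log-probabilities. It is an illustration, not the contents of `ppl_official_offline.py`; the function name is hypothetical, and the `None` handling assumes that the first prompt position carries no log-probability, as is typical for prompt log-probs.

```python
import math
from typing import Iterable, Optional


def perplexity(token_logprobs: Iterable[Optional[float]]) -> float:
    """Return exp(-mean(logprob)) over the tokens that were actually scored."""
    # Skip positions without a score (assumed: the first prompt token).
    scored = [lp for lp in token_logprobs if lp is not None]
    if not scored:
        raise ValueError("no scored tokens")
    return math.exp(-sum(scored) / len(scored))
```

A sequence in which every scored token has probability 0.5 gives a perplexity of 2, which is a quick sanity check for any implementation.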
Configuration
The scripts auto-detect their own directory (__file__), so just keep the code
and the weight files together (as in this repo). Configure via env vars:
| env var | purpose | default |
|---|---|---|
| `ONECOMP_PATH` | path to your clone of https://github.com/mmzz164/OneCompression at tag `m3-serving-v1` | `OneCompression` (cwd) |
| `M3_CKPT` | dir holding the quantized weights (`*.safetensors`, `config.json`, …) | the script's own dir (this repo) |
Other knobs: `M3_MAXLEN` (default 40960), `M3_ATTN_BACKEND` (default `TRITON_ATTN`),
`M3_PORT` (default 8003), `M3_EAGER` (default 1; 0 enables CUDA graphs, which crash on sm_120, so keep 1).
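As a rough illustration of how these knobs could be read, the sketch below parses the four variables with the defaults listed above. The `M3Settings` dataclass and `load_settings` function are hypothetical names, and treating any `M3_EAGER` value other than `"0"` as eager mode is an assumption; the actual scripts may parse the variables differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class M3Settings:
    max_len: int       # M3_MAXLEN
    attn_backend: str  # M3_ATTN_BACKEND
    port: int          # M3_PORT
    eager: bool        # M3_EAGER (keep True on sm_120)


def load_settings(env: Optional[Mapping[str, str]] = None) -> M3Settings:
    """Read the documented knobs, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return M3Settings(
        max_len=int(env.get("M3_MAXLEN", "40960")),
        attn_backend=env.get("M3_ATTN_BACKEND", "TRITON_ATTN"),
        port=int(env.get("M3_PORT", "8003")),
        # Assumption: only the literal "0" turns eager mode off.
        eager=env.get("M3_EAGER", "1") != "0",
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to check without touching the process environment.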
Run
```bash
pip install "vllm==0.23.1" # native MiniMax-M3 (vLLM >= 0.22 required)
# clone OneComp and check out the M3 serving tag (NOT main; main is the older Nex serving):
git clone https://github.com/mmzz164/OneCompression && git -C OneCompression checkout m3-serving-v1
export ONECOMP_PATH=$PWD/OneCompression
# run from the dir holding these files + the weights (M3_OFFICIAL_PORT is set internally):
# OpenAI API server (long context, :8003)
python m3_official_api_server.py
curl http://localhost:8003/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"minimax-m3-long","messages":[{"role":"user","content":"日本の首都は?"}],"max_tokens":30}'
# long-context needle test
M3_LONGTEST=1 python serve_m3_official.py
```
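The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the `curl` call: the endpoint, port, and model name come from this card, while the helper names `build_chat_request` and `chat` are hypothetical. The response parsing assumes the standard OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(prompt: str, host: str = "http://localhost:8003",
                       max_tokens: int = 30) -> urllib.request.Request:
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "minimax-m3-long",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant text (server must be running)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("日本の首都は?")` would return the model's reply as a string.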
Key engine args (see serve_m3_official.py / m3_official_api_server.py):
tensor_parallel_size=2, enable_expert_parallel=True, block_size=128, attention_backend="TRITON_ATTN", gpu_memory_utilization=0.97, disable_custom_all_reduce=True, dtype="bfloat16".
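Collected in one place, the engine arguments above might look like the dictionary below. This is a sketch under assumptions: the function name is hypothetical, and the `model` and `max_model_len` entries (mapped here from `M3_CKPT` and `M3_MAXLEN`) are guesses at how the scripts wire the env vars. Whether the scripts pass these to `vllm.LLM` or to an async engine is not stated in this card.

```python
from typing import Any, Dict


def build_engine_kwargs(ckpt_dir: str, max_len: int = 40960) -> Dict[str, Any]:
    """Return the engine arguments listed in this card as a kwargs dict."""
    return {
        # Assumed wiring: checkpoint dir and context length from the env knobs.
        "model": ckpt_dir,
        "max_model_len": max_len,
        # Values below are taken verbatim from the card.
        "tensor_parallel_size": 2,
        "enable_expert_parallel": True,
        "block_size": 128,
        "attention_backend": "TRITON_ATTN",
        "gpu_memory_utilization": 0.97,
        "disable_custom_all_reduce": True,
        "dtype": "bfloat16",
    }
```

Keeping the arguments in a single function makes it harder for the one-shot script and the API server to drift apart.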
Why the loader is needed (the fixes)
Our checkpoint was exported from the transformers MiniMax-M3-VL model, so its key names and quant-config naming differ from what the official vLLM model expects. Loading it naively silently drops the shared expert in every MoE layer → garbage. Two root causes, both fixed here:
- Shared-expert module not built: the official `MiniMaxM3MoE` builds the shared expert only if `config.n_shared_experts` is truthy. The checkpoint nests `n_shared_experts=1` under `text_config`; forcing the text-only architecture makes the model read the top-level config (no such key). → forced in the `hf_overrides` FORCE dict.
- Shared-expert weights not loaded: the per-module quant bits are keyed by the checkpoint suffix `mlp.shared_experts.*`, but the official model queries `block_sparse_moe.shared_experts.*` (`startswith` match) → miss → treated as unquantized → our quantized weight does not fit the param → loads as zero. → bits alias added in `m3_quant.py`; the fused `gate_up` is split into quantized gate/up shards and `down_proj` stays quantized in the loader.
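Both fixes can be illustrated in a few lines. The sketch below is not the code in `m3_quant.py` or `serve_m3_official.py`: the function names, the `ALIASES` table, and the example module paths are hypothetical, and the real port forces `n_shared_experts=1` directly rather than copying it from `text_config`. It only shows the two ideas: lifting the nested key to the top level, and retrying the `startswith` bits lookup under the checkpoint's naming.

```python
from typing import Any, Dict, Optional

# Official module naming -> checkpoint naming (illustrative alias table).
ALIASES = {"block_sparse_moe.shared_experts.": "mlp.shared_experts."}


def promote_shared_experts(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return overrides lifting n_shared_experts from text_config to top level."""
    nested = config.get("text_config", {})
    if "n_shared_experts" not in config and "n_shared_experts" in nested:
        return {"n_shared_experts": nested["n_shared_experts"]}
    return {}


def lookup_bits(module_name: str,
                bits_by_prefix: Dict[str, int]) -> Optional[int]:
    """Prefix-match a module name against the bits table, trying aliases too."""
    candidates = [module_name]
    for official, ckpt in ALIASES.items():
        if official in module_name:
            candidates.append(module_name.replace(official, ckpt))
    for name in candidates:
        for prefix, bits in bits_by_prefix.items():
            if name.startswith(prefix):
                return bits
    # No match: the caller would treat the module as unquantized.
    return None
```

Without the alias, a query under the official name returns `None`, which corresponds to the "treated as unquantized" failure described above.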
Other notes:
- `attention_backend="TRITON_ATTN"` is required: the MSA sparse layers need `block_size=128`, and FLASHINFER for the dense layers has no common block size with it (`No common block size for 128`). TRITON_ATTN is numerically correct on sm_120 (verified cos=1.0 vs a full-attention reference, dense + sparse).
- On sm_120 the main sparse impl is the Triton kernel, not `fmha_sm100` (that is SM100/datacenter-Blackwell only).
- `disable_custom_all_reduce=True` (custom all-reduce is unreliable on sm_120).
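The "cos=1.0" figure refers to cosine similarity between the backend's output and a reference output. A minimal pure-Python version of that metric is sketched below for flattened output vectors. The function name is hypothetical, and how the reference tensors were actually captured and compared is not described in this card.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores overall scale, so a value near 1.0 shows the outputs point in the same direction; an additional absolute-error check would be needed to confirm matching magnitudes.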
License & attribution
This is a quantized derivative of MiniMaxAI/MiniMax-M3, redistributed under the
MiniMax Community License (license_name: minimax-community). The full license
text is included as LICENSE (copied verbatim from the base model) and must be
shipped with these weights.
- Non-commercial use: the base license permits use/copy/modify/publish/distribute for non-commercial purposes, so this quantized checkpoint may be redistributed freely for research/personal use with attribution + the LICENSE file included.
- Commercial use (downstream users): per the MiniMax Community License, if you use this model (a derivative of MiniMax-M3) commercially you must (1) prominently display “Built with MiniMax M3”, and (2) send a one-time notice to api@minimax.io (subject “M3 licensing — notice”), or obtain prior written authorization if your product earns > US$20M/yr. See `LICENSE` for the exact terms and the Prohibited-Uses appendix.
- This port's glue code (`m3_quant.py`, `m3_official_loader.py`, `serve_m3_official.py`, `m3_official_api_server.py`) and the OneComp gptq plugin are MIT.
Built with MiniMax M3.