MiniMax-M3-MXFP8-64e (50% expert pruning)

Training-free expert-pruned variant of MiniMaxAI/MiniMax-M3-MXFP8. Each MoE layer is reduced from 128 → 64 routed experts (pruning ratio 50%), keeping num_experts_per_tok = 4 and the shared expert intact.

value
Source MiniMaxAI/MiniMax-M3-MXFP8 (~428B total, ~23B active, MXFP8)
Routed experts/layer 64 (was 128)
Pruning ratio 50%
MoE layers pruned 57 (layers 3–59; layers 0–2 are dense)
Top-k routing 4 (unchanged)
Shared expert kept (1, unchanged)
Size ~231 GB (was ~444 GB)
Fits 4× H100 NVL (TP=4)

Method (training-free, no fine-tune)

Routing-mass importance calibration, following the expert_pruning methodology (adapted to M3's 128-expert / top-4 / MXFP8 / shared-expert MoE):

  1. Calibration — 64 mixed prompts (AI4Code / Nokia, general English, multilingual, reasoning) run through the unpruned model in vLLM (TP=8).
  2. Importance — per MoE layer, accumulate each expert's selected probability mass: sigmoid(router_logits) (+ e_score_correction_bias) → top-4 → renormalized weights summed over all calibration tokens. NaN/Inf masses from rare degenerate tokens are treated as lowest priority.
  3. Select — keep the top-64 experts per layer (multiple of 8, EP-clean), deterministic tie-break by (mass desc, index asc). 57/57 layers had a non-negative kept/drop margin (median gap ≈ 1.0).
  4. Slice — atomic per-layer surgery: gate.weight row-slice [kept], e_score_correction_bias[kept], drop unkept experts' six MXFP8 tensors (w{1,2,3}.{weight, weight_scale_inv}), renumber survivors 0..63. FP8 weights and their block weight_scale_inv scales are copied whole — no dequant. config.num_local_experts = 64. Everything else (attention/MSA, shared experts, dense layers, vision tower, projector, embeddings, lm_head) is byte-identical to the source.
  5. Verify — every MoE layer has exactly 64 contiguous experts × 6 tensors, gate (64, 6144), bias (64,).

"Rerouting" is handled by construction: top-4 over the surviving 64 experts re-normalizes automatically; a token whose first-choice expert was dropped falls through to its next-best survivor.

Serving (vLLM)

Requires the MiniMax-M3 vLLM build (M3 support is not yet in a stable release):

vllm serve morriszjm/MiniMax-M3-MXFP8-64e \
  --tensor-parallel-size 4 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --max-model-len 32768

Verified booting and answering on 4× H100 NVL.

Limitations

  • Training-free: no fine-tune / distillation recovery. Expect quality regression vs. the unpruned model — coherent, grammatical, on-topic answers, but more hallucination on factual recall at 50% pruning.
  • Importance is text-calibrated; vision/multimodal-specific expert utility was not separately analyzed.
  • Uniform per-layer K (v1). Per-layer adaptive K is future work.

Produced by the Nokia onboarding_demo/expert_pruning pipeline (M3 adaptation).

Downloads last month
829
Safetensors
Model size
227B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for morriszjm/MiniMax-M3-MXFP8-64e

Quantized
(1)
this model