MiniMax-M3-MXFP8-64e (50% expert pruning)
Training-free expert-pruned variant of
MiniMaxAI/MiniMax-M3-MXFP8.
Each MoE layer is reduced from 128 → 64 routed experts (pruning ratio
50%), keeping num_experts_per_tok = 4 and the shared expert intact.
| Property | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M3-MXFP8 (~428B total, ~23B active, MXFP8) |
| Routed experts/layer | 64 (was 128) |
| Pruning ratio | 50% |
| MoE layers pruned | 57 (layers 3–59; layers 0–2 are dense) |
| Top-k routing | 4 (unchanged) |
| Shared expert | kept (1, unchanged) |
| Size | ~231 GB (was ~444 GB) |
| Fits | 4× H100 NVL (TP=4) |
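The size and fit rows can be sanity-checked with a little arithmetic. The sketch below assumes that all routed experts are the same size, that the table's figures are exact enough for a rough estimate, and that each H100 NVL provides 94 GB of HBM; the variable names are illustrative only.

```python
# Rough arithmetic on the table's approximate figures (all values in GB).
GB_BEFORE = 444   # source checkpoint
GB_AFTER = 231    # pruned checkpoint
GPUS = 4
GB_PER_GPU = 94   # assumption: H100 NVL HBM capacity per GPU

removed = GB_BEFORE - GB_AFTER            # weight of the dropped experts
routed_before = 2 * removed               # half of the routed experts were dropped
non_expert = GB_BEFORE - routed_before    # attention, shared experts, dense layers, etc.
headroom = GPUS * GB_PER_GPU - GB_AFTER   # left for KV cache and activations

print(f"removed={removed} GB, routed_before={routed_before} GB")
print(f"non_expert={non_expert} GB, headroom={headroom} GB")
```

Under these assumptions the routed experts account for roughly 96% of the source checkpoint, which is why halving them nearly halves the model. The headroom figure ignores runtime overhead, so the usable KV-cache budget will be smaller.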
Method (training-free, no fine-tune)
Routing-mass importance calibration, following the expert_pruning
methodology (adapted to M3's 128-expert / top-4 / MXFP8 / shared-expert MoE):
- Calibration: 64 mixed prompts (AI4Code / Nokia, general English, multilingual, reasoning) run through the unpruned model in vLLM (TP=8).
- Importance: per MoE layer, accumulate each expert's selected probability mass: sigmoid(router_logits) (+ e_score_correction_bias) → top-4 → renormalized weights, summed over all calibration tokens. NaN/Inf masses from rare degenerate tokens are treated as lowest priority.
- Select: keep the top-64 experts per layer (multiple of 8, EP-clean), with a deterministic tie-break by (mass desc, index asc). 57/57 layers had a non-negative kept/drop margin (median gap ≈ 1.0).
- Slice: atomic per-layer surgery: gate.weight row-slice [kept], e_score_correction_bias[kept], drop unkept experts' six MXFP8 tensors (w{1,2,3}.{weight, weight_scale_inv}), renumber survivors 0..63. FP8 weights and their block weight_scale_inv scales are copied whole, with no dequantization. config.num_local_experts = 64. Everything else (attention/MSA, shared experts, dense layers, vision tower, projector, embeddings, lm_head) is byte-identical to the source.
- Verify: every MoE layer has exactly 64 contiguous experts × 6 tensors, gate (64, 6144), bias (64,).
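The Importance and Select steps can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's code: the function names are hypothetical, and it assumes the correction bias only affects which experts are chosen while the mixing weights come from the unbiased sigmoid scores.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def accumulate_routing_mass(token_logits, bias, top_k=4):
    """Sum each expert's renormalized top-k routing weight over all tokens."""
    num_experts = len(bias)
    mass = [0.0] * num_experts
    for logits in token_logits:
        scores = [sigmoid(z) for z in logits]
        # Assumption: bias is used for selection only, not for the weights.
        biased = [s + b for s, b in zip(scores, bias)]
        chosen = sorted(range(num_experts), key=lambda e: (-biased[e], e))[:top_k]
        total = sum(scores[e] for e in chosen)
        for e in chosen:
            mass[e] += scores[e] / total
    return mass

def select_experts(mass, keep):
    """Keep the `keep` highest-mass experts; ties go to the lower index."""
    def priority(e):
        m = mass[e]
        # NaN/Inf masses sort last, i.e. they have the lowest priority.
        return (-(m if math.isfinite(m) else -math.inf), e)
    ranked = sorted(range(len(mass)), key=priority)
    return sorted(ranked[:keep])  # ascending, ready for renumbering 0..keep-1

# Toy example: 2 tokens, 4 experts, top-2 routing, keep 2 experts.
toy_logits = [[2.0, 0.0, -1.0, 1.0], [0.5, 0.5, 3.0, -2.0]]
toy_mass = accumulate_routing_mass(toy_logits, bias=[0.0] * 4, top_k=2)
toy_kept = select_experts(toy_mass, keep=2)
```

Because each token's selected weights are renormalized to sum to 1, the total mass in a layer equals the number of calibration tokens, which gives a cheap consistency check on the accumulator.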
"Rerouting" is handled by construction: top-4 over the surviving 64 experts re-normalizes automatically; a token whose first-choice expert was dropped falls through to its next-best survivor.
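A toy example of this fall-through behavior is sketched below. The 4-expert gate, the 2-dimensional hidden state, and the route helper are all made up for illustration, and the correction bias is omitted for brevity.

```python
import math

def route(gate_rows, hidden, top_k):
    """Return {expert_index: weight} for one token, given per-expert gate rows."""
    scores = []
    for row in gate_rows:
        logit = sum(w * h for w, h in zip(row, hidden))
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    chosen = sorted(range(len(scores)), key=lambda e: (-scores[e], e))[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}

# Toy gate: 4 experts, hidden size 2 (hypothetical values).
gate = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hidden = [1.0, 0.5]

before = route(gate, hidden, top_k=2)   # expert 0 is the first choice

kept = [1, 2, 3]                        # expert 0 is pruned
sliced_gate = [gate[e] for e in kept]   # row-slice; survivors renumbered 0..2
after = route(sliced_gate, hidden, top_k=2)

# Map the renumbered survivors back to their original indices.
after_original_ids = {kept[e]: w for e, w in after.items()}
```

Here the token originally routes to experts 0 and 2. After expert 0 is removed, the same router picks experts 2 and 1, and the two weights again sum to 1 without any extra rerouting logic.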
Serving (vLLM)
Requires the MiniMax-M3 vLLM build (M3 support is not yet in a stable release):
vllm serve morriszjm/MiniMax-M3-MXFP8-64e \
--tensor-parallel-size 4 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice \
--trust-remote-code \
--max-model-len 32768
Verified to boot and answer requests on 4× H100 NVL.
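Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library; it assumes the server runs on the same machine on vLLM's default port 8000, and the helper names are illustrative.

```python
import json
import urllib.request

MODEL = "morriszjm/MiniMax-M3-MXFP8-64e"
# Assumption: served locally on vLLM's default port.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(ask("Summarize expert pruning in one sentence."))
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.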
Limitations
- Training-free: no fine-tune / distillation recovery. Expect quality regression vs. the unpruned model — coherent, grammatical, on-topic answers, but more hallucination on factual recall at 50% pruning.
- Importance is text-calibrated; vision/multimodal-specific expert utility was not separately analyzed.
- Uniform per-layer K (v1). Per-layer adaptive K is future work.
Produced by the Nokia onboarding_demo/expert_pruning pipeline (M3 adaptation).