MiMo-V2.5 — AWQ W4A16 (int4), A100-ready

4-bit (AWQ W4A16, routed-experts-only) quantization of MiMo-V2.5, packaged to serve on NVIDIA A100 (SM80) under stock vLLM 0.21.0 — text + vision, at TP-4 or TP-8.

The base model ships fp8 (Hopper-native) and does not run on A100. This repo is the A100 path: the int4 weights plus the two small vLLM model-code patches that make MiMo's Hopper-only attention run on Ampere — without changing the math.

  • bf16 → int4: ≈581 GB → 169 GB
  • Context: up to 1,048,576 tokens · text + vision · tool-calling + reasoning

Use this model (vLLM)

MiMo's default vLLM path hard-selects a FlashAttention-3 backend (SM90+ only). Bind-mount the two patch files over the image copies (details in PATCHES.md):

docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -v /path/to/MiMo-V2.5-AWQ-int4:/model:ro \
  -v /path/to/vllm-patches/mimo_v2.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2.py:ro \
  -v /path/to/vllm-patches/mimo_v2_omni.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2_omni.py:ro \
  vllm/vllm-openai:v0.21.0 \
  --model /model \
  --served-model-name mimo-v2.5 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --trust-remote-code \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90
  • TP-4 works too — set --tensor-parallel-size 4. (The patches are correct at both; see the QKV note in PATCHES.md.)
  • Sampling: temperature 1.0, top_p 0.95 (the model's shipped generation_config.json; thinking-mode on).
  • --max-model-len can be raised toward the native 1,048,576 as VRAM allows.

Files

file what
model-0000{1..4}-of-00004.safetensors int4 weights (W4A16)
config.json, recipe.yaml quant config + the full quantization recipe
modeling_mimo_v2.py, configuration_mimo_v2.py model code (--trust-remote-code)
tokenizer*, chat_template.jinja, preprocessor_config.json tokenizer + chat / vision preprocessing
vllm-patches/mimo_v2.py, vllm-patches/mimo_v2_omni.py the two vLLM serving patches — mount over the image copies
vllm-patches/PATCHES.md full patch writeup + validation

Method

AWQ W4A16, routed-experts-only. Only the MoE routed-expert projections (mlp.experts.*_proj) are quantized to 4-bit. Everything quality-sensitive stays high-precision: attention, the router/gate, shared paths, embeddings & lm_head, MTP, and the vision + audio towers are all left untouched. In addition, layer-41's experts are kept at bf16 (a composite carve-out — that one layer quantized worst).

This is why MoE tolerates 4-bit far better than dense models: the sensitive machinery is untouched and only the redundant expert bulk is compressed. The exact llm-compressor recipe (group-wise AWQ, smoothing maps, ignore list) is in recipe.yaml.

A100 serving patches

Base MiMo-V2.5 is Hopper-only in vLLM for architectural, not precision, reasons: it uses SWA attention sinks + asymmetric head dims (qk=192, v=128), so vLLM selects a FlashAttention-3 backend that asserts SM90+. On A100, FA2 supports neither sinks nor asymmetric V.

The two files fix this on stock vLLM 0.21.0, exactly (no approximation):

  1. Triton attention on SM80 — branch the backend by device capability: Hopper keeps the native FA3 path; SM80 uses the Triton backend, which supports attention sinks on Ampere.
  2. V head-dim padding (128 → 192) — Triton needs K and V at the same head size; pad V with zeros before attention and slice it back off the output. Provably exact.
  3. Fused-QKV de-shard / re-shard (the TP-8 fix) — the checkpoint's fused qkv_proj is pre-sharded for TP-4; a naive chunk() silently corrupts K/V at TP-8. The patch de-shards to canonical Q/K/V then re-shards for the serving TP — exact at TP-4, correct at TP-8 for both full and sliding-window layers.
  4. (vision) Merger LayerNorm fix (mimo_v2_omni.py) — matches the checkpoint's own LayerNorm + biased-linear merger (vLLM's copy used RMSNorm + bias-less).

Full derivation, validation, and the one checkpoint-specific assumption (NB=4, the quant-time TP the fused QKV is pre-sharded for) are in PATCHES.md. Bind-mounting is the zero-rebuild path; baking the two files into a derived image is the clean end-state.

Quality

Routed-experts-only 4-bit is near-lossless here: measured symmetric-KL vs the bf16 reference ≈ 0.046 — in the "good MoE 4-bit" range (dense 4-bit is typically ~0.05+). The high-precision attention/router/shared paths plus the layer-41 bf16 carve-out are what keep it faithful.

License & credits

Downloads last month
-
Safetensors
Model size
53B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for spectator2026/MiMo-V2.5-AWQ-int4

Quantized
(24)
this model