MiMo-V2.5 — AWQ W4A16 (int4), A100-ready
4-bit (AWQ W4A16, routed-experts-only) quantization of MiMo-V2.5, packaged to serve on NVIDIA A100 (SM80) under stock vLLM 0.21.0 — text + vision, at TP-4 or TP-8.
The base model ships fp8 (Hopper-native) and does not run on A100. This repo is the A100 path: the int4 weights plus the two small vLLM model-code patches that make MiMo's Hopper-only attention run on Ampere — without changing the math.
- bf16 → int4: ≈581 GB → 169 GB
- Context: up to 1,048,576 tokens · text + vision · tool-calling + reasoning
Use this model (vLLM)
MiMo's default vLLM path hard-selects a FlashAttention-3 backend (SM90+ only). Bind-mount the two patch files over the image copies (details in PATCHES.md):
docker run --rm --gpus all --ipc=host -p 8000:8000 \
-v /path/to/MiMo-V2.5-AWQ-int4:/model:ro \
-v /path/to/vllm-patches/mimo_v2.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2.py:ro \
-v /path/to/vllm-patches/mimo_v2_omni.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2_omni.py:ro \
vllm/vllm-openai:v0.21.0 \
--model /model \
--served-model-name mimo-v2.5 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--trust-remote-code \
--max-model-len 262144 \
--gpu-memory-utilization 0.90
- TP-4 works too — set
--tensor-parallel-size 4. (The patches are correct at both; see the QKV note inPATCHES.md.) - Sampling:
temperature 1.0, top_p 0.95(the model's shippedgeneration_config.json; thinking-mode on). --max-model-lencan be raised toward the native 1,048,576 as VRAM allows.
Files
| file | what |
|---|---|
model-0000{1..4}-of-00004.safetensors |
int4 weights (W4A16) |
config.json, recipe.yaml |
quant config + the full quantization recipe |
modeling_mimo_v2.py, configuration_mimo_v2.py |
model code (--trust-remote-code) |
tokenizer*, chat_template.jinja, preprocessor_config.json |
tokenizer + chat / vision preprocessing |
vllm-patches/mimo_v2.py, vllm-patches/mimo_v2_omni.py |
the two vLLM serving patches — mount over the image copies |
vllm-patches/PATCHES.md |
full patch writeup + validation |
Method
AWQ W4A16, routed-experts-only. Only the MoE routed-expert projections (mlp.experts.*_proj) are quantized to 4-bit. Everything quality-sensitive stays high-precision: attention, the router/gate, shared paths, embeddings & lm_head, MTP, and the vision + audio towers are all left untouched. In addition, layer-41's experts are kept at bf16 (a composite carve-out — that one layer quantized worst).
This is why MoE tolerates 4-bit far better than dense models: the sensitive machinery is untouched and only the redundant expert bulk is compressed. The exact llm-compressor recipe (group-wise AWQ, smoothing maps, ignore list) is in recipe.yaml.
A100 serving patches
Base MiMo-V2.5 is Hopper-only in vLLM for architectural, not precision, reasons: it uses SWA attention sinks + asymmetric head dims (qk=192, v=128), so vLLM selects a FlashAttention-3 backend that asserts SM90+. On A100, FA2 supports neither sinks nor asymmetric V.
The two files fix this on stock vLLM 0.21.0, exactly (no approximation):
- Triton attention on SM80 — branch the backend by device capability: Hopper keeps the native FA3 path; SM80 uses the Triton backend, which supports attention sinks on Ampere.
- V head-dim padding (128 → 192) — Triton needs K and V at the same head size; pad V with zeros before attention and slice it back off the output. Provably exact.
- Fused-QKV de-shard / re-shard (the TP-8 fix) — the checkpoint's fused
qkv_projis pre-sharded for TP-4; a naivechunk()silently corrupts K/V at TP-8. The patch de-shards to canonical Q/K/V then re-shards for the serving TP — exact at TP-4, correct at TP-8 for both full and sliding-window layers. - (vision) Merger LayerNorm fix (
mimo_v2_omni.py) — matches the checkpoint's ownLayerNorm+ biased-linear merger (vLLM's copy used RMSNorm + bias-less).
Full derivation, validation, and the one checkpoint-specific assumption (NB=4, the quant-time TP the fused QKV is pre-sharded for) are in PATCHES.md. Bind-mounting is the zero-rebuild path; baking the two files into a derived image is the clean end-state.
Quality
Routed-experts-only 4-bit is near-lossless here: measured symmetric-KL vs the bf16 reference ≈ 0.046 — in the "good MoE 4-bit" range (dense 4-bit is typically ~0.05+). The high-precision attention/router/shared paths plus the layer-41 bf16 carve-out are what keep it faithful.
License & credits
- Original model: MiMo-V2.5 © Xiaomi, MIT license.
- int4 quantization + A100 vLLM patches by @spectator2026.
- Downloads last month
- -
Model tree for spectator2026/MiMo-V2.5-AWQ-int4
Base model
XiaomiMiMo/MiMo-V2.5