MiMo-V2.5: text-only, packaged for MLX

This is a convenience repackaging of XiaomiMiMo/MiMo-V2.5 for inference on Apple Silicon via MLX.

Every individual weight value is bit-identical to the upstream release. This repository repacks the 32 source safetensors shards into a single file, stacks per-expert MoE weights along a new leading axis (required by MLX MoE loaders), and excludes multimodal weights (vision, audio, MTP heads) since this targets the text-only inference path.

Validated on M3 Ultra 512 GB: 30.2 tok/s decode, ~535 tok/s prefill at L=2048.
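As a rough guide to what these figures mean in practice, the sketch below turns them into wall-clock estimates. It assumes that "L=2048" refers to the prompt length in tokens and that both rates stay constant across lengths, which real runs will not do exactly. The helper name estimate_latency is hypothetical and not part of this repo.

```python
# Back-of-the-envelope latency estimate from the throughput figures above.
# Assumption: rates are constant regardless of prompt or output length.
PREFILL_TOK_S = 535.0  # approximate prefill rate reported above
DECODE_TOK_S = 30.2    # decode rate reported above


def estimate_latency(prompt_tokens: int, new_tokens: int) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) for one request."""
    prefill_s = prompt_tokens / PREFILL_TOK_S
    decode_s = new_tokens / DECODE_TOK_S
    return prefill_s, decode_s


prefill_s, decode_s = estimate_latency(prompt_tokens=2048, new_tokens=256)
print(f"prefill ~{prefill_s:.1f} s, decode ~{decode_s:.1f} s")
```

Under those assumptions, a 2048-token prompt takes a little under 4 seconds before the first token appears, and 256 generated tokens take about 8.5 seconds.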

Prerequisites

Three pieces are needed to run this model. None are merged upstream yet (as of this release); the README will be updated as they land.

1. MLX with block_fp8 kernels

ml-explore/mlx#3600 - block_fp8: 2D-block FP8 quantized matmul + MoE kernels

Until merged, build MLX from the PR branch:

git clone --branch fp8-block-mvp https://github.com/yohann-bearzi/mlx.git
cd mlx
CMAKE_BUILD_PARALLEL_LEVEL=$(sysctl -n hw.ncpu) \
  python3 -m pip install -e . --no-build-isolation

2. mlx_lm with mimo_v2 base support

ml-explore/mlx-lm#1219 - MiMo-V2 model support

Install mlx_lm from that branch.

3. The block_fp8 model class (shipped in this repo)

Copy mimo_v2_block_fp8.py into your mlx_lm models directory:

cp mimo_v2_block_fp8.py "$(python3 -c 'import mlx_lm, os; print(os.path.dirname(mlx_lm.__file__))')/models/"
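The same copy step can be done from Python, which avoids shell quoting issues and fails loudly if the package or its models directory cannot be found. This is a minimal sketch: the helper names models_dir and install_model_file are hypothetical, and it assumes mlx_lm is installed as a regular package with a models/ subdirectory.

```python
import importlib.util
import shutil
from pathlib import Path


def models_dir(package: str = "mlx_lm") -> Path:
    """Return the expected models/ directory inside an installed package."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"cannot locate package {package!r}")
    # spec.origin points at <package>/__init__.py
    return Path(spec.origin).parent / "models"


def install_model_file(src: str, package: str = "mlx_lm") -> Path:
    """Copy src into the package's models/ directory and return the new path."""
    dest = models_dir(package)
    if not dest.is_dir():
        raise FileNotFoundError(f"{dest} does not exist")
    return Path(shutil.copy2(src, dest / Path(src).name))


# Usage, run from the directory that contains the file:
#   install_model_file("mimo_v2_block_fp8.py")
```

With an editable install of mlx_lm, the file lands in the source checkout, so it survives reinstalling the package from that same checkout.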

Usage

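import gc
import json

import mlx.core as mx
import mlx_lm.utils as U

# Raise the wired-memory limit to 300 GiB (sized for an M3 Ultra with 512 GB).
mx.set_wired_limit(300 * 1024**3)

# Build the model from the config shipped in this repo.
with open("config.json") as f:
    cfg = json.load(f)
mc, ac = U._get_classes(cfg)  # model class and its args class
m = mc(ac.from_dict(cfg))

# Load the repacked weights and apply them through the block_fp8 model class.
w = mx.load("mimo_v2.5_block_fp8.safetensors")
w = m.sanitize_block_fp8(w)
m.apply_block_fp8(w)

# Release the temporary weight dict, then force evaluation of the parameters.
del w
gc.collect()
mx.eval(m.parameters())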
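The wired limit passed to mx.set_wired_limit is a byte count, so 300 * 1024**3 means 300 GiB. The sketch below spells out that arithmetic and checks that the limit exceeds the weight file listed in this repo. The helper name gib_to_bytes is hypothetical, and since the listed "290 GB" could be decimal or binary units, both readings are checked.

```python
GIB = 1024**3  # bytes per gibibyte
GB = 1000**3   # bytes per decimal gigabyte


def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * GIB)


wired_limit = gib_to_bytes(300)
print(wired_limit)  # 322122547200

# The weight file is listed as 290 GB; check both unit interpretations.
weights_decimal = 290 * GB
weights_binary = 290 * GIB
print(weights_decimal < wired_limit, weights_binary < wired_limit)  # True True
```

On a machine with less memory, the limit would need to be scaled down accordingly; whether the model still fits is outside what this sketch can tell you.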

Reproducing from source

If you'd rather convert from XiaomiMiMo's release yourself, this repo ships the converter script. Download Xiaomi's shards, then:

python3 convert_mimo.py --src /path/to/XiaomiMiMo/MiMo-V2.5 \
                        --out /path/to/output/mimo_v2.5_block_fp8.safetensors

The converter does two things only:

  1. Concatenates the 32 source safetensors shards into a single file.
  2. Stacks the 256 per-expert MoE weights per layer into one tensor with shape [256, ...] per projection (required by MLX MoE loaders).

No quantization, permutation, scale manipulation, or padding is applied. Conversion takes ~10 minutes on a fast SSD.
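The stacking step can be illustrated on toy data. This is a sketch only, not the code in convert_mimo.py: it uses NumPy rather than MLX, a toy expert count of 4 instead of 256, and uint8 arrays as a stand-in for raw FP8 bytes. The helper name stack_experts is hypothetical.

```python
import numpy as np


def stack_experts(per_expert: list[np.ndarray]) -> np.ndarray:
    """Stack per-expert weight arrays along a new leading (expert) axis."""
    return np.stack(per_expert, axis=0)


num_experts = 4  # toy size; the model has 256 experts per layer
rng = np.random.default_rng(0)

# uint8 stands in for raw FP8 bytes: one [8, 16] matrix per expert.
experts = [
    rng.integers(0, 256, size=(8, 16), dtype=np.uint8)
    for _ in range(num_experts)
]

stacked = stack_experts(experts)
print(stacked.shape)  # (4, 8, 16)

# Stacking only rearranges storage: each slice matches its source byte for byte.
bit_identical = all(
    stacked[i].tobytes() == w.tobytes() for i, w in enumerate(experts)
)
print(bit_identical)  # True
```

The byte-level comparison is the same kind of check that could be used to confirm a converted file against the upstream shards, one expert slice at a time.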

What's in this repo

| File | Contents |
| --- | --- |
| mimo_v2.5_block_fp8.safetensors | 290 GB; repacked weights (bit-identical to upstream) |
| config.json | upstream config + model_type set to mimo_v2_block_fp8 |
| tokenizer.json, tokenizer_config.json | upstream, verbatim |
| generation_config.json | upstream, verbatim |
| mimo_v2_block_fp8.py | MLX model class (drop into mlx_lm/models/) |
| convert_mimo.py | converter script; reproduces the weights file from XiaomiMiMo's release |
| LICENSE | MIT (matches upstream) |
| NOTICE | derivative-work statement |