MiMo-V2.5 Coder Q2 v2 GGUF

This is a text-only GGUF build of XiaomiMiMo/MiMo-V2.5, tuned for coding and OpenAI-compatible tool calling on high-memory local machines.

The target system for this build is a 128 GB Apple Silicon machine. The default serving profile uses a 100,000-token context and asks llama.cpp to fit as much of the model as possible onto Metal while leaving enough headroom for KV cache and runtime buffers. Smaller-memory machines will likely need a smaller context, more CPU offload, or a smaller quant.

This is not a multimodal build. MiMo-V2.5 is an omnimodal checkpoint, but this GGUF contains the text model only. The vision and audio encoders are not included. MiMo's multi-token prediction blocks were also omitted because the current llama.cpp MiMo2 generation path does not use those blocks for normal inference.

Why this build exists

Public low-bit quants of very large MoE models can be surprisingly fragile: tool calls may become malformed, code may fail on small API details, and long answers can drift into repeated reasoning loops. This build was made to spend the limited Q2-class quality budget on the workloads where MiMo-V2.5 is most useful locally:

  • coding in common systems and scripting languages
  • web UI/component generation
  • OpenAI-compatible tool calling
  • agent loops over real files and commands
  • long English technical prompts

Chinese-language quality and multimodal behavior were not optimization targets.

How it was built

The source was the original XiaomiMiMo/MiMo-V2.5 checkpoint, converted to GGUF with llama.cpp's native MiMo2 support. Conversion was text-only and omitted runtime-inactive MTP/NextN blocks, so memory is not spent on tensors that current llama.cpp MiMo2 inference does not execute.

The final artifact is a split Q2_K_S GGUF with an importance matrix built from English coding, debugging, tool-calling, shell, and agent prompts. The calibration mix was designed to make the quantizer preserve behavior that matters for developer workflows rather than generic chat breadth.

The build was iterative:

  1. Convert the original checkpoint to split BF16 GGUF.
  2. Produce a first low-bit coding/tool-use candidate.
  3. Test that candidate on executable coding tasks and realistic tool-calling agent loops.
  4. Add calibration coverage for the failures that showed up in real tests.
  5. Rebuild the importance matrix from the expanded coding/tool-use prompt mix.
  6. Re-quantize with the final Q2_K_S recipe.

The calibration text is not required to use the model. It was a build-time tool for telling the quantizer which activations mattered most: code generation, code repair, shell-style work, JSON/tool-call formatting, and agent workflows over real files.
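To make the role of the importance matrix concrete, here is a toy sketch of importance-weighted quantization. It is not llama.cpp's K-quant code: the 2-bit grid, the grid search over scales, and the `importance` values are all assumptions made for illustration. It only shows the general idea that weighting the rounding error by activation statistics changes which scale the quantizer picks.

```python
# Toy illustration only: llama.cpp's real quantizers are far more elaborate.
from typing import List, Sequence, Tuple

LEVELS = (-2, -1, 0, 1)  # a signed 2-bit grid, chosen for illustration


def quantize(values: Sequence[float], scale: float) -> List[int]:
    """Round each value to the nearest representable level at this scale."""
    return [min(LEVELS, key=lambda q: abs(v - scale * q)) for v in values]


def weighted_error(values: Sequence[float], codes: Sequence[int],
                   scale: float, weights: Sequence[float]) -> float:
    """Sum of squared reconstruction errors, each scaled by its weight."""
    return sum(w * (v - scale * q) ** 2 for v, q, w in zip(values, codes, weights))


def best_scale(values: Sequence[float], weights: Sequence[float],
               steps: int = 200) -> Tuple[float, List[int]]:
    """Grid-search the scale that minimizes the weighted squared error."""
    max_abs = max(abs(v) for v in values) or 1.0
    best = None
    for i in range(1, steps + 1):
        scale = max_abs * i / steps
        codes = quantize(values, scale)
        err = weighted_error(values, codes, scale, weights)
        if best is None or err < best[0]:
            best = (err, scale, codes)
    return best[1], best[2]


values = [0.9, -0.1, 0.35, -1.6, 0.05, 0.6]
uniform = [1.0] * len(values)
importance = [8.0, 0.1, 4.0, 0.1, 0.1, 0.1]  # hypothetical activation statistics

scale_u, codes_u = best_scale(values, uniform)
scale_w, codes_w = best_scale(values, importance)

# Score both fits with the importance weights, i.e. on what "matters".
err_u = weighted_error(values, codes_u, scale_u, importance)
err_w = weighted_error(values, codes_w, scale_w, importance)
print(f"importance-weighted error, uniform fit:    {err_u:.4f}")
print(f"importance-weighted error, importance fit: {err_w:.4f}")
```

Because both fits search the same set of candidate scales, the importance-aware fit can never score worse on the importance-weighted error. A calibration mix built from coding and tool-calling prompts plays the part of `importance` here.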

Quantization details:

  • Quant type: Q2_K_S
  • Importance matrix: coding and tool-calling focused
  • Embeddings and output tensors kept at higher precision
  • Attention and dense first-FFN tensors protected at higher precision
  • MoE down-expert tensors kept at Q3_K
  • Reported size: 108,496.76 MiB (about 106 GiB) at 2.95 bits per weight (BPW)
  • Split files: 16 GGUF shards
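The reported size and BPW figures can be cross-checked with a little arithmetic. The sketch below assumes that BPW is simply total file bits divided by the number of stored weights, which ignores metadata overhead; the function names are made up for this example.

```python
MIB = 1024 * 1024  # bytes per MiB


def weights_from_size(size_mib: float, bits_per_weight: float) -> float:
    """Number of weights implied by a file size and an average BPW."""
    return size_mib * MIB * 8 / bits_per_weight


def size_mib_for(weights: float, bits_per_weight: float) -> float:
    """File size in MiB for a given weight count and average BPW."""
    return weights * bits_per_weight / 8 / MIB


REPORTED_MIB = 108_496.76
REPORTED_BPW = 2.95

n_weights = weights_from_size(REPORTED_MIB, REPORTED_BPW)
print(f"implied weight count: {n_weights / 1e9:.1f} billion")
print(f"size in GiB: {REPORTED_MIB / 1024:.2f}")
```

Under that assumption the two reported numbers imply a little under 3.1e11 stored weights, which is a useful sanity check when comparing this build against other quants of the same checkpoint.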

One tokenizer metadata fix is included: the base-vocabulary </s> token is marked as a control token so that llama.cpp does not emit its control-looking-token warning at load time. MiMo's real EOS token remains <|im_end|>.

Why this recipe was chosen

The recipe is a compromise between quality and a hard practical limit: this model has to run locally on a 128 GB unified-memory machine. Higher-bit GGUFs of a model this large can exceed the useful memory envelope once KV cache, batching, Metal buffers, and the operating system are included.
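A rough way to see the memory envelope is to scale the reported file size to other average bit widths and subtract it from total memory. This sketch assumes that "128 GB" means 128 GiB, that size scales linearly with BPW, and that the alternative BPW values are hypothetical rather than real quant recipes. It says nothing about how much of the remainder the KV cache, Metal buffers, and the operating system actually need.

```python
MIB_PER_GIB = 1024
TOTAL_MIB = 128 * MIB_PER_GIB  # assumption: "128 GB" means 128 GiB
MODEL_MIB = 108_496.76         # reported size of this build
MODEL_BPW = 2.95               # reported average bits per weight


def headroom_mib(bits_per_weight: float, total_mib: float = TOTAL_MIB) -> float:
    """Memory left after weights if the same model were stored at this BPW."""
    weights_mib = MODEL_MIB * bits_per_weight / MODEL_BPW
    return total_mib - weights_mib


# 3.5 and 4.5 are hypothetical averages, used only to show the trend.
for bpw in (2.95, 3.5, 4.5):
    print(f"{bpw:.2f} BPW -> {headroom_mib(bpw) / MIB_PER_GIB:6.1f} GiB left")
```

A negative result means the weights alone would not fit, before any context is allocated.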

The first plain Q2_K-family candidate was small enough, but it was not reliable enough for tool calling. It produced malformed arguments in some tool calls and missed several conditional tools. The v2 recipe is larger, but it spends the extra space where it helped most:

  • embeddings and output tensors stay higher precision because they are important for token identity and exact syntax
  • attention tensors are protected because tool-call and code prompts are structure-heavy
  • the dense first FFN is protected because early-layer representation quality matters disproportionately after heavy quantization
  • MoE down-expert tensors use Q3_K, which was a better quality/memory tradeoff than pushing all expert down-projections lower

That is why this is still a Q2-class build, but not the smallest possible Q2 build.

Why it is good at coding

This quant was not chosen just because it fits in memory. It was iterated against executable tasks and then rebuilt with a stronger coding/tool-use importance matrix after early failures were identified.

The first low-bit pass exposed the kinds of issues that matter in practice: malformed tool-call arguments, brittle JavaScript Markdown parsing, incorrect Zig checked-addition APIs, and small C/C++/Go harness problems. Those failures were used to improve the calibration distribution and to validate that the final model can solve the tasks when the problem statement contains the same constraints a developer would normally give.

The final v2 artifact passed the local coding and web-design harness across:

  • Swift
  • JavaScript
  • TypeScript through Deno
  • Rust
  • C
  • C++
  • Zig
  • Python
  • Perl
  • Go
  • static HTML/CSS

That harness writes complete model-generated files into isolated directories and validates them with local compilers, runtimes, or test runners. The current v2 run passed 11/11. The checks are intentionally practical rather than benchmark-like: they catch whether the generated code compiles, runs, and handles edge cases from the prompt.
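The harness itself is not published in this card, so the following is only a minimal sketch of the pattern described above: write a complete generated file into a fresh directory, run it with a local runtime, and compare the result. The function name, the stdout-based check, and the stand-in "generated" source are assumptions. A temporary directory isolates files, not privileges, so it is not a sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def run_generated_file(source: str, filename: str, command: Sequence[str],
                       expected_stdout: str, timeout: float = 30.0) -> bool:
    """Write model output to a fresh directory, run it, and compare stdout."""
    with tempfile.TemporaryDirectory(prefix="harness-") as workdir:
        path = Path(workdir) / filename
        path.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [*command, str(path)],
                cwd=workdir, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # a hung program counts as a failure
        return result.returncode == 0 and result.stdout.strip() == expected_stdout


# Stand-in for a model-generated file; a real run would use the model's answer.
generated = "print(sum(range(1, 11)))\n"
passed = run_generated_file(generated, "solution.py", [sys.executable], "55")
print("PASS" if passed else "FAIL")
```

For compiled languages the same shape applies with a compile step before the run step, and a failing compiler exit code is treated like any other failure.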

It was also tested on framework-style frontend tasks. React, Vue, and Solid components were rendered server-side with Deno/npm tooling, including props, filtering behavior, accessible form markup, and summary text checks. The current v2 run passed 3/3.

These small harnesses do not prove universal coding ability. What they do show is that the quantization process did not destroy the details that low-bit models often lose first: exact exported names, balanced parsing logic, checked arithmetic APIs, command/tool argument shapes, and framework-specific rendering conventions.

Tool-calling validation

Tool calling was exercised in realistic agent loops rather than only checking toy single-call examples. The harness used for this validation was Swival. Nothing in the build is tied to it, and any OpenAI-compatible agent harness is likely to work in much the same way, but Swival is the only one that has actually been put through its paces here.

Validation included:

  • a broad synthetic selector suite covering a wide tool surface
  • real one-shot agent tasks over files, grep, command execution, fetches, image input, skills, snapshots, todos, and subagents
  • a real goal-mode run that required the model to complete work and call a final completion tool

The current v2 results were:

  • all-tools selector: 22/22
  • real one-shot agent suite: 10/10 with zero failed tool calls
  • real goal-mode completion call: passed with exactly one successful final call

A separate repetition-loop guard was also run on long coding and web prompts. The current v2 artifact passed 4/4, with no repeated-tail failures.
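The guard used for these runs is not described in detail, so the following is just one simple way to detect a repeated tail: check whether the output ends with the same chunk repeated back to back. The function name and both thresholds are assumptions, and the scan is quadratic in the worst case, which is acceptable for a sketch but not tuned for very long outputs.

```python
def has_repeated_tail(text: str, min_unit: int = 20, min_repeats: int = 3) -> bool:
    """Return True if the text ends with one chunk repeated back to back.

    min_unit is the shortest chunk length considered, so short natural
    repetitions (such as closing brackets) are not flagged.
    """
    text = text.rstrip()
    max_unit = len(text) // min_repeats
    for unit_len in range(min_unit, max_unit + 1):
        unit = text[-unit_len:]
        if text.endswith(unit * min_repeats):
            return True
    return False


looping = "Here is the plan. " + "Let me reconsider the approach once more. " * 4
healthy = "".join(f"step {i}: compile and test module {i}\n" for i in range(20))

print(has_repeated_tail(looping))
print(has_repeated_tail(healthy))
```

A check like this only catches exact repeats at the end of the output. Loops that paraphrase themselves would need a fuzzier comparison.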

These are local validation results, not public benchmark scores. They are included so users know what this quant was optimized for and what kinds of regressions were actively checked.

Compared with the earlier local candidate, the v2 build fixed the key practical failures: the selector suite went from 18/22 to 22/22, the coding/web suite reached 11/11 after task prompts were aligned with the validators, and the real agent task suite completed with zero failed tool calls. This is why the package is labeled v2.

Serving with llama.cpp

Recent llama.cpp builds should be able to load the repo directly:

llama-server \
  -hf jedisct1/MiMo-V2.5-coder-Q2-v2 \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 100000 \
  --parallel 1 \
  --batch-size 512 \
  --ubatch-size 128 \
  --threads 12 \
  --threads-batch 18 \
  --prio 0 \
  --poll 80 \
  --flash-attn on \
  --jinja \
  --fit on \
  --fit-target 4096 \
  --fit-ctx 100000 \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off

If your llama.cpp build does not auto-select the split GGUF set, pass the first shard explicitly:

llama-server \
  -hf jedisct1/MiMo-V2.5-coder-Q2-v2 \
  --hf-file MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
  --ctx-size 100000 \
  --flash-attn on \
  --jinja \
  --reasoning off

If you cloned or downloaded the repository locally, you can use the helper script:

./run-server.sh

The helper script loads the first GGUF shard next to it and uses the same default serving profile.

Default settings:

MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0

For more memory headroom, use CPU-MoE mode:

MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh

That mode is slower, especially during long prompt prefill, but it leaves more Metal memory available.
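The contents of run-server.sh are not reproduced here. As a rough mental model only, the sketch below shows how such environment overrides could be turned into a llama-server command line. The variable-to-flag mapping is an assumption inferred from the serving profile above, and the use of llama.cpp's `--cpu-moe` flag for MIMO_CPU_MOE=1 is likewise a guess about the script's behavior. Flags that the profile sets unconditionally, and the model path, are left out.

```python
import os
from typing import Dict, List, Mapping

DEFAULTS: Dict[str, str] = {
    "MIMO_CTX": "100000",
    "MIMO_FIT_CTX": "100000",
    "MIMO_FIT_TARGET": "4096",
    "MIMO_BATCH": "512",
    "MIMO_UBATCH": "128",
    "MIMO_REASONING": "off",
    "MIMO_CPU_MOE": "0",
}

# Assumed mapping from variables to llama-server flags.
FLAGS: Dict[str, str] = {
    "MIMO_CTX": "--ctx-size",
    "MIMO_FIT_CTX": "--fit-ctx",
    "MIMO_FIT_TARGET": "--fit-target",
    "MIMO_BATCH": "--batch-size",
    "MIMO_UBATCH": "--ubatch-size",
    "MIMO_REASONING": "--reasoning",
}


def build_args(env: Mapping[str, str]) -> List[str]:
    """Merge environment overrides into the defaults and emit an argv list."""
    settings = {**DEFAULTS, **{k: v for k, v in env.items() if k in DEFAULTS}}
    args = [env.get("LLAMA_SERVER", "llama-server")]
    for name, flag in FLAGS.items():
        args += [flag, settings[name]]
    if settings["MIMO_CPU_MOE"] == "1":
        args.append("--cpu-moe")  # assumption: how CPU-MoE mode is enabled
    return args


print(" ".join(build_args(os.environ)))
```

Printing the resulting command before launching is a cheap way to confirm which overrides actually took effect.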

You can point the script at a specific server binary:

LLAMA_SERVER=/path/to/llama-server ./run-server.sh

Tool-calling tips

  • Disable reasoning output with --reasoning off or MIMO_REASONING=off.
  • Send tool schemas from the client rather than enabling llama.cpp built-in tools.
  • Set parallel_tool_calls to false if your client supports it.
  • Avoid forcing tool_choice: required; in testing, that made malformed calls more likely.
  • Use a client that supports OpenAI-compatible tool calls cleanly.
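The tips above translate into a request shape like the one sketched here. It targets the OpenAI-compatible /v1/chat/completions endpoint on the host and port from the serving profile. The `read_file` tool, the prompt, and the helper names are hypothetical and exist only for illustration. The tool schema is sent by the client, parallel tool calls are disabled, and tool_choice is left at "auto" rather than "required".

```python
import json
import urllib.request
from typing import Any, Dict, List

BASE_URL = "http://127.0.0.1:8080"  # host and port from the serving profile

# Hypothetical tool definition, used only for illustration.
READ_FILE_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a UTF-8 text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(prompt: str, tools: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a chat request that follows the tool-calling tips."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,                 # schemas come from the client
        "tool_choice": "auto",          # not "required"
        "parallel_tool_calls": False,   # one call at a time
    }


def parse_tool_calls(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract tool calls; json.loads raises if the arguments are malformed."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append({"name": fn["name"], "arguments": json.loads(fn["arguments"])})
    return calls


def send(payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST the payload to a running llama-server instance."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    payload = build_request("Show me the contents of README.md", [READ_FILE_TOOL])
    print(json.dumps(payload, indent=2))
    # With a server running, uncomment to send the request:
    # print(parse_tool_calls(send(payload)))
```

Decoding the arguments string with `json.loads` on the client side is a simple way to catch malformed calls early instead of passing them on to a tool.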

License

The upstream XiaomiMiMo/MiMo-V2.5 model card declares the MIT license. This derived GGUF is provided with the same license metadata.
