MiMo-V2.5 Coder Q2 v2 MTP GGUF
NOTE: On a 128 GB system, the non-MTP build is recommended and is likely to be faster.
This is the MTP-included sibling of MiMo-V2.5-coder-Q2-v2: a text-only GGUF build of XiaomiMiMo/MiMo-V2.5, tuned for coding and OpenAI-compatible tool calling on high-memory local machines.
Most users should start with the non-MTP MiMo-V2.5-coder-Q2-v2 package. This MTP variant is larger and is mainly for users who specifically want MiMo's preserved multi-token prediction tensors in GGUF form.
The target system for this build is a 128 GB Apple Silicon machine, but this MTP-included artifact is a tighter fit there than the non-MTP package. The default serving profile uses a 100,000-token context and asks llama.cpp to fit as much of the model as possible onto Metal while leaving enough headroom for KV cache and runtime buffers. Machines with less memory will likely need a smaller context, more CPU offload, or the non-MTP build.
This is not a multimodal build. MiMo-V2.5 is an omnimodal checkpoint, but this GGUF contains the text model only. The vision and audio encoders are not included. Unlike the non-MTP package, MiMo's multi-token prediction blocks are preserved here.
Why this build exists
Public low-bit quants of very large MoE models can be surprisingly fragile: tool calls may become malformed, code may fail on small API details, and long answers can drift into repeated reasoning loops. The v2 quantization recipe was made to spend the limited Q2-class quality budget on the workloads where MiMo-V2.5 is most useful locally:
- coding in common systems and scripting languages
- web UI/component generation
- OpenAI-compatible tool calling
- agent loops over real files and commands
- long English technical prompts
This MTP variant exists for completeness and future runtime work. The non-MTP v2 package is smaller, fits a 128 GB machine better, and was the artifact used for the final coding and Swival validation. This package keeps the same v2 coding/tool-use bias while preserving MiMo's three MTP/NextN blocks.
Chinese-language quality and multimodal behavior were not optimization targets.
MTP runtime note
This GGUF keeps MiMo's three multi-token prediction blocks. Current llama.cpp builds have generic draft-mtp support for some model families, but the MiMo2 backend does not currently execute MiMo's embedded MTP blocks for normal generation. llama.cpp will report the trailing MTP tensors as unused because the MiMo2 loader marks them as skipped.
That warning means the preserved MTP tensors are present but ignored by the current runtime. It is not a corrupted-file warning. Until llama.cpp grows a MiMo2 MTP graph, this package is a larger reference build rather than the practical default for day-to-day serving.
How it was built
The source was the original XiaomiMiMo/MiMo-V2.5 checkpoint, converted to split BF16 GGUF with llama.cpp's native MiMo2 support. Conversion was text-only and preserved MiMo's MTP/NextN blocks.
The final artifact is a split Q2_K_S GGUF using the v2 importance matrix from the non-MTP coding/tool-use build. That matrix was built from English coding, debugging, tool-calling, shell, and agent prompts. It was used to make the quantizer preserve behavior that matters for developer workflows rather than generic chat breadth.
The build followed the same iterative path as the non-MTP v2 package:
- Convert the original checkpoint to split BF16 GGUF.
- Produce a first low-bit coding/tool-use candidate.
- Test that candidate on executable coding tasks and realistic tool-calling agent loops.
- Add calibration coverage for the failures that showed up in real tests.
- Rebuild the importance matrix from the expanded coding/tool-use prompt mix.
- Re-quantize with the final Q2_K_S recipe.
- Apply the v2 matrix to the MTP-preserving BF16 split and manually protect the MTP tensors that were not present in the non-MTP calibration model.
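The steps above map roughly onto llama.cpp's stock tooling. The following is a dry-run sketch under stated assumptions, not the command lines actually used for this build: every path and file name is a hypothetical placeholder, shard splitting is left out, and only the two per-tensor overrides whose types are named in this card (Q3_K for MoE down-experts, Q4_K for nextn.eh_proj) are shown. It prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch of a convert -> imatrix -> quantize pipeline with llama.cpp tools.
# All paths and file names below are hypothetical placeholders.
SRC_DIR="${SRC_DIR:-./MiMo-V2.5}"        # local copy of the original checkpoint
BF16="${BF16:-./mimo-bf16.gguf}"         # BF16 conversion output
CALIB="${CALIB:-./calibration.txt}"      # coding/tool-use calibration prompts
IMATRIX="${IMATRIX:-./imatrix-v2.dat}"   # importance matrix file
OUT="${OUT:-./mimo-q2-v2.gguf}"          # quantized output

# Print each command instead of executing it.
# Replace the body with "$@" to run the commands for real.
run() { printf '+ %s\n' "$*"; }

pipeline() {
  # 1. Convert the checkpoint to BF16 GGUF (splitting omitted for brevity).
  run python3 convert_hf_to_gguf.py "$SRC_DIR" --outtype bf16 --outfile "$BF16"
  # 2. Build the importance matrix from the calibration prompts.
  run llama-imatrix -m "$BF16" -f "$CALIB" -o "$IMATRIX"
  # 3. Quantize with the matrix and per-tensor type overrides.
  run llama-quantize --imatrix "$IMATRIX" \
      --tensor-type ffn_down_exps=q3_k \
      --tensor-type nextn.eh_proj=q4_k \
      "$BF16" "$OUT" Q2_K_S
}

pipeline
```

The iterative part of the process (test, extend the calibration text, rebuild the matrix) amounts to repeating steps 2 and 3 with a larger calibration file.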
The calibration text is not required to use the model. It was a build-time tool for telling the quantizer which activations mattered most: code generation, code repair, shell-style work, JSON/tool-call formatting, and agent workflows over real files.
Quantization details:
- Quant type: Q2_K_S
- Importance matrix: v2 coding and tool-calling focused matrix
- Embeddings and output tensors kept at higher precision
- Attention and dense first-FFN tensors protected at higher precision
- MTP dense FFN and nextn.eh_proj tensors protected at Q4_K
- MoE down-expert tensors kept at Q3_K
- Quantizer-reported model size: about 109,026.87 MiB, 2.95 BPW
- Split shard total on disk: about 109,032.58 MiB, 106.48 GiB
- MTP metadata: mimo2.block_count = 51, mimo2.nextn_predict_layers = 3
- Split files: 16 GGUF shards named MiMo-V2.5-coder-Q2-v2-MTP-00001-of-00016.gguf through MiMo-V2.5-coder-Q2-v2-MTP-00016-of-00016.gguf
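Because the model only loads when all 16 shards are present, it can be worth checking a manual download before starting the server. The sketch below generates the expected file names from the naming pattern above and reports any that are missing; the function names are illustrative, not part of this repository.

```shell
#!/bin/sh
PREFIX="MiMo-V2.5-coder-Q2-v2-MTP"
TOTAL=16

# Print the expected shard file names, one per line.
shard_names() {
  i=1
  while [ "$i" -le "$TOTAL" ]; do
    printf '%s-%05d-of-%05d.gguf\n' "$PREFIX" "$i" "$TOTAL"
    i=$((i + 1))
  done
}

# Report shards missing from directory $1 (default: current directory).
# Returns 0 only if every shard is present.
check_shards() {
  dir="${1:-.}"
  missing=0
  for name in $(shard_names); do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $name"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

shard_names | head -n 1
```

A typical call would be `check_shards /path/to/download/dir && echo "all shards present"`.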
One tokenizer metadata fix is included: the base-vocabulary </s> token is marked as a control-looking token so llama.cpp does not warn at load time. MiMo's real EOS token remains <|im_end|>.
Why this recipe was chosen
The recipe is a compromise between quality and a hard practical limit: this model has to run locally on a 128 GB unified-memory machine. Higher-bit GGUFs of a model this large can exceed the useful memory envelope once KV cache, batching, Metal buffers, and the operating system are included.
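To make the memory envelope concrete, here is a back-of-the-envelope calculation using the on-disk shard total reported in this card. It assumes that "128 GB" of unified memory means 128 GiB and that the whole model stays resident; both are simplifications, and the function name is illustrative.

```shell
#!/bin/sh
# Rough headroom estimate in MiB: total memory minus model size.
# $1 = total memory in GiB, $2 = model size in MiB
headroom_mib() {
  awk -v total_gib="$1" -v model_mib="$2" \
    'BEGIN { printf "%.2f\n", total_gib * 1024 - model_mib }'
}

headroom_mib 128 109032.58   # prints 22039.42
```

That is roughly 21.5 GiB left for the KV cache, batching, Metal buffers, and the operating system combined, which is why a higher-bit quant of this model is not practical on the target machine.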
The first plain Q2_K family candidate was small enough, but it was not reliable enough for tool calling: it emitted some malformed tool-call arguments and missed several conditional tools. The v2 recipe is larger, but it spends the extra space where it helped most:
- embeddings and output tensors stay higher precision because they are important for token identity and exact syntax
- attention tensors are protected because tool-call and code prompts are structure-heavy
- the dense first FFN is protected because early-layer representation quality matters disproportionately after heavy quantization
- MoE down-expert tensors use Q3_K; this was kept from the known-good imatrix recipe rather than isolated as the only required choice
This MTP artifact follows that same recipe class, then adds Q4_K protection for the preserved MTP FFN, attention, and nextn.eh_proj tensors. Those tensors are not used by current llama.cpp MiMo2 generation, but preserving them at higher precision makes this package a better reference artifact if MiMo2 MTP execution is added later.
That is why this is still a Q2-class build, but not the smallest possible Q2 build.
Why it is good at coding
The v2 quant was not chosen just because it fits in memory. It was iterated against executable tasks and then rebuilt with a stronger coding/tool-use importance matrix after early failures were identified.
The first low-bit pass exposed the kinds of issues that matter in practice: malformed tool-call arguments, brittle JavaScript Markdown parsing, incorrect Zig checked-addition APIs, and small C/C++/Go harness problems. Those failures were used to improve the calibration distribution and to validate that the final model can solve the tasks when the problem statement contains the same constraints a developer would normally give.
The non-MTP v2 artifact passed the local coding and web-design harness across:
- Swift
- JavaScript
- TypeScript through Deno
- Rust
- C
- C++
- Zig
- Python
- Perl
- Go
- static HTML/CSS
That harness writes complete model-generated files into isolated directories and validates them with local compilers, runtimes, or test runners. The current non-MTP v2 run passed 11/11. The checks are intentionally practical rather than benchmark-like: they catch whether the generated code compiles, runs, and handles edge cases from the prompt.
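The harness itself is not part of this repository. As a minimal sketch of the idea, assuming one generated file per task, the function below writes the file into a fresh temporary directory, runs a validation command there, and returns that command's exit status. The function name and the example commands are illustrative.

```shell
#!/bin/sh
# Write stdin to <filename> inside a fresh temp dir, run the given command
# there, and return the command's exit status. The directory is removed
# afterwards so runs cannot affect each other.
validate_in_isolation() {
  filename="$1"
  shift
  workdir=$(mktemp -d) || return 2
  cat > "$workdir/$filename"
  ( cd "$workdir" && "$@" ) && status=0 || status=$?
  rm -rf "$workdir"
  return "$status"
}

# Example: treat a one-line shell script as the "generated" file.
printf 'echo hello\n' | validate_in_isolation main.sh sh main.sh
```

For a compiled language the same call shape applies, for example `validate_in_isolation main.c cc -Wall -o main main.c < generated.c`, followed by a second step that runs the resulting binary against the prompt's edge cases.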
It was also tested on framework-style frontend tasks. React, Vue, and Solid components were rendered server-side with Deno/npm tooling, including props, filtering behavior, accessible form markup, and summary text checks. The current non-MTP v2 run passed 3/3.
This MTP package uses the same v2 matrix and recipe class, with additional manual protection for the preserved MTP tensors. Unless this package is tested separately, do not read the non-MTP pass counts as direct benchmark scores for the MTP artifact itself. They are included because they explain what the v2 calibration target optimized for and what kinds of regressions were actively checked.
The important point is not that these small harnesses prove universal coding ability. They prove that the v2 quantization process did not destroy the details that low-bit models often lose first: exact exported names, balanced parsing logic, checked arithmetic APIs, command/tool argument shapes, and framework-specific rendering conventions.
Tool-calling validation
Tool calling was exercised in realistic agent loops rather than only in toy single-call examples. The harness used for this validation was Swival. Nothing in the build is tied to it, and any OpenAI-compatible agent harness is likely to work in much the same way, but Swival is the only one that has actually been put through its paces here.
Validation for the non-MTP v2 artifact included:
- a broad synthetic selector suite covering a wide tool surface
- real one-shot agent tasks over files, grep, command execution, fetches, image input, skills, snapshots, todos, and subagents
- a real goal-mode run that required the model to complete work and call a final completion tool
The current non-MTP v2 results were:
- all-tools selector: 22/22
- real one-shot agent suite: 10/10 with zero failed tool calls
- real goal-mode completion call: passed with exactly one successful final call
A separate repetition-loop guard was also run on long coding and web prompts. The current non-MTP v2 artifact passed 4/4, with no repeated-tail failures.
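The guard used for that check is not published here. As a rough illustration of what a repeated-tail check can look like, the sketch below flags an output file whose final lines contain the same substantial line several times. The window size, the repeat threshold, and the 20-character minimum (there to ignore lines such as a lone closing brace) are arbitrary assumptions.

```shell
#!/bin/sh
# Return 0 if any line of at least 20 characters occurs $3 or more times
# within the last $2 lines of file $1; return 1 otherwise.
# Defaults: window of 40 lines, threshold of 4 repeats.
has_repeated_tail() {
  tail -n "${2:-40}" "$1" | awk -v t="${3:-4}" '
    length($0) >= 20 { count[$0]++ }
    END {
      for (line in count) if (count[line] >= t) found = 1
      exit (found ? 0 : 1)
    }'
}
```

It could be applied to a saved response with `has_repeated_tail response.txt && echo "possible loop"`. A heuristic this simple will miss loops that vary slightly between repetitions.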
These are local validation results, not public benchmark scores. They are included so users know what this quant was optimized for and what kinds of regressions were actively checked. This MTP package has been load-smoked separately, but the full coding and Swival suites above should be treated as validation of the shared v2 calibration target rather than direct MTP benchmark scores.
Compared with the earlier local candidate, the v2 build fixed the key practical failures: the selector suite went from 18/22 to 22/22, the coding/web suite reached 11/11 after task prompts were aligned with the validators, and the real agent task suite completed with zero failed tool calls. This is why the package is labeled v2.
Serving with llama.cpp
Recent llama.cpp builds should be able to load the repo directly:
llama-server \
-hf jedisct1/MiMo-V2.5-coder-Q2-v2-MTP \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 100000 \
--parallel 1 \
--batch-size 512 \
--ubatch-size 128 \
--threads 12 \
--threads-batch 18 \
--prio 0 \
--poll 80 \
--flash-attn on \
--jinja \
--fit on \
--fit-target 4096 \
--fit-ctx 100000 \
--gpu-layers auto \
--cache-type-k f16 \
--cache-type-v f16 \
--reasoning off
If your llama.cpp build does not auto-select the split GGUF set, pass the first shard explicitly:
llama-server \
-hf jedisct1/MiMo-V2.5-coder-Q2-v2-MTP \
--hf-file MiMo-V2.5-coder-Q2-v2-MTP-00001-of-00016.gguf \
--ctx-size 100000 \
--flash-attn on \
--jinja \
--reasoning off
If you cloned or downloaded the repository locally, you can use the helper script:
./run-server.sh
The helper script loads the first GGUF shard next to it and uses the same default serving profile.
Default settings:
MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0
For more memory headroom, use CPU-MoE mode:
MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh
That mode is slower, especially during long prompt prefill, but it leaves more Metal memory available.
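For readers who are not using the helper script, the sketch below shows one plausible way the MIMO_* variables could translate into llama-server flags. It is a hypothetical mapping that mirrors the documented defaults, not the contents of run-server.sh; in particular, tying MIMO_CPU_MOE=1 to llama.cpp's --cpu-moe flag is an assumption.

```shell
#!/bin/sh
# Hypothetical mapping from MIMO_* variables to llama-server flags.
# This is NOT the actual run-server.sh; defaults mirror the list above.
server_flags() {
  printf '%s ' \
    --ctx-size "${MIMO_CTX:-100000}" \
    --fit on \
    --fit-ctx "${MIMO_FIT_CTX:-100000}" \
    --fit-target "${MIMO_FIT_TARGET:-4096}" \
    --batch-size "${MIMO_BATCH:-512}" \
    --ubatch-size "${MIMO_UBATCH:-128}" \
    --reasoning "${MIMO_REASONING:-off}"
  # Assumption: CPU-MoE mode corresponds to llama.cpp's --cpu-moe flag.
  if [ "${MIMO_CPU_MOE:-0}" = "1" ]; then
    printf '%s ' --cpu-moe
  fi
  printf '\n'
}

server_flags
```

Printing the flags first makes it easy to compare the default profile with the CPU-MoE profile before committing to a long model load.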
You can point the script at a specific server binary:
LLAMA_SERVER=/path/to/llama-server ./run-server.sh
Tool-calling tips
- Disable reasoning output with --reasoning off or MIMO_REASONING=off.
- Send tool schemas from the client rather than enabling llama.cpp built-in tools.
- Set parallel_tool_calls to false if your client supports it.
- Avoid forcing tool_choice: required; in testing, that made malformed calls more likely.
- Use a client that supports OpenAI-compatible tool calls cleanly.
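As a minimal sketch of a request that follows these tips, the script below builds an OpenAI-style chat-completions body with a client-supplied tool schema, parallel_tool_calls set to false, and tool_choice left at auto. The read_file tool and the prompt are made-up placeholders, and the URL assumes the host and port from the serving command earlier in this card.

```shell
#!/bin/sh
# Build a chat-completions request body that follows the tips above.
# The read_file tool is a hypothetical example schema.
build_payload() {
  cat <<'EOF'
{
  "messages": [{"role": "user", "content": "Show me the contents of README.md"}],
  "parallel_tool_calls": false,
  "tool_choice": "auto",
  "tools": [{
    "type": "function",
    "function": {
      "name": "read_file",
      "description": "Read a text file from the workspace",
      "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"]
      }
    }
  }]
}
EOF
}

# POST the payload to a running llama-server (not called automatically).
send_request() {
  build_payload | curl -s "${BASE_URL:-http://127.0.0.1:8080}/v1/chat/completions" \
    -H 'Content-Type: application/json' -d @-
}

build_payload
```

With the server running, `send_request` should return a response whose message carries a tool_calls entry naming read_file; the client is then responsible for executing the tool and sending the result back as a tool message.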
License
The upstream XiaomiMiMo/MiMo-V2.5 model card declares the MIT license. This derived GGUF is provided with the same license metadata.