MiMo-V2.5 Coder Q2 v2 GGUF
This is a text-only GGUF build of XiaomiMiMo/MiMo-V2.5, tuned for coding and OpenAI-compatible tool calling on high-memory local machines.
The target system for this build is a 128 GB Apple Silicon machine. The default serving profile uses a 100,000-token context and asks llama.cpp to fit as much of the model as possible onto Metal while leaving enough headroom for KV cache and runtime buffers. Smaller-memory machines will likely need a smaller context, more CPU offload, or a smaller quant.
This is not a multimodal build. MiMo-V2.5 is an omnimodal checkpoint, but this GGUF contains the text model only. The vision and audio encoders are not included. MiMo's multi-token prediction blocks were also omitted because the current llama.cpp MiMo2 generation path does not use those blocks for normal inference.
Why this build exists
Public low-bit quants of very large MoE models can be surprisingly fragile: tool calls may become malformed, code may fail on small API details, and long answers can drift into repeated reasoning loops. This build was made to spend the limited Q2-class quality budget on the workloads where MiMo-V2.5 is most useful locally:
- coding in common systems and scripting languages
- web UI/component generation
- OpenAI-compatible tool calling
- agent loops over real files and commands
- long English technical prompts
Chinese-language quality and multimodal behavior were not optimization targets.
How it was built
The source was the original XiaomiMiMo/MiMo-V2.5 checkpoint, converted to GGUF with llama.cpp's native MiMo2 support. Conversion was text-only and omitted runtime-inactive MTP/NextN blocks, so memory is not spent on tensors that current llama.cpp MiMo2 inference does not execute.
The final artifact is a split Q2_K_S GGUF with an importance matrix built from English coding, debugging, tool-calling, shell, and agent prompts. The calibration mix was designed to make the quantizer preserve behavior that matters for developer workflows rather than generic chat breadth.
The build was iterative:
- Convert the original checkpoint to split BF16 GGUF.
- Produce a first low-bit coding/tool-use candidate.
- Test that candidate on executable coding tasks and realistic tool-calling agent loops.
- Add calibration coverage for the failures that showed up in real tests.
- Rebuild the importance matrix from the expanded coding/tool-use prompt mix.
- Re-quantize with the final Q2_K_S recipe.
The calibration text is not required to use the model. It was a build-time tool for telling the quantizer which activations mattered most: code generation, code repair, shell-style work, JSON/tool-call formatting, and agent workflows over real files.
Quantization details:
- Quant type: Q2_K_S
- Importance matrix: coding and tool-calling focused
- Embeddings and output tensors kept at higher precision
- Attention and dense first-FFN tensors protected at higher precision
- MoE down-expert tensors kept at Q3_K
- Reported size: about 108,496.76 MiB, 2.95 BPW
- Split files: 16 GGUF shards
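A recipe of this shape can be expressed with llama.cpp's `llama-quantize` tool. The sketch below only assembles a command line and runs nothing. The file names, the `q6_k` choice for embeddings and output, and the `ffn_down_exps` tensor pattern are assumptions for illustration, not the actual build settings. The attention and dense first-FFN overrides are left out because their types are not stated here.

```python
from typing import List


def build_quantize_command(
    src_gguf: str,
    dst_gguf: str,
    imatrix_path: str,
    embed_type: str = "q6_k",        # assumption: the card only says "higher precision"
    down_expert_type: str = "q3_k",  # from the card: MoE down-experts kept at Q3_K
) -> List[str]:
    """Assemble a llama-quantize argv for a Q2_K_S build with per-tensor overrides."""
    return [
        "llama-quantize",
        "--imatrix", imatrix_path,
        "--token-embedding-type", embed_type,
        "--output-tensor-type", embed_type,
        # Hypothetical tensor-name pattern for the MoE down-projection experts.
        "--tensor-type", f"ffn_down_exps={down_expert_type}",
        src_gguf,
        dst_gguf,
        "Q2_K_S",
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute your own BF16 GGUF and imatrix file.
    print(" ".join(build_quantize_command("model-bf16.gguf", "model-q2.gguf", "imatrix.dat")))
```

Keeping the overrides as function arguments makes it easy to compare candidate recipes by changing one tensor class at a time.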
One tokenizer metadata fix is included: the base-vocabulary </s> token is marked as a control token, so llama.cpp no longer warns about a control-looking token at load time. MiMo's real EOS token remains <|im_end|>.
Why this recipe was chosen
The recipe is a compromise between quality and a hard practical limit: this model has to run locally on a 128 GB unified-memory machine. Higher-bit GGUFs of a model this large can exceed the useful memory envelope once KV cache, batching, Metal buffers, and the operating system are included.
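As a rough illustration of that envelope, the arithmetic below uses the reported size and bits-per-weight figures from this card. Reading "128 GB" as 128 GiB is an assumption. The calculation ignores how much unified memory the operating system actually lets Metal use, so treat the result as an upper bound on headroom.

```python
MIB_PER_GIB = 1024
BITS_PER_MIB = 1024 * 1024 * 8


def headroom_mib(total_gib: float, model_mib: float) -> float:
    """Memory left for KV cache, buffers, and the OS once the weights are loaded."""
    return total_gib * MIB_PER_GIB - model_mib


def implied_weight_count(model_mib: float, bits_per_weight: float) -> float:
    """Number of weights implied by a file size and an average bits-per-weight."""
    return model_mib * BITS_PER_MIB / bits_per_weight


if __name__ == "__main__":
    size_mib = 108_496.76  # reported size from this card
    bpw = 2.95             # reported bits per weight from this card
    print(f"headroom: {headroom_mib(128, size_mib):.2f} MiB")
    print(f"implied weights: {implied_weight_count(size_mib, bpw):.3e}")
```

Under these assumptions the weights leave about 22,575 MiB for everything else, which is why a higher-bit quant of the same model quickly becomes impractical at a 100,000-token context.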
The first plain Q2_K family candidate was small enough, but it was not reliable enough for tool calling. It malformed some tool-call arguments and missed several conditional tools. The v2 recipe is larger, but it spends the extra space where it helped most:
- embeddings and output tensors stay higher precision because they are important for token identity and exact syntax
- attention tensors are protected because tool-call and code prompts are structure-heavy
- the dense first FFN is protected because early-layer representation quality matters disproportionately after heavy quantization
- MoE down-expert tensors use Q3_K, which was a better quality/memory tradeoff than pushing all expert down-projections lower
That is why this is still a Q2-class build, but not the smallest possible Q2 build.
Why it is good at coding
This quant was not chosen just because it fits in memory. It was iterated against executable tasks and then rebuilt with a stronger coding/tool-use importance matrix after early failures were identified.
The first low-bit pass exposed the kinds of issues that matter in practice: malformed tool-call arguments, brittle JavaScript Markdown parsing, incorrect Zig checked-addition APIs, and small C/C++/Go harness problems. Those failures were used to improve the calibration distribution and to validate that the final model can solve the tasks when the problem statement contains the same constraints a developer would normally give.
The final v2 artifact passed the local coding and web-design harness across:
- Swift
- JavaScript
- TypeScript through Deno
- Rust
- C
- C++
- Zig
- Python
- Perl
- Go
- static HTML/CSS
That harness writes complete model-generated files into isolated directories and validates them with local compilers, runtimes, or test runners. The current v2 run passed 11/11. The checks are intentionally practical rather than benchmark-like: they catch whether the generated code compiles, runs, and handles edge cases from the prompt.
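The harness itself is not reproduced here. The following is a minimal sketch of the same idea for a single language, assuming the model's answer is one complete Python file and that success means a zero exit code plus the expected standard output. The function name and the file name `solution.py` are illustrative.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_python(source: str, expected_stdout: str, timeout: float = 10.0) -> bool:
    """Write model-generated code into an isolated directory, run it, and check the result."""
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "solution.py"
        path.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hang counts as a failure, just like a crash.
            return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


if __name__ == "__main__":
    print(run_generated_python("print(sum(range(5)))", "10"))
```

For compiled languages the same structure applies, with a compile step before the run step and the compiler's exit code checked first.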
It was also tested on framework-style frontend tasks. React, Vue, and Solid components were rendered server-side with Deno/npm tooling, including props, filtering behavior, accessible form markup, and summary text checks. The current v2 run passed 3/3.
The important point is not that these small harnesses prove universal coding ability. They prove that the quantization process did not destroy the details that low-bit models often lose first: exact exported names, balanced parsing logic, checked arithmetic APIs, command/tool argument shapes, and framework-specific rendering conventions.
Tool-calling validation
Tool calling was exercised in realistic agent loops rather than only checking toy single-call examples. The harness used for this validation was Swival. Nothing in the build is tied to it, and any OpenAI-compatible agent harness is likely to work in much the same way, but Swival is the only one that has actually been put through its paces here.
Validation included:
- a broad synthetic selector suite covering a wide tool surface
- real one-shot agent tasks over files, grep, command execution, fetches, image input, skills, snapshots, todos, and subagents
- a real goal-mode run that required the model to complete work and call a final completion tool
The current v2 results were:
- all-tools selector: 22/22
- real one-shot agent suite: 10/10 with zero failed tool calls
- real goal-mode completion call: passed with exactly one successful final call
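To make "malformed tool call" concrete, here is a small sketch of the kind of check a harness could apply to each call in an OpenAI-style response. It is not Swival's implementation. It assumes the usual shape where `function.arguments` is a JSON string and each tool declares `parameters` with `properties` and `required`.

```python
import json
from typing import Any, Dict, List


def check_tool_call(call: Dict[str, Any], tools: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    schemas = {t["function"]["name"]: t["function"].get("parameters", {}) for t in tools}
    fn = call.get("function", {})
    name = fn.get("name")
    if name not in schemas:
        return [f"unknown tool: {name!r}"]
    try:
        args = json.loads(fn.get("arguments"))
    except (TypeError, json.JSONDecodeError) as exc:
        return [f"arguments are not valid JSON: {exc}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    params = schemas[name]
    for key in params.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key in args:
        if key not in params.get("properties", {}):
            problems.append(f"unexpected argument: {key}")
    return problems


if __name__ == "__main__":
    # Hypothetical tool definition used only for this example.
    tools = [{"type": "function", "function": {
        "name": "read_file",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}}]
    call = {"type": "function",
            "function": {"name": "read_file", "arguments": '{"path": "README.md"}'}}
    print(check_tool_call(call, tools))
```

A check like this catches the failure classes mentioned above (unparseable arguments, missing fields, calls to tools that were never offered) without needing to execute the tool.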
A separate repetition-loop guard was also run on long coding and web prompts. The current v2 artifact passed 4/4, with no repeated-tail failures.
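The guard's exact logic is not described here. One simple way to detect a repeated tail, shown below as a sketch, is to ask whether the output ends with the same block of text several times in a row. The thresholds `min_unit` and `min_repeats` are illustrative defaults, not the values used in validation.

```python
def has_repeated_tail(text: str, min_unit: int = 20, min_repeats: int = 3) -> bool:
    """True if text ends with a unit of at least min_unit characters
    repeated back-to-back at least min_repeats times."""
    n = len(text)
    max_unit = n // min_repeats
    for unit_len in range(min_unit, max_unit + 1):
        unit = text[n - unit_len:]
        repeats = 1
        pos = n - 2 * unit_len
        # Walk backwards one unit at a time while the text keeps matching.
        while pos >= 0 and text[pos:pos + unit_len] == unit:
            repeats += 1
            pos -= unit_len
        if repeats >= min_repeats:
            return True
    return False


if __name__ == "__main__":
    looping = "Intro. " + "Let me reconsider the approach. " * 4
    print(has_repeated_tail(looping))
    print(has_repeated_tail("A short answer that simply ends."))
```

The minimum unit length keeps ordinary repetition, such as a run of closing braces or list markers, from being flagged as a loop.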
These are local validation results, not public benchmark scores. They are included so users know what this quant was optimized for and what kinds of regressions were actively checked.
Compared with the earlier local candidate, the v2 build fixed the key practical failures: the selector suite went from 18/22 to 22/22, the coding/web suite reached 11/11 after task prompts were aligned with the validators, and the real agent task suite completed with zero failed tool calls. This is why the package is labeled v2.
Serving with llama.cpp
Recent llama.cpp builds should be able to load the repo directly:
llama-server \
-hf jedisct1/MiMo-V2.5-coder-Q2-v2 \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 100000 \
--parallel 1 \
--batch-size 512 \
--ubatch-size 128 \
--threads 12 \
--threads-batch 18 \
--prio 0 \
--poll 80 \
--flash-attn on \
--jinja \
--fit on \
--fit-target 4096 \
--fit-ctx 100000 \
--gpu-layers auto \
--cache-type-k f16 \
--cache-type-v f16 \
--reasoning off
If your llama.cpp build does not auto-select the split GGUF set, pass the first shard explicitly:
llama-server \
-hf jedisct1/MiMo-V2.5-coder-Q2-v2 \
--hf-file MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
--ctx-size 100000 \
--flash-attn on \
--jinja \
--reasoning off
If you cloned or downloaded the repository locally, you can use the helper script:
./run-server.sh
The helper script loads the first GGUF shard next to it and uses the same default serving profile.
Default settings:
MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0
For more memory headroom, use CPU-MoE mode:
MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh
That mode is slower, especially during long prompt prefill, but it leaves more Metal memory available.
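The contents of run-server.sh are not shown here. The sketch below is a guess at how such environment variables could map onto the llama-server flags used earlier in this section. In particular, mapping `MIMO_CPU_MOE=1` to llama.cpp's `--cpu-moe` flag is an assumption, not a description of the script.

```python
from typing import List, Mapping

# Defaults copied from the list above.
DEFAULTS = {
    "MIMO_CTX": "100000",
    "MIMO_FIT_CTX": "100000",
    "MIMO_FIT_TARGET": "4096",
    "MIMO_BATCH": "512",
    "MIMO_UBATCH": "128",
    "MIMO_REASONING": "off",
    "MIMO_CPU_MOE": "0",
}


def server_args(env: Mapping[str, str]) -> List[str]:
    """Build a llama-server argv from MIMO_* overrides (hypothetical mapping)."""
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    args = [
        env.get("LLAMA_SERVER", "llama-server"),
        "--ctx-size", cfg["MIMO_CTX"],
        "--fit-ctx", cfg["MIMO_FIT_CTX"],
        "--fit-target", cfg["MIMO_FIT_TARGET"],
        "--batch-size", cfg["MIMO_BATCH"],
        "--ubatch-size", cfg["MIMO_UBATCH"],
        "--reasoning", cfg["MIMO_REASONING"],
    ]
    if cfg["MIMO_CPU_MOE"] == "1":
        args.append("--cpu-moe")  # assumption: keeps MoE expert weights on the CPU
    return args


if __name__ == "__main__":
    overrides = {"MIMO_CPU_MOE": "1", "MIMO_FIT_TARGET": "32768",
                 "MIMO_BATCH": "128", "MIMO_UBATCH": "64"}
    print(" ".join(server_args(overrides)))
```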
You can point the script at a specific server binary:
LLAMA_SERVER=/path/to/llama-server ./run-server.sh
Tool-calling tips
- Disable reasoning output with --reasoning off or MIMO_REASONING=off.
- Send tool schemas from the client rather than enabling llama.cpp built-in tools.
- Set parallel_tool_calls to false if your client supports it.
- Avoid forcing tool_choice: required; in testing, that made malformed calls more likely.
- Use a client that supports OpenAI-compatible tool calls cleanly.
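The request-side tips can be sketched with the Python standard library, assuming the server from the previous section is listening on http://127.0.0.1:8080/v1. The `read_file` tool and the model name are placeholders for illustration only.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_chat_request(
    model: str,
    messages: List[Dict[str, Any]],
    tools: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Build a chat-completions payload that follows the tips above."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,                 # schemas are sent by the client
        "tool_choice": "auto",          # avoid "required"
        "parallel_tool_calls": False,   # one call at a time
    }


def post_chat(base_url: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST the payload to an OpenAI-compatible server and return the parsed JSON."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical tool schema; replace with the tools your agent actually exposes.
    tools = [{"type": "function", "function": {
        "name": "read_file",
        "description": "Read a UTF-8 text file",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}}]
    payload = build_chat_request(
        "local-model",  # placeholder model name
        [{"role": "user", "content": "Show me README.md"}],
        tools,
    )
    reply = post_chat("http://127.0.0.1:8080/v1", payload)
    print(reply["choices"][0]["message"].get("tool_calls"))
```

Leaving `tool_choice` at `"auto"` lets the model answer in plain text when no tool applies, which is the behavior the tips above recommend.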
License
The upstream XiaomiMiMo/MiMo-V2.5 model card declares the MIT license. This derived GGUF is provided with the same license metadata.