MLX g32 bf16 conversion for TP=2 Mac Studios @20.1 tok/s average
A quick guide, to the extent it is helpful: a simple repackaging to MLX with no loss from dequant/requant, now working smoothly with OpenCode:
Tensor-Parallel Serving Across Two Mac Studios (MLX + JACCL / Thunderbolt 5)
What this is: a battle-tested guide to serving the model across two Mac Studio M3 Ultra boxes as a single OpenAI-compatible endpoint, using MLX tensor parallelism (TP=2) over Apple's JACCL collective library on a Thunderbolt-5 RDMA ring.
It is written from a real bring-up that took many hours and several reboots. The §1 gotchas are the whole point: every one cost us time. Read them first.
Placeholders used throughout (substitute your own):
<USER> = your macOS username · studioA / <STUDIO_A_IP> = the Mac running rank 0 (HTTP) · studioB / <STUDIO_B_IP> = rank 1 (compute) · ~ = home · ~/venvs/mlx-tp2 = the MLX venv · ~/cluster/ = host-side config dir · ~/models/ = model dir.
0. Convert the checkpoint to MLX (no quantization): run first, on each Mac
Many large MoE checkpoints ship as compressed-tensors (INT4 group-32 routed experts + BF16 everything else). The modern mlx_lm loader reads that format natively, so "conversion" is just adding a quantization block to config.json; no weight bytes change.
# On each Mac. Symlinks the native shards into <dst>, writes a patched config.json.
python3 repack_mlx.py --src ~/models/<MODEL>-native --dst ~/models/<MODEL>-MLX
repack_mlx.py (self-contained):
#!/usr/bin/env python3
import argparse, json, os
from pathlib import Path
QUANT = {"group_size": 32, "bits": 4, "mode": "affine"} # match the native expert grid
ap = argparse.ArgumentParser()
ap.add_argument("--src", required=True, type=Path) # native compressed-tensors checkpoint
ap.add_argument("--dst", required=True, type=Path) # MLX output dir (symlink farm)
a = ap.parse_args()
a.dst.mkdir(parents=True, exist_ok=True)
for f in sorted(a.src.iterdir()):
    if f.name == "config.json":
        continue
    t = a.dst / f.name
    if t.is_symlink() or t.exists():
        t.unlink()
    os.symlink(f.resolve(), t) # symlink every shard/file (no copy)
cfg = json.loads((a.src / "config.json").read_text())
cfg["quantization"] = QUANT # the ONLY new bytes on disk
(a.dst / "config.json").write_text(json.dumps(cfg, indent=2))
print(f"repacked {a.src} -> {a.dst} (experts int4 g32; attn/mlp/embed/lm_head bf16; no requant)")
Equivalent without the script:
SRC=~/models/<MODEL>-native; DST=~/models/<MODEL>-MLX
mkdir -p "$DST"
ln -s "$SRC"/* "$DST"/ && rm "$DST/config.json"
python3 -c 'import json,sys; c=json.load(open(sys.argv[1])); c["quantization"]={"group_size":32,"bits":4,"mode":"affine"}; json.dump(c,open(sys.argv[2],"w"),indent=2)' "$SRC/config.json" "$DST/config.json"
At load, mlx_lm reinterprets the packed INT4 (weight_packed.view(uint32), biases = -8*scales) and wraps only the expert modules as int4 QuantizedSwitchLinear; BF16 tensors stay BF16. (mlx_lm.convert is the other path: it dequantizes then requantizes, which changes the weights; this guide does not use it.)
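The `biases = -8*scales` remapping can be checked with a little arithmetic. The sketch below assumes the native experts store symmetric INT4 as unsigned nibbles q in [0, 15] that stand for the signed value q - 8 (an assumption about the checkpoint layout, not something confirmed here), and that affine dequantization computes scale * q + bias. Under those assumptions the two forms agree for every nibble, which is why no requantization is needed. The function names are illustrative only.

```python
# Hypothetical helpers; assumes unsigned nibbles q in [0, 15] encode q - 8.

def dequant_symmetric(q: int, scale: float) -> float:
    """Symmetric INT4: signed value (q - 8) times the group scale."""
    return scale * (q - 8)


def dequant_affine(q: int, scale: float, bias: float) -> float:
    """Affine form: scale * q + bias."""
    return scale * q + bias


scale = 0.125          # a power of two, so the float arithmetic is exact
bias = -8 * scale      # the remapping described above

# Collect every nibble for which the two formulas disagree.
mismatches = [
    q for q in range(16)
    if dequant_symmetric(q, scale) != dequant_affine(q, scale, bias)
]
print(mismatches)
```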
Do this on both Macs at the same path, then continue.
1. The six hard-won gotchas (read before anything else)
1. en5 MTU must be 9000. The TB5 ring defaults to MTU 1500 after every reboot. At 1500, JACCL's _share_object (rank-0 → rank-1 prompt transfer) corrupts payloads under load: coherent output for a request or two, then gibberish (<br><br>3333…), then RuntimeError: share_object: payload corrupt after retries and a rank crash. Fix it AND persist it (§4.2).
2. Request the served model id, not the checkpoint path. mlx_lm.server advertises the model by its on-disk path. Sending that path as the "model" field triggers a broken distributed code path → corruption. Sending the built-in alias default_model uses the loaded model directly and is rock-solid. Any other string returns 404 and can desync/crash the distributed server. Clients send default_model.
3. --prefill-step-size is a memory landmine on bf16-attention models. Bigger = faster prompt processing, but the prefill activation peak scales with it. On a model with bf16 attention, 2048 OOMs; 64 works but is painfully slow (minutes for a 16K-token prompt). 512 is the safe middle. A model with 8-bit attention tolerates 2048.
4. An OOM crash leaks wired GPU memory; only a reboot clears it. When a rank hits [METAL] … Insufficient Memory, the process dies but its wired memory stays pinned (wired ≈ 492 GiB with no process holding it). Any relaunch then OOMs immediately. No userspace reclaim: reboot both Macs. Tune conservatively; every wrong guess costs a reboot.
5. A reboot also clears a "stuck" JACCL ring / wedged Metal state. If the ring won't form or a prior crash left things wedged, reboot both boxes. After reboot, log into each via Screen Sharing so Metal/GPU is available (a headless boot may not expose the GPU; user LaunchAgents need a GUI session).
6. Custom-tokenizer checkpoints need --trust-remote-code. compressed-tensors models carry an auto_map + custom tokenizer; without the flag the load throws ValueError: … contains custom code on every rank.
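To make the --prefill-step-size trade-off concrete, here is a rough sketch that counts how many prefill passes a prompt needs at each step size. It assumes the prompt is consumed in fixed-size chunks of that many tokens and takes "16K" to mean 16,384 tokens; it says nothing about per-chunk cost or memory, which is what actually drives the OOM. The function name is made up for illustration.

```python
import math


def prefill_chunks(prompt_tokens: int, step_size: int) -> int:
    """Number of prefill passes if the prompt is consumed step_size tokens at a time."""
    if prompt_tokens < 0 or step_size <= 0:
        raise ValueError("prompt_tokens must be >= 0 and step_size must be > 0")
    return math.ceil(prompt_tokens / step_size)


# Step sizes discussed above, for a 16,384-token prompt.
for step in (64, 512, 2048):
    print(step, prefill_chunks(16_384, step))
```

At 64 the prompt needs 256 passes, against 32 at 512 and 8 at 2048, which fits the observation that the smallest setting is safe but slow.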
2. Topology
           ┌──────────────────────────┐         ┌──────────────────────────┐
client ───▶│ studioA <STUDIO_A_IP>    │◀───────▶│ studioB <STUDIO_B_IP>    │
(OpenAI    │ rank 0 · HTTP :8000      │   TB5   │ rank 1 · compute-only    │
 /v1)      │ ~277 GiB weights         │   en5   │ ~277 GiB weights         │
           └──────────────────────────┘  JACCL  └──────────────────────────┘
             M3 Ultra · 512 GB RAM       RDMA     M3 Ultra · 512 GB RAM
             wired ceiling 493 GiB                wired ceiling 493 GiB
- Rank 0 = studioA runs the HTTP server and computes. Rank 1 = studioB is compute-only (no HTTP listener of its own).
- The model is tensor-sharded: each box holds ~half the weights and they run each forward pass in lockstep, exchanging activations over the ring.
- Transport: JACCL over en5 (the direct Thunderbolt-5 cable, link-local 169.254.x), not an Ethernet fabric.
3. Prerequisites (per Mac)
| Component | Value used here |
|---|---|
| OS | macOS 26.x, GUI session logged in (Screen Sharing OK) |
| Python venv | ~/venvs/mlx-tp2 (must contain mlx.launch) |
| MLX | mlx==0.31.2, mlx-lm==0.31.3 |
| Model class | loader must include your model class and mx.distributed.is_available() == True |
| Model on disk | identical path on both boxes, e.g. ~/models/<MODEL>-MLX (APFS is case-insensitive) |
| SSH | passwordless studioA → studioB (mlx.launch SSHes rank 1) |
~/venvs/mlx-tp2/bin/python -c \
'import mlx.core as mx, mlx_lm; print("mlx",mx.__version__,"mlx_lm",mlx_lm.__version__,"dist",mx.distributed.is_available())'
ls ~/venvs/mlx-tp2/bin/mlx.launch
4. One-time host setup
4.1 Appliance settings (max memory + no Spotlight stalls)
Raise the IOGPU wired limit and quiet the OS so the big job gets the whole machine:
sudo sysctl -w iogpu.wired_limit_mb=505000 # ~493 GiB wired ceiling
sudo mdutil -i off / # Spotlight off (write bursts during load starve Metal)
sudo mdutil -i off /Volumes/* 2>/dev/null || true
sudo touch /.metadata_never_index
(iogpu.wired_limit_mb is not persistent by default; persist it with a boot LaunchDaemon like §4.2, or your provisioning.)
4.2 Persist en5 MTU 9000 (THE critical fix)
The MTU resets to 1500 on every boot, so re-apply automatically. A root LaunchDaemon sets it at boot and re-asserts every 60s.
Immediate (both Macs): sudo ifconfig en5 mtu 9000
Persistent (both Macs, one-time):
sudo mkdir -p /usr/local/bin
printf '#!/bin/bash\nfor i in $(seq 1 15); do /sbin/ifconfig en5 >/dev/null 2>&1 && { /sbin/ifconfig en5 mtu 9000; break; }; sleep 2; done\n' | sudo tee /usr/local/bin/en5-mtu9000.sh
sudo chmod 755 /usr/local/bin/en5-mtu9000.sh
printf '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n<plist version="1.0"><dict><key>Label</key><string>com.local.en5-mtu9000</string><key>ProgramArguments</key><array><string>/bin/bash</string><string>/usr/local/bin/en5-mtu9000.sh</string></array><key>RunAtLoad</key><true/><key>StartInterval</key><integer>60</integer></dict></plist>\n' | sudo tee /Library/LaunchDaemons/com.local.en5-mtu9000.plist
sudo chown root:wheel /Library/LaunchDaemons/com.local.en5-mtu9000.plist && sudo chmod 644 /Library/LaunchDaemons/com.local.en5-mtu9000.plist
sudo launchctl bootstrap system /Library/LaunchDaemons/com.local.en5-mtu9000.plist || sudo launchctl load -w /Library/LaunchDaemons/com.local.en5-mtu9000.plist
Verify after a reboot: ifconfig en5 | grep mtu → mtu 9000, no manual step.
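If you would rather script the check than eyeball it, a small sketch like the following can parse the MTU out of ifconfig output. The helper names are hypothetical and the sample line is illustrative, not captured from these machines.

```python
import re
from typing import Optional


def parse_mtu(ifconfig_output: str) -> Optional[int]:
    """Return the first 'mtu N' value found in ifconfig output, or None."""
    match = re.search(r"\bmtu\s+(\d+)", ifconfig_output)
    return int(match.group(1)) if match else None


def mtu_ok(ifconfig_output: str, required: int = 9000) -> bool:
    """True only when the interface reports exactly the required MTU."""
    return parse_mtu(ifconfig_output) == required


# Illustrative sample of what a freshly rebooted interface might report.
sample = "en5: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500"
print(parse_mtu(sample), mtu_ok(sample))
```

On a real box the input would come from something like `subprocess.run(["ifconfig", "en5"], capture_output=True, text=True).stdout`.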
4.3 JACCL hostfile (~/cluster/jaccl_hostfile.json on studioA)
{
"backend": "jaccl",
"envs": ["MLX_METAL_FAST_SYNCH=1"],
"hosts": [
{"ssh": "<USER>@<STUDIO_A_IP>", "ips": ["<STUDIO_A_IP>"], "rdma": [null, "rdma_en5"]},
{"ssh": "<USER>@<STUDIO_B_IP>", "ips": [], "rdma": ["rdma_en5", null]}
]
}
The rdma_en5 entries pin the collective to the Thunderbolt link. Rank order here defines rank 0 (studioA) and rank 1 (studioB).
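For anyone templating the hostfile, here is a minimal sketch that builds the same structure from the placeholders and sanity-checks it. The check encodes only what can be read off the example above (one rdma entry per host, with a null in each host's own slot); treat that as an assumption rather than a statement of the JACCL schema. The function names and the link-local addresses are made up for illustration.

```python
import json


def build_hostfile(user: str, ip_a: str, ip_b: str) -> dict:
    """Mirror the two-host hostfile shown above; list order defines rank order."""
    return {
        "backend": "jaccl",
        "envs": ["MLX_METAL_FAST_SYNCH=1"],
        "hosts": [
            {"ssh": f"{user}@{ip_a}", "ips": [ip_a], "rdma": [None, "rdma_en5"]},
            {"ssh": f"{user}@{ip_b}", "ips": [], "rdma": ["rdma_en5", None]},
        ],
    }


def rdma_matrix_ok(hostfile: dict) -> bool:
    """Assumed invariant: one rdma entry per host, null in the host's own slot."""
    hosts = hostfile["hosts"]
    return all(
        len(host["rdma"]) == len(hosts) and host["rdma"][rank] is None
        for rank, host in enumerate(hosts)
    )


hostfile = build_hostfile("someuser", "169.254.0.1", "169.254.0.2")  # placeholder values
print(json.dumps(hostfile, indent=2))
print(rdma_matrix_ok(hostfile))
```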
5. The launcher (~/cluster/tp2-launch.sh on studioA)
mlx.launch reads the hostfile, starts rank 0 locally, SSHes rank 1, and runs the same mlx_lm.server on both. Per-model defaults keep each model in its safe envelope:
#!/bin/bash
set -euo pipefail
MODEL_NAME=${1:?Usage: tp2-launch.sh MODEL_NAME [PORT]}
PORT=${2:-8000}
case "$MODEL_NAME" in
# bf16-attention model -> 512 (2048 OOMs)
<MODEL>-MLX) VENV_NAME="mlx-tp2"; PREFILL_DEFAULT=512; MAXTOK_DEFAULT=8192 ;;
*) echo "Unsupported TP2 model: $MODEL_NAME" >&2; exit 64 ;;
esac
PREFILL=${TP2_PREFILL:-$PREFILL_DEFAULT}
MAX_TOKENS=${TP2_MAX_TOKENS:-$MAXTOK_DEFAULT}
LAUNCHER="$HOME/venvs/$VENV_NAME/bin/mlx.launch"
REMOTE_CMD=$(cat <<REMOTE
PY="\$HOME/venvs/$VENV_NAME/bin/python"
MODEL_PATH="\$HOME/models/$MODEL_NAME"
exec "\$PY" -m mlx_lm.server --model "\$MODEL_PATH" --host 0.0.0.0 --port $PORT \
--max-tokens $MAX_TOKENS --prefill-step-size $PREFILL \
--decode-concurrency 1 --prompt-concurrency 1 \
--prompt-cache-size 0 --prompt-cache-bytes 0 --trust-remote-code
REMOTE
)
exec "$LAUNCHER" --verbose --cwd /tmp --hostfile "$HOME/cluster/jaccl_hostfile.json" \
--no-verify-script -- env bash -c "$REMOTE_CMD"
Notes: --trust-remote-code (gotcha 6); --decode-concurrency 1 --prompt-concurrency 1 = single-stream (one request at a time); --prompt-cache-size 0 disables prefix caching (§9).
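The per-model defaults and environment overrides in the launcher can also be expressed as a small Python sketch, which is handy for checking what arguments a given override produces before risking a reboot. It only assembles the argument list and does not launch anything. The flags are the ones used in the script above; the function name and the MODEL_DEFAULTS table are hypothetical.

```python
import os
from typing import List, Mapping, Optional

# Per-model safe envelope, mirroring the case statement in the launcher.
MODEL_DEFAULTS = {"<MODEL>-MLX": {"prefill": 512, "max_tokens": 8192}}


def server_args(model_name: str, port: int = 8000,
                env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the mlx_lm.server argument list with TP2_* overrides applied."""
    env = os.environ if env is None else env
    if model_name not in MODEL_DEFAULTS:
        raise ValueError(f"Unsupported TP2 model: {model_name}")
    defaults = MODEL_DEFAULTS[model_name]
    # Like ${VAR:-default} in bash: unset or empty falls back to the default.
    prefill = int(env.get("TP2_PREFILL") or defaults["prefill"])
    max_tokens = int(env.get("TP2_MAX_TOKENS") or defaults["max_tokens"])
    model_path = os.path.join(os.path.expanduser("~"), "models", model_name)
    return [
        "-m", "mlx_lm.server", "--model", model_path,
        "--host", "0.0.0.0", "--port", str(port),
        "--max-tokens", str(max_tokens), "--prefill-step-size", str(prefill),
        "--decode-concurrency", "1", "--prompt-concurrency", "1",
        "--prompt-cache-size", "0", "--prompt-cache-bytes", "0",
        "--trust-remote-code",
    ]


print(" ".join(server_args("<MODEL>-MLX", env={"TP2_PREFILL": "256"})))
```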
6. Bring-up
Pre-flight: both Macs powered, GUI logged in, wired ≈ 5-6 GiB (no leak), ifconfig en5 shows mtu 9000, iogpu.wired_limit_mb = 505000. If wired is high with no process → leak → reboot (gotcha 4).
# 0) clean stale ranks on both boxes
ssh studioA 'pkill -9 -f "mlx.launch|mlx_lm.server"; rm -f ~/cluster/tp2.log'
ssh studioB 'pkill -9 -f "mlx.launch|mlx_lm.server"'
# 1) launch (rank-0 box drives both ranks via the hostfile)
ssh studioA 'nohup ~/cluster/tp2-launch.sh <MODEL>-MLX 8000 > ~/cluster/tp2.log 2>&1 < /dev/null &'
# 2) watch load: tokenizer loads on BOTH ranks, then weights materialize (~2-4 min)
ssh studioA 'tail -f ~/cluster/tp2.log' # look for "Reloaded tiktoken model from <studioB path>"
Healthy load reaches ~277 GiB resident per box under request load. A momentary idle imbalance (e.g. studioA=337 / studioB=6) is just MLX lazy materialization; both ranks engage when a request arrives.
7. Validation
# A) generation (ALWAYS model=default_model)
ssh studioA 'curl -s http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" \
-d "{\"model\":\"default_model\",\"messages\":[{\"role\":\"user\",\"content\":\"Capital of France? One word.\"}],\"max_tokens\":40,\"temperature\":0}"'
# B) streaming (SSE): one "data:" line per token
ssh studioA 'curl -sN http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" \
-d "{\"model\":\"default_model\",\"stream\":true,\"messages\":[{\"role\":\"user\",\"content\":\"Count 1 to 6.\"}],\"max_tokens\":50}" | grep "^data:"'
# C) corruption/OOM scan (expect empty)
ssh studioA 'grep -E "payload corrupt|Insufficient Memory|OutOfMemory|exited with code" ~/cluster/tp2.log'
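For test B it can help to reassemble the streamed text instead of reading raw "data:" lines. The sketch below assumes the usual OpenAI-style chunk shape (choices[].delta.content) and a "[DONE]" terminator; both are assumptions about what the server emits, and the sample lines are made up rather than captured output.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from SSE 'data:' lines until [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything else
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


# Made-up sample stream, for illustration only.
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "1, 2"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": ", 3"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample_stream))
```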
8. Client wiring (OpenAI-compatible, e.g. OpenCode)
"studio_tp2": {
"npm": "@ai-sdk/openai-compatible",
"name": "Studio TP=2 <MODEL>",
"options": { "baseURL": "http://<STUDIO_A_IP>:8000/v1", "apiKey": "EMPTY", "timeout": 900000, "chunkTimeout": 120000 },
"models": {
"default_model": { // MUST be default_model (gotcha 2)
"name": "<MODEL> (TP=2)",
"reasoning": true, "tool_call": false, "temperature": true,
"limit": { "context": 131072, "output": 8192 }
}
}
}
mlx_lm.server streams by default; tool_call:false (raw server has no OpenAI function-calling; route through a tool adapter if you need agentic tools).
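For a client without an SDK, a minimal standard-library sketch of the same request looks like this. The point is that the "model" field is hard-coded to default_model (gotcha 2). The function name is hypothetical, and the base URL is whatever address your rank-0 box listens on.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str,
                       max_tokens: int = 40) -> urllib.request.Request:
    """Build a POST to /chat/completions that always asks for default_model."""
    body = {
        "model": "default_model",  # never the on-disk checkpoint path
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://127.0.0.1:8000/v1", "Capital of France? One word.")
print(req.get_method(), req.full_url)
```

Sending it is then `urllib.request.urlopen(req, timeout=900)`, with a generous timeout because long prompts can spend minutes in prefill.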
9. Sizing & known limitations
Per-box budget: 493 GiB wired ceiling − ~277 GiB weights ≈ 216 GiB for KV cache + activations + prefill scratch. MLA (low-rank compressed KV) makes KV cheap per token; the constraint is the prefill activation peak (set by --prefill-step-size), not KV. A smaller-on-disk quant can peak higher at runtime, so budget the runtime peak, not the disk footprint.
Limitations: single-stream (one request at a time); no prefix cache (a growing conversation re-prefills each turn); not boot-persistent (the serve is nohup, so relaunch after reboot); occupies both boxes (can't coexist with another TP=2 model on the pair).
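The budget arithmetic, spelled out as a quick sketch. It assumes iogpu.wired_limit_mb is counted in MiB, which is what makes 505000 come out at the ~493 GiB quoted in §4.1; the weights figure is the ~277 GiB per box from this guide.

```python
WIRED_LIMIT_MB = 505_000   # value set via sysctl in section 4.1
WEIGHTS_GIB = 277          # approximate resident weights per box

# Assumption: the sysctl value is in MiB, so divide by 1024 for GiB.
wired_ceiling_gib = WIRED_LIMIT_MB / 1024
headroom_gib = wired_ceiling_gib - WEIGHTS_GIB

print(round(wired_ceiling_gib), round(headroom_gib))
```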
10. Troubleshooting matrix
| Symptom | Cause | Fix |
|---|---|---|
| Output coherent → gibberish → payload corrupt after retries, rank crash | en5 MTU = 1500 | sudo ifconfig en5 mtu 9000 + LaunchDaemon (§4.2) |
| 404 then server unstable/crashes | client sent a non-default_model id | clients send default_model |
| [METAL] … Insufficient Memory, rank exits 255 | prefill batch too large for bf16-attn model | lower --prefill-step-size (512) |
| Relaunch instantly OOMs; wired ≈ 492 GiB, no process | leaked wired memory from a prior OOM | reboot both Macs |
| ValueError: … contains custom code on load | custom-tokenizer checkpoint | add --trust-remote-code |
| Ring won't form / [jaccl] Recv failed / both ranks idle | wedged JACCL/Metal (often post-crash) | reboot both, relaunch |
| First message "hangs" minutes | big prompt × tiny --prefill-step-size | raise prefill-step-size (within the OOM-safe envelope) |
| /v1/models 200 but generation hangs/empty | a rank died at decode | check log for crash class; relaunch (reboot if leaked) |
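The log-visible rows of the matrix can be turned into a tiny first-pass triage script. This is only a sketch: the substrings are taken from the error text quoted in this guide, the second sample log line is invented for illustration, and a match narrows down the cause rather than proving it.

```python
from typing import Iterable, List, Tuple

# (log substring, likely cause, fix), copied from the matrix above.
RULES: List[Tuple[str, str, str]] = [
    ("payload corrupt", "en5 MTU = 1500", "sudo ifconfig en5 mtu 9000 + LaunchDaemon"),
    ("Insufficient Memory", "prefill batch too large", "lower --prefill-step-size"),
    ("contains custom code", "custom-tokenizer checkpoint", "add --trust-remote-code"),
    ("Recv failed", "wedged JACCL/Metal", "reboot both, relaunch"),
]


def diagnose(log_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return each (cause, fix) whose substring appears, in first-seen order."""
    found: List[Tuple[str, str]] = []
    for line in log_lines:
        for needle, cause, fix in RULES:
            if needle in line and (cause, fix) not in found:
                found.append((cause, fix))
    return found


sample_log = [
    "RuntimeError: share_object: payload corrupt after retries",
    "some unrelated informational line",  # invented, should match nothing
]
for cause, fix in diagnose(sample_log):
    print(f"{cause} -> {fix}")
```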
11. Measured performance (Kimi-K2.7-Code, this setup)
| Metric | Result |
|---|---|
| Decode (generation) | ~20 tok/s sustained |
| Prefill (prompt processing) | ~286 tok/s (7,016-token prompt in ~24.5s) |
| TTFT, warm short prompt | ~0.3s |
| TTFT, ~2K-token prompt | ~3s |
| First request (cold) | +~5s one-time (graph compile) |
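To get a feel for what the measured prefill rate means for long prompts, here is a back-of-the-envelope sketch. It extrapolates linearly from the single 7,016-token measurement above, which is an assumption: the real rate will vary with prompt length and --prefill-step-size, so treat the results as rough estimates only.

```python
# Measured above: a 7,016-token prompt processed in about 24.5 s.
PREFILL_TOK_PER_S = 7016 / 24.5


def prefill_seconds(prompt_tokens: int, rate: float = PREFILL_TOK_PER_S) -> float:
    """Estimated prefill time, assuming a constant tokens-per-second rate."""
    return prompt_tokens / rate


for n in (16_384, 60_000, 80_000):
    secs = prefill_seconds(n)
    print(f"{n} tokens: ~{secs:.0f} s (~{secs / 60:.1f} min)")
```

Because there is no prefix cache (§9), a growing conversation pays this cost again on every turn.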
12. Crash-recovery discipline
This stack is fragile to crashes: an OOM or corruption crash can pin wired memory, and the clean recovery is a reboot of both Macs. Operating rules: (1) tune --prefill-step-size conservatively, raise in small steps; (2) clients send only default_model; (3) keep MTU 9000 persisted so it's never the variable; (4) if wired is stuck high with no process, reboot, log into the GUI, confirm en5 mtu 9000 auto-applied, relaunch.
Generation speed is OK, but prompt processing is so slow (~286 tok/s) that it is completely unusable. :(
If nearly 300 tokens/sec prompt processing is unusable to you on a 1 trillion parameter SOTA model, then I have no idea what to tell you.
I tried a handful of different ways to accelerate prompt processing, but until an upgrade to M5 Ultra it's a Metal/silicon limit particular to Apple. The other avenue is to train an EAGLE-3 model as a drafter and convert it to MLX. I was planning on using a 2.6 version of the drafter to try out MTP and see what the matching rate is. I will have hands on a DGX Station soon and could likely run the training to get to an optimized drafter.
I don't understand you. In a coding or agent task, the input jumps to 60-80k tokens in a matter of seconds. So I have to wait 4 minutes each time for it to even process the input!
My own machine has an NVIDIA 3090, and even 2,000-3,000 tok/s prompt processing (llama.cpp/qwen3.6) slows the task down terribly in the case of a Hermes agent.
My guess is that the input jumps to 60k tokens because you have no idea how to configure an intelligent prompt. I have a 5090 that I use regularly for quick jobs, but when I need careful and thorough analysis of code, I do not dump the entire codebase into the AI and say, "Have fun and fix it, thanks!"
I also have no idea why you'd think that 4 minutes is some eternity to process 80k tokens! That's crazy talk. This is a SOTA model, routinely ranked in the top 10 models in the entire world. And you cannot wait 4 minutes for it to solve something for you?
In looking at your own profile, I see that you have a list of models sized 0.8B - 2B. Models that size give you a fast response that's nonsense. Why even bother with them? You understand how much better this model is, right?