Instructions to use moonshotai/Kimi-K2.7-Code with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use moonshotai/Kimi-K2.7-Code with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="moonshotai/Kimi-K2.7-Code", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("moonshotai/Kimi-K2.7-Code", trust_remote_code=True, dtype="auto")


vLLM

How to use moonshotai/Kimi-K2.7-Code with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moonshotai/Kimi-K2.7-Code"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.7-Code",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/moonshotai/Kimi-K2.7-Code

SGLang

How to use moonshotai/Kimi-K2.7-Code with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moonshotai/Kimi-K2.7-Code" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.7-Code",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moonshotai/Kimi-K2.7-Code" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.7-Code",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use moonshotai/Kimi-K2.7-Code with Docker Model Runner:
```
docker model run hf.co/moonshotai/Kimi-K2.7-Code
```

MLX g32 bf16 conversion for TP=2 Mac Studios @20.1 tok/s average

by whistlercapital - opened 4 days ago

Discussion

whistlercapital

4 days ago (edited 4 days ago)

A quick guide, in case it is helpful: a simple repackaging to MLX with no loss from dequant/requant. It is now working smoothly with opencode:

Tensor-Parallel Serving Across Two Mac Studios (MLX + JACCL / Thunderbolt 5)

What this is: a battle-tested guide to serving the model across two Mac Studio M3 Ultra boxes as a single OpenAI-compatible endpoint, using MLX tensor parallelism (TP=2) over Apple's JACCL collective library on a Thunderbolt-5 RDMA ring.

It is written from a real bring-up that took many hours and several reboots. The §1 gotchas are the whole point — every one cost us time. Read them first.

Placeholders used throughout (substitute your own): <USER> = your macOS username · studioA / <STUDIO_A_IP> = the Mac running rank 0 (HTTP) · studioB / <STUDIO_B_IP> = rank 1 (compute) · ~ = home · ~/venvs/mlx-tp2 = the MLX venv · ~/cluster/ = host-side config dir · ~/models/ = model dir.

0. Convert the checkpoint to MLX (no quantization) — run first, on each Mac

Many large MoE checkpoints ship as compressed-tensors (INT4 group-32 routed experts + BF16 everything else). The modern mlx_lm loader reads that format natively, so "conversion" is just adding a quantization block to config.json — no weight bytes change.

# On each Mac. Symlinks the native shards into <dst>, writes a patched config.json.
python3 repack_mlx.py --src ~/models/<MODEL>-native --dst ~/models/<MODEL>-MLX

repack_mlx.py (self-contained):

#!/usr/bin/env python3
import argparse, json, os
from pathlib import Path
QUANT = {"group_size": 32, "bits": 4, "mode": "affine"}  # match the native expert grid
ap = argparse.ArgumentParser()
ap.add_argument("--src", required=True, type=Path)   # native compressed-tensors checkpoint
ap.add_argument("--dst", required=True, type=Path)   # MLX output dir (symlink farm)
a = ap.parse_args()
a.dst.mkdir(parents=True, exist_ok=True)
for f in sorted(a.src.iterdir()):
    if f.name == "config.json":
        continue
    t = a.dst / f.name
    if t.is_symlink() or t.exists():
        t.unlink()
    os.symlink(f.resolve(), t)                       # symlink every shard/file (no copy)
cfg = json.loads((a.src / "config.json").read_text())
cfg["quantization"] = QUANT                          # the ONLY new bytes on disk
(a.dst / "config.json").write_text(json.dumps(cfg, indent=2))
print(f"repacked {a.src} -> {a.dst} (experts int4 g32; attn/mlp/embed/lm_head bf16; no requant)")

Equivalent without the script:

SRC=~/models/<MODEL>-native; DST=~/models/<MODEL>-MLX
mkdir -p "$DST"
ln -s "$SRC"/* "$DST"/ && rm "$DST/config.json"
python3 -c 'import json,sys; c=json.load(open(sys.argv[1])); c["quantization"]={"group_size":32,"bits":4,"mode":"affine"}; json.dump(c,open(sys.argv[2],"w"),indent=2)' "$SRC/config.json" "$DST/config.json"

At load, mlx_lm reinterprets the packed INT4 (weight_packed.view(uint32), biases = -8*scales) and wraps only the expert modules as int4 QuantizedSwitchLinear; BF16 tensors stay BF16. (mlx_lm.convert is the other path — it dequantizes then requantizes, which changes the weights; this guide does not use it.)
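The `biases = -8*scales` identity can be checked with a toy example. This is a sketch under one assumption that the text above implies but does not state: each packed nibble stores the signed INT4 value offset by 8 (range 0..15). Under that assumption, symmetric dequantization `scale * q_signed` equals affine dequantization `scale * q_unsigned + bias` with `bias = -8 * scale`. The function names are illustrative and are not mlx_lm APIs.

```python
# Toy check: symmetric signed INT4 vs. affine unsigned INT4 with bias = -8 * scale.
# Assumption: the stored nibble is the signed value plus 8 (range 0..15).

def dequant_symmetric(q_signed, scale):
    """Symmetric dequantization: w = scale * q, with q in [-8, 7]."""
    return [scale * q for q in q_signed]

def dequant_affine(q_unsigned, scale, bias):
    """Affine dequantization: w = scale * q + bias, with q in [0, 15]."""
    return [scale * q + bias for q in q_unsigned]

scale = 0.25                         # exactly representable, so the comparison is exact
q_signed = list(range(-8, 8))        # every signed INT4 value
q_unsigned = [q + 8 for q in q_signed]

sym = dequant_symmetric(q_signed, scale)
aff = dequant_affine(q_unsigned, scale, bias=-8 * scale)
print(sym == aff)
```

Because the two forms agree value for value, the weight bytes can stay untouched and only the interpretation changes.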

Do this on both Macs at the same path, then continue.

1. The six hard-won gotchas (read before anything else)

1. en5 MTU must be 9000. The TB5 ring defaults to MTU 1500 after every reboot. At 1500, JACCL's _share_object (rank-0 → rank-1 prompt transfer) corrupts payloads under load: coherent output for a request or two, then gibberish (<br><br>3333…), then RuntimeError: share_object: payload corrupt after retries and a rank crash. Fix it AND persist it (§4.2).
2. Request the served model id, not the checkpoint path. mlx_lm.server advertises the model by its on-disk path. Sending that path as the "model" field triggers a broken distributed code path → corruption. Sending the built-in alias default_model uses the loaded model directly and is rock-solid. Any other string returns 404 and can desync/crash the distributed server. Clients send default_model.
3. --prefill-step-size is a memory landmine on bf16-attention models. Bigger = faster prompt processing, but the prefill activation peak scales with it. On a model with bf16 attention, 2048 OOMs; 64 works but is painfully slow (minutes for a 16K-token prompt). 512 is the safe middle. A model with 8-bit attention tolerates 2048.
4. An OOM crash leaks wired GPU memory, and only a reboot clears it. When a rank hits [METAL] … Insufficient Memory, the process dies but its wired memory stays pinned (wired ≈ 492 GiB with no process holding it). Any relaunch then OOMs immediately. There is no userspace reclaim: reboot both Macs. Tune conservatively; every wrong guess costs a reboot.
5. A reboot also clears a "stuck" JACCL ring / wedged Metal state. If the ring won't form or a prior crash left things wedged, reboot both boxes. After reboot, log into each via Screen Sharing so Metal/GPU is available (a headless boot may not expose the GPU; user LaunchAgents need a GUI session).
6. Custom-tokenizer checkpoints need --trust-remote-code. compressed-tensors models carry an auto_map + custom tokenizer; without the flag the load throws ValueError: … contains custom code on every rank.
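To make the --prefill-step-size trade-off concrete, here is a small sketch that counts how many prefill passes a prompt needs at each step size. It assumes a simplified model in which the prompt is processed in consecutive chunks of `step_size` tokens; `prefill_chunks` is a hypothetical helper, not part of mlx_lm.

```python
import math

def prefill_chunks(prompt_tokens: int, step_size: int) -> int:
    """Number of prefill passes when the prompt is processed step_size tokens at a time."""
    if prompt_tokens < 0 or step_size <= 0:
        raise ValueError("prompt_tokens must be >= 0 and step_size must be > 0")
    return math.ceil(prompt_tokens / step_size)

# A 16K-token prompt at the three step sizes discussed above.
for step in (64, 512, 2048):
    print(step, prefill_chunks(16 * 1024, step))
```

Fewer, larger chunks mean less per-pass overhead but a larger activation peak per pass, which is why the middle value is the compromise described above.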

2. Topology

            ┌──────────────────────────┐         ┌──────────────────────────┐
 client ───▶│ studioA  <STUDIO_A_IP>   │◀═══════▶│ studioB  <STUDIO_B_IP>   │
 (OpenAI    │ rank 0 · HTTP :8000      │  TB5    │ rank 1 · compute-only    │
  /v1)      │ ~277 GiB weights         │  en5    │ ~277 GiB weights         │
            └──────────────────────────┘  JACCL  └──────────────────────────┘
                M3 Ultra · 512 GB RAM    RDMA        M3 Ultra · 512 GB RAM
                wired ceiling 493 GiB                wired ceiling 493 GiB

Rank 0 = studioA runs the HTTP server and computes. Rank 1 = studioB is compute-only (no HTTP listener of its own).
The model is tensor-sharded: each box holds ~half the weights and they run each forward pass in lockstep, exchanging activations over the ring.
Transport: JACCL over en5 (the direct Thunderbolt-5 cable, link-local 169.254.x) — not an Ethernet fabric.
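As a toy illustration of what "tensor-sharded" means here, the sketch below splits one matrix multiply across two "ranks" and sums the partial results, which is the role an all-reduce plays over the ring. It is plain Python with made-up numbers and says nothing about how mlx_lm actually partitions any particular layer.

```python
def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m cols) -> list of m values."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 1.0],
     [1.0, 3.0]]

full = matvec(x, w)  # single-device reference result

# Row-parallel split: rank 0 holds the first half of the input dimension, rank 1 the second.
half = len(x) // 2
partial_rank0 = matvec(x[:half], w[:half])
partial_rank1 = matvec(x[half:], w[half:])

# The "all-reduce" step: partial results are summed across ranks (this is what crosses the link).
combined = [a + b for a, b in zip(partial_rank0, partial_rank1)]
print(full, combined)
```

Each rank only ever touches its half of the weights, but neither partial result is usable alone, so both ranks must take part in every forward pass.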

3. Prerequisites (per Mac)

Component	Value used here
OS	macOS 26.x, GUI session logged in (Screen Sharing OK)
Python venv	`~/venvs/mlx-tp2` (must contain `mlx.launch`)
MLX	`mlx==0.31.2`, `mlx-lm==0.31.3`
Model class	loader must include your model class and `mx.distributed.is_available() == True`
Model on disk	identical path on both boxes, e.g. `~/models/<MODEL>-MLX` (APFS is case-insensitive by default)
SSH	passwordless `studioA` ⇄ `studioB` (mlx.launch SSHes rank 1)

~/venvs/mlx-tp2/bin/python -c \
 'import mlx.core as mx, mlx_lm; print("mlx",mx.__version__,"mlx_lm",mlx_lm.__version__,"dist",mx.distributed.is_available())'
ls ~/venvs/mlx-tp2/bin/mlx.launch

4. One-time host setup

4.1 Appliance settings (max memory + no Spotlight stalls)

Raise the IOGPU wired limit and quiet the OS so the big job gets the whole machine:

sudo sysctl -w iogpu.wired_limit_mb=505000      # ~493 GiB wired ceiling
sudo mdutil -i off /                             # Spotlight off (write bursts during load starve Metal)
sudo mdutil -i off /Volumes/* 2>/dev/null || true
sudo touch /.metadata_never_index

(iogpu.wired_limit_mb is not persistent by default — persist it with a boot LaunchDaemon like §4.2, or your provisioning.)

4.2 Persist `en5` MTU 9000 (THE critical fix)

The MTU resets to 1500 on every boot, so re-apply automatically. A root LaunchDaemon sets it at boot and re-asserts every 60s.

Immediate (both Macs): sudo ifconfig en5 mtu 9000

Persistent (both Macs, one-time):

sudo mkdir -p /usr/local/bin
printf '#!/bin/bash\nfor i in $(seq 1 15); do /sbin/ifconfig en5 >/dev/null 2>&1 && { /sbin/ifconfig en5 mtu 9000; break; }; sleep 2; done\n' | sudo tee /usr/local/bin/en5-mtu9000.sh
sudo chmod 755 /usr/local/bin/en5-mtu9000.sh
printf '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n<plist version="1.0"><dict><key>Label</key><string>com.local.en5-mtu9000</string><key>ProgramArguments</key><array><string>/bin/bash</string><string>/usr/local/bin/en5-mtu9000.sh</string></array><key>RunAtLoad</key><true/><key>StartInterval</key><integer>60</integer></dict></plist>\n' | sudo tee /Library/LaunchDaemons/com.local.en5-mtu9000.plist
sudo chown root:wheel /Library/LaunchDaemons/com.local.en5-mtu9000.plist && sudo chmod 644 /Library/LaunchDaemons/com.local.en5-mtu9000.plist
sudo launchctl bootstrap system /Library/LaunchDaemons/com.local.en5-mtu9000.plist || sudo launchctl load -w /Library/LaunchDaemons/com.local.en5-mtu9000.plist

Verify after a reboot: ifconfig en5 | grep mtu → mtu 9000, no manual step.
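The same MTU check can be scripted for a pre-flight step. The sketch below parses `ifconfig` text for the `mtu` field; `parse_mtu` and `mtu_ok` are hypothetical helper names, and the sample string only imitates the usual macOS layout rather than being captured from these hosts.

```python
import re
from typing import Optional

def parse_mtu(ifconfig_output: str) -> Optional[int]:
    """Return the MTU from `ifconfig <iface>` text, or None if no mtu field is present."""
    match = re.search(r"\bmtu\s+(\d+)", ifconfig_output)
    return int(match.group(1)) if match else None

def mtu_ok(ifconfig_output: str, required: int = 9000) -> bool:
    """True only when the interface reports exactly the required MTU."""
    return parse_mtu(ifconfig_output) == required

# Illustrative text in the usual macOS ifconfig layout (not captured from the hosts).
sample = "en5: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500\n\tstatus: active\n"
print(parse_mtu(sample), mtu_ok(sample))
```

In practice the text would come from something like `subprocess.run(["ifconfig", "en5"], capture_output=True, text=True).stdout` on each Mac, and a launcher could refuse to start when `mtu_ok` is false.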

4.3 JACCL hostfile (`~/cluster/jaccl_hostfile.json` on studioA)

{
  "backend": "jaccl",
  "envs": ["MLX_METAL_FAST_SYNCH=1"],
  "hosts": [
    {"ssh": "<USER>@<STUDIO_A_IP>", "ips": ["<STUDIO_A_IP>"], "rdma": [null, "rdma_en5"]},
    {"ssh": "<USER>@<STUDIO_B_IP>", "ips": [],                "rdma": ["rdma_en5", null]}
  ]
}

The rdma_en5 entries pin the collective to the Thunderbolt link. Rank order here defines rank 0 (studioA) and rank 1 (studioB).
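A hostfile typo is cheap to catch before launch. The sketch below checks a few structural rules that are inferred from the example above and not taken from JACCL documentation: one `rdma` entry per host, `null` in each host's own position, and an `ssh` target on every host. `check_hostfile` is a hypothetical helper.

```python
import json

def check_hostfile(text: str) -> list:
    """Return a list of problems found in a JACCL-style hostfile (empty list = looks consistent)."""
    cfg = json.loads(text)
    hosts = cfg.get("hosts", [])
    problems = []
    if cfg.get("backend") != "jaccl":
        problems.append("backend is not 'jaccl'")
    for rank, host in enumerate(hosts):
        rdma = host.get("rdma", [])
        if len(rdma) != len(hosts):
            problems.append(f"rank {rank}: rdma has {len(rdma)} entries, expected {len(hosts)}")
        elif rdma[rank] is not None:
            problems.append(f"rank {rank}: rdma entry for itself should be null")
        if not host.get("ssh"):
            problems.append(f"rank {rank}: missing ssh target")
    return problems

# Same shape as the hostfile above, with the guide's placeholders left in place.
example = """
{
  "backend": "jaccl",
  "envs": ["MLX_METAL_FAST_SYNCH=1"],
  "hosts": [
    {"ssh": "<USER>@<STUDIO_A_IP>", "ips": ["<STUDIO_A_IP>"], "rdma": [null, "rdma_en5"]},
    {"ssh": "<USER>@<STUDIO_B_IP>", "ips": [], "rdma": ["rdma_en5", null]}
  ]
}
"""
print(check_hostfile(example))
```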

5. The launcher (`~/cluster/tp2-launch.sh` on studioA)

mlx.launch reads the hostfile, starts rank 0 locally, SSHes rank 1, and runs the same mlx_lm.server on both. Per-model defaults keep each model in its safe envelope:

#!/bin/bash
set -euo pipefail
MODEL_NAME=${1:?Usage: tp2-launch.sh MODEL_NAME [PORT]}
PORT=${2:-8000}

case "$MODEL_NAME" in
  # bf16-attention model -> 512 (2048 OOMs)
  <MODEL>-MLX)  VENV_NAME="mlx-tp2"; PREFILL_DEFAULT=512;  MAXTOK_DEFAULT=8192 ;;
  *) echo "Unsupported TP2 model: $MODEL_NAME" >&2; exit 64 ;;
esac
PREFILL=${TP2_PREFILL:-$PREFILL_DEFAULT}
MAX_TOKENS=${TP2_MAX_TOKENS:-$MAXTOK_DEFAULT}

LAUNCHER="$HOME/venvs/$VENV_NAME/bin/mlx.launch"
REMOTE_CMD=$(cat <<REMOTE
PY="\$HOME/venvs/$VENV_NAME/bin/python"
MODEL_PATH="\$HOME/models/$MODEL_NAME"
exec "\$PY" -m mlx_lm.server --model "\$MODEL_PATH" --host 0.0.0.0 --port $PORT \
  --max-tokens $MAX_TOKENS --prefill-step-size $PREFILL \
  --decode-concurrency 1 --prompt-concurrency 1 \
  --prompt-cache-size 0 --prompt-cache-bytes 0 --trust-remote-code
REMOTE
)
exec "$LAUNCHER" --verbose --cwd /tmp --hostfile "$HOME/cluster/jaccl_hostfile.json" \
  --no-verify-script -- env bash -c "$REMOTE_CMD"

Notes: --trust-remote-code (gotcha 6); --decode-concurrency 1 --prompt-concurrency 1 = single-stream (one request at a time); --prompt-cache-size 0 disables prefix caching (§9).

6. Bring-up

Pre-flight: both Macs powered, GUI logged in, wired ≈ 5–6 GiB (no leak), ifconfig en5 shows mtu 9000, iogpu.wired_limit_mb = 505000. If wired is high with no process → leak → reboot (gotcha 4).

# 0) clean stale ranks on both boxes
ssh studioA 'pkill -9 -f "mlx.launch|mlx_lm.server"; rm -f ~/cluster/tp2.log'
ssh studioB 'pkill -9 -f "mlx.launch|mlx_lm.server"'

# 1) launch (rank-0 box drives both ranks via the hostfile)
ssh studioA 'nohup ~/cluster/tp2-launch.sh <MODEL>-MLX 8000 > ~/cluster/tp2.log 2>&1 < /dev/null &'

# 2) watch load: tokenizer loads on BOTH ranks, then weights materialize (~2–4 min)
ssh studioA 'tail -f ~/cluster/tp2.log'   # look for "Reloaded tiktoken model from <studioB path>"

Healthy load reaches ~277 GiB resident per box under request load. A momentary idle imbalance (e.g. studioA=337 / studioB=6) is just MLX lazy materialization — both ranks engage when a request arrives.

7. Validation

# A) generation (ALWAYS model=default_model)
ssh studioA 'curl -s http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d "{\"model\":\"default_model\",\"messages\":[{\"role\":\"user\",\"content\":\"Capital of France? One word.\"}],\"max_tokens\":40,\"temperature\":0}"'

# B) streaming (SSE) — one "data:" line per token
ssh studioA 'curl -sN http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d "{\"model\":\"default_model\",\"stream\":true,\"messages\":[{\"role\":\"user\",\"content\":\"Count 1 to 6.\"}],\"max_tokens\":50}" | grep "^data:"'

# C) corruption/OOM scan (expect empty)
ssh studioA 'grep -E "payload corrupt|Insufficient Memory|OutOfMemory|exited with code" ~/cluster/tp2.log'
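For a scripted health check, the log scan can also be done in Python. This sketch matches only the failure strings quoted in this guide; `scan_log` is a hypothetical helper, and apart from the quoted error string the sample log lines are made up for illustration.

```python
import re

# Patterns taken from the error strings quoted in this guide.
FAILURE_PATTERN = re.compile(
    r"payload corrupt|Insufficient Memory|OutOfMemory|exited with code"
)

def scan_log(lines):
    """Return (line_number, text) pairs for log lines that match a known failure string."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if FAILURE_PATTERN.search(line)
    ]

# Illustrative input: only the RuntimeError text is quoted from the guide.
sample_log = [
    "server ready\n",
    "RuntimeError: share_object: payload corrupt after retries\n",
    "request finished\n",
]
print(scan_log(sample_log))
```

An empty result corresponds to the "expect empty" outcome of the grep; to scan the real file, pass `open(path)` in place of `sample_log`.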

8. Client wiring (OpenAI-compatible, e.g. OpenCode)

"studio_tp2": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "Studio TP=2 <MODEL>",
  "options": { "baseURL": "http://<STUDIO_A_IP>:8000/v1", "apiKey": "EMPTY", "timeout": 900000, "chunkTimeout": 120000 },
  "models": {
    "default_model": {                      // MUST be default_model (gotcha 2)
      "name": "<MODEL> (TP=2)",
      "reasoning": true, "tool_call": false, "temperature": true,
      "limit": { "context": 131072, "output": 8192 }
    }
  }
}

mlx_lm.server streams by default; tool_call is false because the raw server has no OpenAI function-calling (route through a tool adapter if you need agentic tools).
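For clients that read the stream directly instead of going through an SDK, the sketch below reassembles text from the `data:` lines. It assumes the server emits OpenAI-style chat-completion chunks (`choices[].delta.content`) terminated by `data: [DONE]`; `sse_text` is a hypothetical helper and the sample chunks are hand-written, not captured output.

```python
import json

def sse_text(lines):
    """Concatenate the content deltas from OpenAI-style streaming 'data:' lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or "")
    return "".join(parts)

# Hand-written chunks in the assumed format.
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "1, 2"}}]}',
    'data: {"choices": [{"delta": {"content": ", 3"}}]}',
    "data: [DONE]",
]
print(sse_text(sample_stream))
```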

9. Sizing & known limitations

Per-box budget: 493 GiB wired ceiling − ~277 GiB weights ≈ 216 GiB for KV cache + activations + prefill scratch. MLA (low-rank compressed KV) makes KV cheap per token; the constraint is the prefill activation peak (set by --prefill-step-size), not KV. A smaller-on-disk quant can peak higher at runtime — budget the runtime peak, not the disk footprint.
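The budget arithmetic can be reproduced in a few lines. The sketch treats the `iogpu.wired_limit_mb` value as MiB, which is an assumption that happens to match the ~493 GiB ceiling quoted above; the helper names are illustrative.

```python
MIB_PER_GIB = 1024

def wired_ceiling_gib(wired_limit_mb: int) -> float:
    """Convert an iogpu.wired_limit_mb value to GiB, treating the value as MiB (assumption)."""
    return wired_limit_mb / MIB_PER_GIB

def headroom_gib(wired_limit_mb: int, weights_gib: float) -> float:
    """Memory left for KV cache, activations, and prefill scratch after weights are resident."""
    return wired_ceiling_gib(wired_limit_mb) - weights_gib

ceiling = wired_ceiling_gib(505_000)
headroom = headroom_gib(505_000, 277)
print(round(ceiling, 1), round(headroom, 1))
```

The headroom figure is an upper bound: the prefill activation peak has to fit inside it together with the KV cache.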

Limitations: single-stream (one request at a time); no prefix cache (a growing conversation re-prefills each turn); not boot-persistent (the serve is nohup — relaunch after reboot); occupies both boxes (can't coexist with another TP=2 model on the pair).

10. Troubleshooting matrix

Symptom	Cause	Fix
Output coherent → gibberish → `payload corrupt after retries`, rank crash	`en5` MTU = 1500	`sudo ifconfig en5 mtu 9000` + LaunchDaemon (§4.2)
404 then server unstable/crashes	client sent a non-`default_model` id	clients send `default_model`
`[METAL] … Insufficient Memory`, rank exits 255	prefill batch too large for bf16-attn model	lower `--prefill-step-size` (512)
Relaunch instantly OOMs; `wired ≈ 492 GiB`, no process	leaked wired memory from a prior OOM	reboot both Macs
`ValueError: … contains custom code` on load	custom-tokenizer checkpoint	add `--trust-remote-code`
Ring won't form / `[jaccl] Recv failed` / both ranks idle	wedged JACCL/Metal (often post-crash)	reboot both, relaunch
First message "hangs" minutes	big prompt × tiny `--prefill-step-size`	raise prefill-step-size (within the OOM-safe envelope)
`/v1/models` 200 but generation hangs/empty	a rank died at decode	check log for crash class; relaunch (reboot if leaked)

11. Measured performance (Kimi-K2.7-Code, this setup)

Metric	Result
Decode (generation)	~20 tok/s sustained
Prefill (prompt processing)	~286 tok/s (7,016-token prompt in ~24.5s)
TTFT, warm short prompt	~0.3s
TTFT, ~2K-token prompt	~3s
First request (cold)	+~5s one-time (graph compile)
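The prefill figure in the table follows directly from the measured prompt, and the same rate gives a rough wait time for longer prompts. This is a linear extrapolation and only a sketch: the ~3 s TTFT reported for a ~2K-token prompt is faster than a constant rate would predict, so real times will vary with prompt length. The helper names are illustrative.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average throughput over one measured prompt."""
    return tokens / seconds

def prefill_seconds(prompt_tokens: int, rate_tok_s: float) -> float:
    """Estimated prefill time at a constant rate (a simplification)."""
    return prompt_tokens / rate_tok_s

rate = tokens_per_second(7016, 24.5)   # the measured prompt from the table above
print(round(rate))
for n in (16_000, 80_000):
    print(n, round(prefill_seconds(n, rate)), "s")
```

Because prefix caching is disabled in this setup, a growing conversation pays this cost again on every turn.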

12. Crash-recovery discipline

This stack is fragile to crashes: an OOM or corruption crash can pin wired memory, and the clean recovery is a reboot of both Macs. Operating rules: (1) tune --prefill-step-size conservatively, raise in small steps; (2) clients send only default_model; (3) keep MTU 9000 persisted so it's never the variable; (4) if wired is stuck high with no process, reboot, log into the GUI, confirm en5 mtu 9000 auto-applied, relaunch.
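Rule (4) can be turned into an automatic pre-launch check. The guide does not say how wired memory was measured, so this sketch assumes `vm_stat` output as the source; `wired_gib` and `looks_leaked` are hypothetical helpers, the 50 GiB threshold is an arbitrary choice between the ~5–6 GiB idle level and the ~492 GiB leaked level, and the sample text only imitates the `vm_stat` layout.

```python
import re

def wired_gib(vm_stat_output: str) -> float:
    """Compute wired memory in GiB from `vm_stat` text (page size times wired page count)."""
    page = re.search(r"page size of (\d+) bytes", vm_stat_output)
    wired = re.search(r"Pages wired down:\s+(\d+)", vm_stat_output)
    if not page or not wired:
        raise ValueError("unexpected vm_stat output")
    return int(page.group(1)) * int(wired.group(1)) / 2**30

def looks_leaked(vm_stat_output: str, threshold_gib: float = 50.0) -> bool:
    """Heuristic: wired memory far above idle while no serving process is running."""
    return wired_gib(vm_stat_output) > threshold_gib

# Illustrative text in vm_stat's layout (values chosen for the example).
idle_sample = (
    "Mach Virtual Memory Statistics: (page size of 16384 bytes)\n"
    "Pages free:                             1000000.\n"
    "Pages wired down:                        360448.\n"
)
print(wired_gib(idle_sample), looks_leaked(idle_sample))
```

The check is only meaningful before launch, when no rank is running; once the model is loaded, high wired memory is expected.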

mp3pintyo

3 days ago

Generation speed is OK, but prompt processing is so slow (~286 tok/s) that it is completely unusable. :(

x-polyglot-x

2 days ago

> Generation speed is OK, but prompt processing is so slow (~286 tok/s) that it is completely unusable. :(

If nearly 300 tokens/sec prompt processing is unusable to you on a 1 trillion parameter SOTA model, then I have no idea what to tell you.

whistlercapital

2 days ago

I tried a handful of different ways to accelerate prompt processing, but until an upgrade to the M5 Ultra, it's a Metal/silicon limit particular to Apple. The other avenue is to train an EAGLE-3 model as a drafter and convert it to MLX. I was planning to use a 2.6 version of the drafter to try out MTP and see what the matching rate is. I will have hands on a DGX Station soon and could likely run the training to get an optimized drafter.

mp3pintyo

2 days ago

I don't understand you. In a coding or agent task, the input jumps to 60-80k tokens in a matter of seconds. So I have to wait 4 minutes each time for it to even process the input!
My own machine has an nvidia 3090 and even 2-3000 pp (llama.cpp/qwen3.6) slows down the task terribly in the case of a hermes agent.

x-polyglot-x

2 days ago

> I don't understand you. In a coding or agent task, the input jumps to 60-80k tokens in a matter of seconds. So I have to wait 4 minutes each time for it to even process the input!
> My own machine has an nvidia 3090 and even 2-3000 pp (llama.cpp/qwen3.6) slows down the task terribly in the case of a hermes agent.

The input jumps to 60k tokens because you have no idea how to configure an intelligent prompt, would be my guess. I have a 5090 that I use regularly for quick jobs, but when I need careful and thorough analysis of code, I do not dump the entire codebase into the AI and say, "Have fun and fix it, thanks!"

I also have no idea why you'd think that 4 minutes is some eternity to process 80k tokens! That's crazy talk. This is a SOTA model, routinely ranked in the top 10 models in the entire world. And you cannot wait 4 minutes for it to solve something for you?

In looking at your own profile, I see that you have a list of models sized 0.8B - 2B. Models that size give you a fast response that's nonsense. Why even bother with them? You understand how much better this model is, right?

