MLX g32 bf16 conversion for TP=2 Mac Studios @20.1 tok/s average

#7
by whistlercapital - opened

For no loss due to dequant/requant, simple repackaging to MLX a quick guide to the extent helpful, now working smoothly with opencode:

Tensor-Parallel Serving Across Two Mac Studios (MLX + JACCL / Thunderbolt 5)

What this is: a battle-tested guide to serving the model across two Mac Studio M3 Ultra boxes as a single OpenAI-compatible endpoint, using MLX tensor parallelism (TP=2) over Apple's JACCL collective library on a Thunderbolt-5 RDMA ring.

It is written from a real bring-up that took many hours and several reboots. The Β§1 gotchas are the whole point β€” every one cost us time. Read them first.

Placeholders used throughout (substitute your own): <USER> = your macOS username Β· studioA / <STUDIO_A_IP> = the Mac running rank 0 (HTTP) Β· studioB / <STUDIO_B_IP> = rank 1 (compute) Β· ~ = home Β· ~/venvs/mlx-tp2 = the MLX venv Β· ~/cluster/ = host-side config dir Β· ~/models/ = model dir.


0. Convert the checkpoint to MLX (no quantization) β€” run first, on each Mac

Many large MoE checkpoints ship as compressed-tensors (INT4 group-32 routed experts + BF16 everything else). The modern mlx_lm loader reads that format natively, so "conversion" is just adding a quantization block to config.json β€” no weight bytes change.

# On each Mac. Symlinks the native shards into <dst>, writes a patched config.json.
python3 repack_mlx.py --src ~/models/<MODEL>-native --dst ~/models/<MODEL>-MLX

repack_mlx.py (self-contained):

#!/usr/bin/env python3
import argparse, json, os
from pathlib import Path
QUANT = {"group_size": 32, "bits": 4, "mode": "affine"}  # match the native expert grid
ap = argparse.ArgumentParser()
ap.add_argument("--src", required=True, type=Path)   # native compressed-tensors checkpoint
ap.add_argument("--dst", required=True, type=Path)   # MLX output dir (symlink farm)
a = ap.parse_args()
a.dst.mkdir(parents=True, exist_ok=True)
for f in sorted(a.src.iterdir()):
    if f.name == "config.json":
        continue
    t = a.dst / f.name
    if t.is_symlink() or t.exists():
        t.unlink()
    os.symlink(f.resolve(), t)                       # symlink every shard/file (no copy)
cfg = json.loads((a.src / "config.json").read_text())
cfg["quantization"] = QUANT                          # the ONLY new bytes on disk
(a.dst / "config.json").write_text(json.dumps(cfg, indent=2))
print(f"repacked {a.src} -> {a.dst} (experts int4 g32; attn/mlp/embed/lm_head bf16; no requant)")

Equivalent without the script:

SRC=~/models/<MODEL>-native; DST=~/models/<MODEL>-MLX
mkdir -p "$DST"
ln -s "$SRC"/* "$DST"/ && rm "$DST/config.json"
python3 -c 'import json,sys; c=json.load(open(sys.argv[1])); c["quantization"]={"group_size":32,"bits":4,"mode":"affine"}; json.dump(c,open(sys.argv[2],"w"),indent=2)' "$SRC/config.json" "$DST/config.json"

At load, mlx_lm reinterprets the packed INT4 (weight_packed.view(uint32), biases = -8*scales) and wraps only the expert modules as int4 QuantizedSwitchLinear; BF16 tensors stay BF16. (mlx_lm.convert is the other path β€” it dequantizes then requantizes, which changes the weights; this guide does not use it.)

Do this on both Macs at the same path, then continue.


1. The six hard-won gotchas (read before anything else)

  1. en5 MTU must be 9000. The TB5 ring defaults to MTU 1500 after every reboot. At 1500, JACCL's _share_object (rank-0 β†’ rank-1 prompt transfer) corrupts payloads under load β€” coherent output for a request or two, then gibberish (<br><br>3333…), then RuntimeError: share_object: payload corrupt after retries and a rank crash. Fix it AND persist it (Β§4.2).

  2. Request the served model id, not the checkpoint path. mlx_lm.server advertises the model by its on-disk path. Sending that path as the "model" field triggers a broken distributed code path β†’ corruption. Sending the built-in alias default_model uses the loaded model directly and is rock-solid. Any other string returns 404 and can desync/crash the distributed server. Clients send default_model.

  3. --prefill-step-size is a memory landmine on bf16-attention models. Bigger = faster prompt processing, but the prefill activation peak scales with it. On a model with bf16 attention, 2048 OOMs; 64 works but is painfully slow (minutes for a 16K-token prompt). 512 is the safe middle. A model with 8-bit attention tolerates 2048.

  4. An OOM crash leaks wired GPU memory β€” only a reboot clears it. When a rank hits [METAL] … Insufficient Memory, the process dies but its wired memory stays pinned (wired β‰ˆ 492 GiB with no process holding it). Any relaunch then OOMs immediately. No userspace reclaim β€” reboot both Macs. Tune conservatively; every wrong guess costs a reboot.

  5. A reboot also clears a "stuck" JACCL ring / wedged Metal state. If the ring won't form or a prior crash left things wedged, reboot both boxes. After reboot, log into each via Screen Sharing so Metal/GPU is available (a headless boot may not expose the GPU; user LaunchAgents need a GUI session).

  6. Custom-tokenizer checkpoints need --trust-remote-code. compressed-tensors models carry an auto_map + custom tokenizer; without the flag the load throws ValueError: … contains custom code on every rank.


2. Topology

            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 client ───▢│ studioA  <STUDIO_A_IP>   │◀═══════▢│ studioB  <STUDIO_B_IP>   β”‚
 (OpenAI    β”‚ rank 0 Β· HTTP :8000      β”‚  TB5    β”‚ rank 1 Β· compute-only    β”‚
  /v1)      β”‚ ~277 GiB weights         β”‚  en5    β”‚ ~277 GiB weights         β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  JACCL  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                M3 Ultra Β· 512 GB RAM    RDMA        M3 Ultra Β· 512 GB RAM
                wired ceiling 493 GiB                wired ceiling 493 GiB
  • Rank 0 = studioA runs the HTTP server and computes. Rank 1 = studioB is compute-only (no HTTP listener of its own).
  • The model is tensor-sharded: each box holds ~half the weights and they run each forward pass in lockstep, exchanging activations over the ring.
  • Transport: JACCL over en5 (the direct Thunderbolt-5 cable, link-local 169.254.x) β€” not an Ethernet fabric.

3. Prerequisites (per Mac)

Component Value used here
OS macOS 26.x, GUI session logged in (Screen Sharing OK)
Python venv ~/venvs/mlx-tp2 (must contain mlx.launch)
MLX mlx==0.31.2, mlx-lm==0.31.3
Model class loader must include your model class and mx.distributed.is_available() == True
Model on disk identical path on both boxes, e.g. ~/models/<MODEL>-MLX (APFS is case-insensitive)
SSH passwordless studioA ⇄ studioB (mlx.launch SSHes rank 1)
~/venvs/mlx-tp2/bin/python -c \
 'import mlx.core as mx, mlx_lm; print("mlx",mx.__version__,"mlx_lm",mlx_lm.__version__,"dist",mx.distributed.is_available())'
ls ~/venvs/mlx-tp2/bin/mlx.launch

One-time host setup

4.1 Appliance settings (max memory + no Spotlight stalls)

Raise the IOGPU wired limit and quiet the OS so the big job gets the whole machine:

sudo sysctl -w iogpu.wired_limit_mb=505000      # ~493 GiB wired ceiling
sudo mdutil -i off /                             # Spotlight off (write bursts during load starve Metal)
sudo mdutil -i off /Volumes/* 2>/dev/null || true
sudo touch /.metadata_never_index

(iogpu.wired_limit_mb is not persistent by default β€” persist it with a boot LaunchDaemon like Β§4.2, or your provisioning.)

4.2 Persist en5 MTU 9000 (THE critical fix)

The MTU resets to 1500 on every boot, so re-apply automatically. A root LaunchDaemon sets it at boot and re-asserts every 60s.

Immediate (both Macs): sudo ifconfig en5 mtu 9000

Persistent (both Macs, one-time):

sudo mkdir -p /usr/local/bin
printf '#!/bin/bash\nfor i in $(seq 1 15); do /sbin/ifconfig en5 >/dev/null 2>&1 && { /sbin/ifconfig en5 mtu 9000; break; }; sleep 2; done\n' | sudo tee /usr/local/bin/en5-mtu9000.sh
sudo chmod 755 /usr/local/bin/en5-mtu9000.sh
printf '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n<plist version="1.0"><dict><key>Label</key><string>com.local.en5-mtu9000</string><key>ProgramArguments</key><array><string>/bin/bash</string><string>/usr/local/bin/en5-mtu9000.sh</string></array><key>RunAtLoad</key><true/><key>StartInterval</key><integer>60</integer></dict></plist>\n' | sudo tee /Library/LaunchDaemons/com.local.en5-mtu9000.plist
sudo chown root:wheel /Library/LaunchDaemons/com.local.en5-mtu9000.plist && sudo chmod 644 /Library/LaunchDaemons/com.local.en5-mtu9000.plist
sudo launchctl bootstrap system /Library/LaunchDaemons/com.local.en5-mtu9000.plist || sudo launchctl load -w /Library/LaunchDaemons/com.local.en5-mtu9000.plist

Verify after a reboot: ifconfig en5 | grep mtu β†’ mtu 9000, no manual step.

4.3 JACCL hostfile (~/cluster/jaccl_hostfile.json on studioA)

{
  "backend": "jaccl",
  "envs": ["MLX_METAL_FAST_SYNCH=1"],
  "hosts": [
    {"ssh": "<USER>@<STUDIO_A_IP>", "ips": ["<STUDIO_A_IP>"], "rdma": [null, "rdma_en5"]},
    {"ssh": "<USER>@<STUDIO_B_IP>", "ips": [],                "rdma": ["rdma_en5", null]}
  ]
}

The rdma_en5 entries pin the collective to the Thunderbolt link. Rank order here defines rank 0 (studioA) and rank 1 (studioB).


5. The launcher (~/cluster/tp2-launch.sh on studioA)

mlx.launch reads the hostfile, starts rank 0 locally, SSHes rank 1, and runs the same mlx_lm.server on both. Per-model defaults keep each model in its safe envelope:

#!/bin/bash
set -euo pipefail
MODEL_NAME=${1:?Usage: tp2-launch.sh MODEL_NAME [PORT]}
PORT=${2:-8000}

case "$MODEL_NAME" in
  # bf16-attention model -> 512 (2048 OOMs)
  <MODEL>-MLX)  VENV_NAME="mlx-tp2"; PREFILL_DEFAULT=512;  MAXTOK_DEFAULT=8192 ;;
  *) echo "Unsupported TP2 model: $MODEL_NAME" >&2; exit 64 ;;
esac
PREFILL=${TP2_PREFILL:-$PREFILL_DEFAULT}
MAX_TOKENS=${TP2_MAX_TOKENS:-$MAXTOK_DEFAULT}

LAUNCHER="$HOME/venvs/$VENV_NAME/bin/mlx.launch"
REMOTE_CMD=$(cat <<REMOTE
PY="\$HOME/venvs/$VENV_NAME/bin/python"
MODEL_PATH="\$HOME/models/$MODEL_NAME"
exec "\$PY" -m mlx_lm.server --model "\$MODEL_PATH" --host 0.0.0.0 --port $PORT \
  --max-tokens $MAX_TOKENS --prefill-step-size $PREFILL \
  --decode-concurrency 1 --prompt-concurrency 1 \
  --prompt-cache-size 0 --prompt-cache-bytes 0 --trust-remote-code
REMOTE
)
exec "$LAUNCHER" --verbose --cwd /tmp --hostfile "$HOME/cluster/jaccl_hostfile.json" \
  --no-verify-script -- env bash -c "$REMOTE_CMD"

Notes: --trust-remote-code (gotcha 6); --decode-concurrency 1 --prompt-concurrency 1 = single-stream (one request at a time); --prompt-cache-size 0 disables prefix caching (Β§9).


6. Bring-up

Pre-flight: both Macs powered, GUI logged in, wired β‰ˆ 5–6 GiB (no leak), ifconfig en5 shows mtu 9000, iogpu.wired_limit_mb = 505000. If wired is high with no process β†’ leak β†’ reboot (gotcha 4).

# 0) clean stale ranks on both boxes
ssh studioA 'pkill -9 -f "mlx.launch|mlx_lm.server"; rm -f ~/cluster/tp2.log'
ssh studioB 'pkill -9 -f "mlx.launch|mlx_lm.server"'

# 1) launch (rank-0 box drives both ranks via the hostfile)
ssh studioA 'nohup ~/cluster/tp2-launch.sh <MODEL>-MLX 8000 > ~/cluster/tp2.log 2>&1 < /dev/null &'

# 2) watch load: tokenizer loads on BOTH ranks, then weights materialize (~2–4 min)
ssh studioA 'tail -f ~/cluster/tp2.log'   # look for "Reloaded tiktoken model from <studioB path>"

Healthy load reaches ~277 GiB resident per box under request load. A momentary idle imbalance (e.g. studioA=337 / studioB=6) is just MLX lazy materialization β€” both ranks engage when a request arrives.


7. Validation

# A) generation (ALWAYS model=default_model)
ssh studioA 'curl -s http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d "{\"model\":\"default_model\",\"messages\":[{\"role\":\"user\",\"content\":\"Capital of France? One word.\"}],\"max_tokens\":40,\"temperature\":0}"'

# B) streaming (SSE) β€” one "data:" line per token
ssh studioA 'curl -sN http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d "{\"model\":\"default_model\",\"stream\":true,\"messages\":[{\"role\":\"user\",\"content\":\"Count 1 to 6.\"}],\"max_tokens\":50}" | grep "^data:"'

# C) corruption/OOM scan (expect empty)
ssh studioA 'grep -E "payload corrupt|OutOfMemory|exited with code" ~/cluster/tp2.log'

8. Client wiring (OpenAI-compatible, e.g. OpenCode)

"studio_tp2": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "Studio TP=2 <MODEL>",
  "options": { "baseURL": "http://<STUDIO_A_IP>:8000/v1", "apiKey": "EMPTY", "timeout": 900000, "chunkTimeout": 120000 },
  "models": {
    "default_model": {                      // MUST be default_model (gotcha 2)
      "name": "<MODEL> (TP=2)",
      "reasoning": true, "tool_call": false, "temperature": true,
      "limit": { "context": 131072, "output": 8192 }
    }
  }
}

mlx_lm.server streams by default; tool_call:false (raw server has no OpenAI function-calling β€” route through a tool adapter if you need agentic tools).


9. Sizing & known limitations

Per-box budget: 493 GiB wired ceiling βˆ’ ~277 GiB weights β‰ˆ 216 GiB for KV cache + activations + prefill scratch. MLA (low-rank compressed KV) makes KV cheap per token; the constraint is the prefill activation peak (set by --prefill-step-size), not KV. A smaller-on-disk quant can peak higher at runtime β€” budget the runtime peak, not the disk footprint.

Limitations: single-stream (one request at a time); no prefix cache (a growing conversation re-prefills each turn); not boot-persistent (the serve is nohup β€” relaunch after reboot); occupies both boxes (can't coexist with another TP=2 model on the pair).


10. Troubleshooting matrix

Symptom Cause Fix
Output coherent β†’ gibberish β†’ payload corrupt after retries, rank crash en5 MTU = 1500 sudo ifconfig en5 mtu 9000 + LaunchDaemon (Β§4.2)
404 then server unstable/crashes client sent a non-default_model id clients send default_model
[METAL] … Insufficient Memory, rank exits 255 prefill batch too large for bf16-attn model lower --prefill-step-size (512)
Relaunch instantly OOMs; wired β‰ˆ 492 GiB, no process leaked wired memory from a prior OOM reboot both Macs
ValueError: … contains custom code on load custom-tokenizer checkpoint add --trust-remote-code
Ring won't form / [jaccl] Recv failed / both ranks idle wedged JACCL/Metal (often post-crash) reboot both, relaunch
First message "hangs" minutes big prompt Γ— tiny --prefill-step-size raise prefill-step-size (within the OOM-safe envelope)
/v1/models 200 but generation hangs/empty a rank died at decode check log for crash class; relaunch (reboot if leaked)

11. Measured performance (Kimi-K2.7-Code, this setup)

Metric Result
Decode (generation) ~20 tok/s sustained
Prefill (prompt processing) ~286 tok/s (7,016-token prompt in ~24.5s)
TTFT, warm short prompt ~0.3s
TTFT, ~2K-token prompt ~3s
First request (cold) +~5s one-time (graph compile)

12. Crash-recovery discipline

This stack is fragile to crashes: an OOM or corruption crash can pin wired memory, and the clean recovery is a reboot of both Macs. Operating rules: (1) tune --prefill-step-size conservatively, raise in small steps; (2) clients send only default_model; (3) keep MTU 9000 persisted so it's never the variable; (4) if wired is stuck high with no process, reboot, log into the GUI, confirm en5 mtu 9000 auto-applied, relaunch.

Generation speed its ok but prompt processing is so slow (~286 tok/s) that it is completely unusable. :(

Generation speed its ok but prompt processing is so slow (~286 tok/s) that it is completely unusable. :(

If nearly 300 tokens/sec prompt processing is unusable to you on a 1 trillion parameter SOTA model, then I have no idea what to tell you.

tried a handful of different ways to accelerate the prompt processing but until an upgrade to M5Ultra, its a metal/silicon limit particular to apple. The other avenue is to train an eagle3 model as drafter and convert to MLX. I was planning on using a 2.6 version of the drafter to try out MTP to see what the matching rate is, will have hands on a DGX Station soon and could likely run the training to get to an optimized drafter.

I don't understand you. In a coding or agent task, the input jumps to 60-80k tokens in a matter of seconds. So I have to wait 4 minutes each time for it to even process the input!
My own machine has an nvidia 3090 and even 2-3000 pp (llama.cpp/qwen3.6) slows down the task terribly in the case of a hermes agent.

I don't understand you. In a coding or agent task, the input jumps to 60-80k tokens in a matter of seconds. So I have to wait 4 minutes each time for it to even process the input!
My own machine has an nvidia 3090 and even 2-3000 pp (llama.cpp/qwen3.6) slows down the task terribly in the case of a hermes agent.

The input jumps to 60k tokens because you have no idea how to configure an intelligent prompt would be my guess. I have a 5090 that I use regularly for quick jobs, but when I need careful and thorough analysis of a code, I do not dump the entire codebase into the AI and say, "Have fun and fix it, thanks!"

I also have no idea why you'd think that 4 minutes is some eternity to process 80k tokens! That's crazy talk. This is a SOTA model, routinely ranked in the top 10 models in the entire world. And you cannot wait 4 minutes for it solve something for you?

In looking at your own profile, I see that you have a list of models sized 0.8B - 2B. Models that size give you a fast response that's nonsense. Why even bother with them? You understand how much better this model is, right?

Sign up or log in to comment