BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal
Apple silicon has become one of the most capable platforms for local LLM inference due to large unified memory, high memory bandwidth, and a mature GPU compute stack. But the runtimes people actually use on Mac (llama.cpp, MLX-based stacks) don't fully utilize Metal's execution model. They carry abstractions, such as cross-platform CPU-first codebases, lazy-evaluated array frameworks or generic scheduling layers, that leave real performance on the table.
BaseRT is our answer: a from-scratch AI inference runtime written directly against Apple's Metal API, with zero dependency on MLX, PyTorch, CoreML, or any other intermediate framework. Without needless abstractions, lazy graph evaluation, or generic dispatch loops, BaseRT ships chip-specific kernels and a decode loop that does nothing but run your model.
The result, benchmarked across the Qwen3, Llama 3.2, and Gemma 4 families on M3 and M4 Pro devices, is the highest LLM inference throughput reported on Apple silicon to date: up to 1.56× faster decode than llama.cpp, up to 1.35× faster decode than MLX, and up to 1.81× faster prefill on mixture-of-experts models.
- Paper: BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal
- Code: github.com/basecompute/baseRT
- Docs: docs.basecompute.co
What's different about BaseRT
BaseRT is a C++ runtime that targets Metal directly, with a stable C API for cross-language interoperability (Python, Node, Rust, and Swift bindings ship in the repo). The design is built around a few core ideas, each aimed at a specific overhead we identified in existing runtimes.
1. Architecture support as data
Adding a new model family to a typical framework-based runtime usually means new branches in the inference loop. BaseRT instead expresses everything that varies between architectures, i.e. activation and normalization variants, MoE routing specifics, attention and positional-encoding conventions, as a compact architecture descriptor. The core engine consumes the descriptor and never branches on model identity, so the hot path is byte-for-byte identical regardless of which model is loaded. BaseRT currently ships support for LLaMA, Qwen3, Gemma, Whisper, and BERT families, and adding a new one is a self-contained, declarative change.
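To make the "architecture as data" idea concrete, here is a minimal Python sketch. It is not BaseRT's actual descriptor format: `ArchDescriptor`, its fields, and the example values are all hypothetical, chosen only to show how an engine can select behavior from a table instead of branching on model identity.

```python
from dataclasses import dataclass
from enum import Enum
import math


class Activation(Enum):
    SILU = "silu"
    GELU = "gelu"


class Norm(Enum):
    RMSNORM = "rmsnorm"
    LAYERNORM = "layernorm"


@dataclass(frozen=True)
class ArchDescriptor:
    """Hypothetical descriptor; every field name here is illustrative."""
    activation: Activation
    norm: Norm
    rope_theta: float       # positional-encoding base
    num_experts: int        # 0 means a dense feed-forward block
    experts_per_token: int  # how many experts the router picks per token


# Variants are looked up by descriptor field, so the engine never asks
# "which model is this?", only "which variant does the descriptor name?".
ACTIVATIONS = {
    Activation.SILU: lambda x: x / (1.0 + math.exp(-x)),
    Activation.GELU: lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),
}


def apply_activation(desc, values):
    """Apply whichever activation the descriptor selects."""
    fn = ACTIVATIONS[desc.activation]
    return [fn(v) for v in values]


# Supporting another family is a new data entry, not a new engine branch.
# These values are made up for illustration and describe no real model.
DENSE_EXAMPLE = ArchDescriptor(Activation.SILU, Norm.RMSNORM, 10000.0, 0, 0)
MOE_EXAMPLE = ArchDescriptor(Activation.GELU, Norm.RMSNORM, 10000.0, 8, 2)
```

In a real runtime the lookup would presumably resolve to a kernel or pipeline handle once at load time, so the per-token path would not pay for the dictionary access.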
2. A zero-allocation decode loop
Once a model is loaded, the decode loop allocates zero bytes. Residual, attention, and feed-forward scratch buffers, logits and token buffers, and even the KV cache (pre-sized to max context length, laid out for coalesced attention reads) are all allocated once at load time and reused for every token. Even error handling routes through a static thread-local buffer. What's left on the hot path is GPU command dispatch and lightweight CPU-side patching.
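The buffer-reuse pattern can be sketched in a few lines of Python. This is only an illustration of the pattern: `DecodeState` and its fields are hypothetical, it models a single layer and head, and Python itself still allocates temporaries, so it does not reproduce the zero-allocation property of the C++ runtime.

```python
from array import array


class DecodeState:
    """Hypothetical state object: all buffers are sized once, at load time."""

    def __init__(self, max_context, head_dim, vocab_size):
        self.max_context = max_context
        self.head_dim = head_dim
        # KV cache pre-sized to the maximum context length (4 bytes per float).
        self.k_cache = array("f", bytes(4 * max_context * head_dim))
        self.v_cache = array("f", bytes(4 * max_context * head_dim))
        # Logits buffer, overwritten in place on every step.
        self.logits = array("f", bytes(4 * vocab_size))
        self.position = 0

    def append_kv(self, k, v):
        """Write this token's key/value into its preallocated slot."""
        if self.position >= self.max_context:
            raise IndexError("context window is full")
        start = self.position * self.head_dim
        for i in range(self.head_dim):
            self.k_cache[start + i] = k[i]
            self.v_cache[start + i] = v[i]
        self.position += 1
```

Because each token's slot is a fixed offset into a buffer that already exists, appending to the cache never resizes anything; the cost is that memory for the full context is reserved up front.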
3. Hand-fused, hand-written Metal kernels
BaseRT ships a large library of Metal shaders across matmul, attention, normalization, RoPE, embedding, activation, and sampling. The matmul kernels are the core of it: dedicated GEMV kernels for decode (M=1) and GEMM kernels for prefill (M>1), one specialized variant per quantization format from 2-bit through 16-bit. Dequantization happens inline in the compute loop rather than materializing full-precision weights to memory first, so memory traffic scales down with the compression ratio rather than staying fixed.
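As a rough illustration of inline dequantization, the sketch below implements a GEMV over a toy 4-bit format in plain Python. The format is an assumption made for this example (blocks of 32 weights, one scale per block, two nibbles per byte) and is not BaseRT's `.base` layout or any GGUF quantization type; `quantize_row` and `gemv_q4` are hypothetical names.

```python
BLOCK = 32  # weights per quantization block (illustrative choice)


def quantize_row(weights):
    """Pack one row into (scales, packed) with a toy symmetric 4-bit scheme."""
    assert len(weights) % BLOCK == 0
    scales, packed = [], bytearray()
    for b in range(0, len(weights), BLOCK):
        block = weights[b:b + BLOCK]
        # Map the largest magnitude to 7; fall back to 1.0 for an all-zero block.
        scale = max(abs(w) for w in block) / 7.0 or 1.0
        scales.append(scale)
        # Signed values in [-8, 7] are stored with an offset of 8, giving 0..15.
        q = [max(-8, min(7, round(w / scale))) + 8 for w in block]
        for i in range(0, BLOCK, 2):
            packed.append(q[i] | (q[i + 1] << 4))
    return scales, bytes(packed)


def gemv_q4(rows, x):
    """Compute y = W @ x, decoding each nibble right where it is multiplied."""
    y = []
    for scales, packed in rows:
        acc = 0.0
        for j, byte in enumerate(packed):
            scale = scales[(2 * j) // BLOCK]
            lo = (byte & 0x0F) - 8  # weight at column 2*j
            hi = (byte >> 4) - 8    # weight at column 2*j + 1
            acc += scale * (lo * x[2 * j] + hi * x[2 * j + 1])
        y.append(acc)
    return y
```

The full-precision row is never rebuilt in memory: each packed byte is read once, expanded in registers, and consumed. In this toy layout a block of 32 weights occupies 16 bytes of nibbles plus one scale, against 64 bytes for the same block stored in 16-bit form.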
On top of that, BaseRT fuses operator sequences that other runtimes typically dispatch as separate kernels, both along the attention path and inside the feed-forward block. Each fusion removes a kernel launch and a global-memory round trip, which matters most during decode, where individual kernels are short-lived and launch overhead is proportionally large. Kernel and launch-geometry selection is hardware-adaptive, tuned per chip generation from M1 through M5.
4. Compute-bound prefill, done properly
Prompt processing is compute-bound (dominated by GEMM) rather than memory-bound, so BaseRT uses tiled GEMM kernels built on Metal's simdgroup matrix intrinsics, with tile geometry tuned to sequence length, plus chunked prefill to bound scratch memory. Attention uses a FlashAttention-style kernel with online softmax, so attention memory scales linearly with sequence length instead of quadratically.
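The online-softmax trick mentioned above is easy to show in isolation. The following is a generic single-query Python sketch of the technique, not BaseRT's kernel: it walks the keys one at a time rather than in tiles, and the function names are hypothetical. A two-pass reference is included for comparison.

```python
import math


def attention_online(q, keys, values):
    """Single-query attention using a running max and running normalizer."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, score)
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def attention_reference(q, keys, values):
    """Two-pass softmax that materializes every score, for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[d] for e, v in zip(exps, values)) / total
            for d in range(len(values[0]))]
```

The online version keeps only a maximum, a normalizer, and one output vector per query, so it never stores the full row of scores. Applied to every query in a prompt, that is what avoids materializing the sequence-by-sequence score matrix.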
Benchmarks
We benchmarked six models at Q4 and Q8 quantization: four dense models (Qwen3-0.6B, Llama-3.2-1B, Llama-3.2-3B, Gemma-4-E2B) and two mixture-of-experts models (Gemma-4-26B-A4B and Qwen3-30B-A3B, which appear at Q4 only in the decode table below). The baselines are llama.cpp and MLX on an M4 Pro (16 GPU cores, 24GB unified memory), with cross-generation results on an M3 base. We also compared against uzu, a native Metal runtime from Mirai.
Decode throughput (M4 Pro)
| Model | Quant | BaseRT | llama.cpp | vs. llama.cpp | MLX | vs. MLX |
|---|---|---|---|---|---|---|
| Qwen3 0.6B | Q4 | 464.5 | 297.4 | 1.56× | 343.6 | 1.35× |
| Qwen3 0.6B | Q8 | 321.2 | 219.8 | 1.46× | 255.3 | 1.26× |
| Llama 3.2 1B | Q4 | 295.4 | 230.4 | 1.28× | 257.8 | 1.15× |
| Llama 3.2 1B | Q8 | 183.8 | 160.7 | 1.14× | 159.2 | 1.15× |
| Llama 3.2 3B | Q4 | 117.3 | 102.4 | 1.15× | 112.1 | 1.05× |
| Llama 3.2 3B | Q8 | 70.9 | 65.1 | 1.09× | 65.5 | 1.08× |
| Gemma 4 E2B | Q4 | 127.7 | 107.0 | 1.19× | — | — |
| Gemma 4 E2B | Q8 | 84.5 | 59.5 | 1.42× | — | — |
| Gemma 4 26B-A4B | Q4 | 62.2 | 58.0 | 1.07× | 69.3 | 0.90× |
| Qwen3 30B-A3B | Q4 | 84.1 | 80.7 | 1.04× | 83.1 | 1.01× |
(tokens/sec; tg128 = 128 generated tokens; a dash means no MLX result is reported)
The pattern is clear: BaseRT's advantage is largest on the smaller dense models, where fixed per-token dispatch overhead is a bigger share of total latency, reaching 56% faster than llama.cpp on Qwen3-0.6B at Q4. It narrows on the larger MoE models as decode becomes memory-bandwidth-bound rather than dispatch-bound, which is what you'd expect once the bottleneck moves from software overhead to hardware limits. The one configuration where BaseRT trails is Gemma 4 26B-A4B at Q4, where it reaches 0.90× of MLX's throughput.
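One way to probe the fixed-overhead reading is to convert throughput into per-token latency and compare absolute savings rather than ratios. The short script below does that for three rows copied from the decode table; the helper names are hypothetical, and the interpretation is a back-of-the-envelope check rather than a profile.

```python
# (model, quant) -> (BaseRT, llama.cpp) decode tokens/sec, from the decode table.
DECODE_TPS = {
    ("Qwen3 0.6B", "Q4"): (464.5, 297.4),
    ("Llama 3.2 3B", "Q4"): (117.3, 102.4),
    ("Qwen3 30B-A3B", "Q4"): (84.1, 80.7),
}


def ms_per_token(tokens_per_sec):
    """Convert throughput to average latency per generated token."""
    return 1000.0 / tokens_per_sec


def summarize(tps):
    """Return (model, quant, speedup, milliseconds saved per token) rows."""
    rows = []
    for (model, quant), (basert, llamacpp) in tps.items():
        saved_ms = ms_per_token(llamacpp) - ms_per_token(basert)
        rows.append((model, quant, basert / llamacpp, saved_ms))
    return rows


for model, quant, speedup, saved_ms in summarize(DECODE_TPS):
    print(f"{model} {quant}: {speedup:.2f}x, {saved_ms:.2f} ms saved per token")
```

For Qwen3 0.6B and Llama 3.2 3B the ratios differ a lot (1.56× versus 1.15×), yet the absolute saving is about 1.2 ms per token in both cases. That is consistent with, though not proof of, a roughly fixed per-token cost being removed; the saving on Qwen3 30B-A3B is smaller, at about 0.5 ms.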
Prefill throughput
Prefill tells a more interesting story. On the smaller dense models, BaseRT, llama.cpp, and MLX all land within a few percent of each other. GEMM-bound prefill saturates the GPU's matmul units regardless of runtime, so there's less overhead to eliminate. On the mixture-of-experts models, though, BaseRT has a significant lead: up to 1.81× over llama.cpp and 1.78× over MLX on Qwen3-30B-A3B at a 128-token prompt, with the margin narrowing (but persisting) as prompts get longer.
Cross-generation consistency (M3 base)
The same pattern holds on the older M3 base chip (8 GPU cores, 100 GB/s bandwidth): BaseRT leads llama.cpp on decode across all eight tested configurations (1.13–1.34×) and MLX on all six supported configurations (1.01–1.22×), confirming the gains aren't specific to the M4 Pro chip.
Comparison with uzu
Against uzu, the closest architectural peer because it is also Metal-native, BaseRT leads decode on 5 of 6 configurations (up to 1.19× on Qwen3-0.6B), narrowing to near-parity on the larger Llama models. On prefill, uzu leads on most configurations, by 4–13% on Qwen3-0.6B. uzu routes part of its workload through MPSGraph, which can dispatch to the Apple Neural Engine for additional GEMM throughput; BaseRT's GPU-only path doesn't use the Neural Engine (yet).
Try it
BaseRT ships as a CLI, a C API, and language bindings, and is easily installed and run via:
# Install
curl -LsSf https://basecompute.co/install.sh | sh
# Pull a model straight from Hugging Face and convert it to BaseRT's .base format
basert pull Qwen/Qwen3-4B
# Chat with it
basert chat Qwen/Qwen3-4B
# Or serve an OpenAI-compatible API
API_KEY="$(uuidgen)"   # keep the key in a variable so the client can send the same one
basert serve Qwen/Qwen3-4B --api-key "$API_KEY" --port 8080
# Then, from a shell where API_KEY holds the same value:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3-4B","messages":[{"role":"user","content":"Hello!"}]}'
Or call it directly from Python:
import baseRT
model = baseRT.Model("models/your-model.base")
print(model.generate_text("The capital of France is", max_tokens=64))
The OpenAI-compatible server also supports embeddings, transcription, tool calls, continuous batching, paged-KV cache, and prefix caching, and BaseRT doubles as a local backend for the pi coding agent via the pi-basert extension.
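The chat-completions call shown with curl earlier can also be made from Python's standard library. The sketch below mirrors that request; the helper names are hypothetical, and the response parsing assumes the server returns the usual OpenAI chat-completions shape (`choices[0].message.content`), which is an assumption rather than something documented here.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build a POST to /v1/chat/completions, mirroring the curl call."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(base_url, api_key, model, prompt):
    """Send the request; requires a running `basert serve` instance."""
    request = build_chat_request(base_url, api_key, model, prompt)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumes the OpenAI chat-completions response shape.
    return payload["choices"][0]["message"]["content"]
```

With the server from the earlier example running, `chat("http://127.0.0.1:8080", api_key, "Qwen/Qwen3-4B", "Hello!")` would send the same request as the curl command, where `api_key` is the value passed to `--api-key`.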
Get involved
We'd love for you to try it on your own models, benchmark it against your current stack, and open issues or PRs for architectures you want supported.