LFM2.5-8B-A1B: Core AI (the zoo's first MoE on iPhone)
Apple Core AI (.aimodel) conversion of LiquidAI/LFM2.5-8B-A1B:
a conv + full-attention MoE hybrid decoder (24 layers = 18 short-conv mixers + 6 GQA attention;
hidden 2048, vocab 128k; first 2 layers dense, the rest 32-expert top-4 sparse MoE). 8.3B total /
~1.5B active per token.
Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo
(full card: zoo/lfm2.5-8b-a1b-moe.md).
The gather_qmm kernel
MoE decode normally reads all 32 experts' weights every token via the GatherMM composite, even
though only the top 4 are routed, which leaves decode bandwidth-bound at 39 tok/s. This bundle uses
a custom coreai_torch.TorchMetalKernel that takes the routed indices as a kernel input and reads
only the 4 routed experts' weight slabs: 3.6× faster (141 tok/s) at the same active-param bandwidth.
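
The difference between the two read patterns can be sketched in plain Python. This is an illustration only, not the Metal kernel or the coreai_torch API: the function names, the tiny dimensions, and the fixed routing indices and gate weights are all assumptions made for the example. It shows that indexing the weight slabs by the routed ids produces the same output as computing every expert and discarding the unrouted ones, while touching 4 slabs instead of 32.

```python
import random

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def moe_dense_read(experts, x, routed, gates):
    """Baseline: compute every expert, then keep only the routed outputs."""
    all_out = [matvec(w, x) for w in experts]      # touches all weight slabs
    out = [0.0] * len(all_out[0])
    for e, g in zip(routed, gates):
        out = [o + g * v for o, v in zip(out, all_out[e])]
    return out, len(experts)                        # (output, slabs read)

def moe_gathered_read(experts, x, routed, gates):
    """Gathered: index the slabs by the routed ids and read only those."""
    out = [0.0] * len(experts[0])
    for e, g in zip(routed, gates):
        y = matvec(experts[e], x)                   # touches one routed slab
        out = [o + g * v for o, v in zip(out, y)]
    return out, len(routed)                         # (output, slabs read)

# Toy sizes (hypothetical); only the 32-expert / top-4 shape comes from the card.
random.seed(0)
E, D_IN, D_OUT = 32, 8, 6
experts = [[[random.uniform(-1, 1) for _ in range(D_IN)]
            for _ in range(D_OUT)] for _ in range(E)]
x = [random.uniform(-1, 1) for _ in range(D_IN)]
routed = [3, 17, 21, 30]        # assumed router output for this token
gates = [0.4, 0.3, 0.2, 0.1]    # assumed gate weights

dense_out, dense_slabs = moe_dense_read(experts, x, routed, gates)
gath_out, gath_slabs = moe_gathered_read(experts, x, routed, gates)
print(dense_out == gath_out, dense_slabs, gath_slabs)
```

The 32-to-4 ratio here counts expert weight slabs only. The sketch models nothing else in the decode step, so it should not be read as a prediction of the measured end-to-end speedup reported above.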
Bundles & honest quality
Shipped here (Mac-only):
| dir | size | platform | decode tok/s | quality (fp32-oracle margin gate) |
|---|---|---|---|---|
| gpu-pipelined/lfm2_5_8b_a1b_decode_sym8_gather/ | 8.8 GB | Mac | 140 | CLEAN: +1 flip/41 (= fp16 ceiling) |
Honest bottom line. The sym8 (symmetric-linear int8) Mac bundle is both 3.6× faster and clean:
it sits at the fp16 ceiling, matching the shipped int8-linear quality. The kernel itself is
bit-exact; quality depends purely on the expert quantization scheme. An int4 bundle (4.7 GB) was
validated to run on the iPhone 17 Pro (~32 tok/s, the first MoE on the phone), but the iPhone needs
int4 for size, and non-QAT int4 is a hard quality wall: two independent 4-bit schemes both land at
~12 introduced flips/41 with large margins, and clean int4 would need QAT weights that LiquidAI
doesn't ship. So only the clean Mac bundle is shipped; rebuild the int4 variant locally if you want
the on-device version. On a bare prompt the base model itself greedy-degenerates into repetition
(present in fp16 too), so use the chat template + sampling.
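
For readers unfamiliar with the term, symmetric-linear int8 quantization can be sketched as below. This is a generic textbook form, not the conversion script's code: the per-row scale granularity, the clamp range of [-127, 127], and the function names are assumptions, since the card only names the scheme.

```python
def quantize_sym8(row):
    """Symmetric linear int8: one scale per row, zero-point fixed at 0."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # avoid dividing by zero
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def dequantize_sym8(q, scale):
    """Recover approximate float weights from int8 codes and the row scale."""
    return [qi * scale for qi in q]

row = [0.50, -1.27, 0.003, 0.9]        # made-up weights for illustration
q, scale = quantize_sym8(row)
restored = dequantize_sym8(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_err)
```

With round-to-nearest, the per-weight error is bounded by half a quantization step (scale / 2). A 4-bit code has far fewer levels for the same range, so its step is much coarser, which is consistent with the int4 results described above.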
Run
COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/lfm2_5_8b_a1b_decode_sym8_gather -p 128 -g 256 -n 3
The decode graph's input_ids is static [1,1]; prefill runs as S=1 pipelined steps. Convert your
own with conversion/export_lfm2_moe_metal_decode_pipelined.py
(sym8 = clean Mac; int4km = iPhone-compact, not shipped).
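
What "prefill runs as S=1 steps" implies for a caller can be sketched with a toy loop. Everything here is hypothetical: `decode_step` is a stand-in for one invocation of the static [1,1] graph, its arithmetic is a placeholder rather than a model, and the pipelining itself is not modeled. The point is only that the prompt is consumed one token per call through the same step used for generation.

```python
def decode_step(token_id, state):
    """Hypothetical stand-in for one call to the static [1,1] decode graph."""
    new_state = state + [token_id]                   # toy cache of tokens seen
    next_token = (sum(new_state) * 31 + len(new_state)) % 100  # placeholder
    return next_token, new_state

def generate(prompt_ids, n_new):
    """Prefill token by token, then decode n_new tokens with the same step."""
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")
    state, next_token = [], None
    for tok in prompt_ids:                           # prefill: one call per token
        next_token, state = decode_step(tok, state)
    out = []
    for _ in range(n_new):                           # decode: one call per token
        out.append(next_token)
        next_token, state = decode_step(next_token, state)
    return out, len(state)                           # (tokens, step calls made)

prompt = list(range(1, 129))                         # 128 made-up token ids
tokens, calls = generate(prompt, 256)
print(len(tokens), calls)
```

Under this scheme prefill cost grows linearly with prompt length at roughly the per-token decode cost, since no batched S>1 prefill graph is involved.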
License
LFM Open License v1.0 (upstream LiquidAI license, shipped as LICENSE). Conversion/kernel: community.