LFM2.5-8B-A1B — Core AI (the zoo's first MoE on iPhone)

Apple Core AI (.aimodel) conversion of LiquidAI/LFM2.5-8B-A1B: a conv + full-attention MoE hybrid decoder (24 layers = 18 short-conv mixers + 6 GQA attention; hidden 2048, vocab 128k; first 2 layers dense, the rest 32-expert top-4 sparse MoE). 8.3B total / ~1.5B active per token.
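To make the "32-expert top-4" routing concrete, here is a minimal sketch of how one token could pick its experts. The expert count and top-k come from the description above; the function name `route_top_k` and the choice to normalize gates with a softmax over the selected logits are assumptions for illustration, not the model's confirmed router.

```python
import math

NUM_EXPERTS = 32  # from the card
TOP_K = 4         # from the card


def route_top_k(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts for one token.

    Assumption: gate weights are a softmax over the k selected logits.
    The real router's normalization may differ.
    """
    # Indices of the k largest logits, highest first.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    ids = order[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in ids)
    exps = [math.exp(router_logits[i] - peak) for i in ids]
    total = sum(exps)
    return ids, [e / total for e in exps]


# Toy router output: four experts clearly ahead of the rest.
logits = [0.0] * NUM_EXPERTS
logits[5], logits[7], logits[1], logits[30] = 3.0, 2.0, 1.0, 0.5
expert_ids, gates = route_top_k(logits)
```

With the first 2 of the 24 layers dense, a step like this would run in the remaining 22 layers, each activating 4 of its 32 experts.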

Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo (full card: zoo/lfm2.5-8b-a1b-moe.md).

The `gather_qmm` kernel

MoE decode normally reads the weights of all 32 experts for every token through the GatherMM composite, even though only the top 4 are routed. That leaves decode bandwidth-bound at 39 tok/s. This bundle instead uses a custom `coreai_torch.TorchMetalKernel` that takes the routed indices as a kernel input and reads only the 4 routed experts' weight slabs. The result is 3.6× faster decode (141 tok/s) at the same active-param bandwidth.
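The difference between the two access patterns can be sketched in plain Python. This is a toy model of the idea, not the Metal kernel: each expert is reduced to a single small matrix (real experts are MLPs with quantized weights), and the names `moe_all_experts` and `moe_routed_only` are hypothetical.

```python
def matvec(w, x):
    """Multiply a [rows][cols] matrix by a length-cols vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_all_experts(x, expert_weights, expert_ids, gates):
    """Baseline: visit every expert's slab, zero-weighting the unrouted ones."""
    gate_of = dict(zip(expert_ids, gates))
    out = [0.0] * len(expert_weights[0])
    slabs_read = 0
    for e, w in enumerate(expert_weights):
        slabs_read += 1  # this slab is fetched whether or not it is routed
        g = gate_of.get(e, 0.0)
        y = matvec(w, x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, slabs_read


def moe_routed_only(x, expert_weights, expert_ids, gates):
    """Gathered: index straight into the routed experts' slabs."""
    out = [0.0] * len(expert_weights[0])
    slabs_read = 0
    for e, g in zip(expert_ids, gates):
        slabs_read += 1  # only routed slabs are touched
        y = matvec(expert_weights[e], x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, slabs_read


# Toy sizes: 32 experts, each a 3x2 matrix.
experts = [[[float(e + r + c) for c in range(2)] for r in range(3)]
           for e in range(32)]
x = [1.0, -2.0]
ids, gates = [5, 7, 1, 30], [0.4, 0.3, 0.2, 0.1]

dense_out, dense_reads = moe_all_experts(x, experts, ids, gates)
routed_out, routed_reads = moe_routed_only(x, experts, ids, gates)
```

Both paths produce the same output, but the gathered path touches 4 slabs instead of 32. The toy's 8× reduction in slab reads is larger than the 3.6× end-to-end figure reported here, presumably because a real decode step also includes work that this sketch leaves out.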

Bundles & honest quality

Shipped here (Mac-only):

| Directory | Size | Platform | Decode (tok/s) | Quality (fp32-oracle margin gate) |
|---|---|---|---|---|
| `gpu-pipelined/lfm2_5_8b_a1b_decode_sym8_gather/` | 8.8 GB | Mac | 140 | CLEAN: +1 flip/41 (= fp16 ceiling) ✅ |
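One plausible reading of the quality column is sketched below: a "flip" is a position where the quantized model's greedy token differs from the fp32 oracle's, and the "margin" is the oracle's gap between its top two logits at that position. Both definitions and the function names are assumptions made for illustration; the card does not spell out the exact gate.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top2_margin(values):
    """Gap between the largest and second-largest value."""
    ranked = sorted(values, reverse=True)
    return ranked[0] - ranked[1]


def introduced_flips(oracle_logits, candidate_logits):
    """Return [(position, oracle_margin)] where the greedy tokens disagree."""
    flips = []
    for pos, (ref, cand) in enumerate(zip(oracle_logits, candidate_logits)):
        if argmax(ref) != argmax(cand):
            flips.append((pos, top2_margin(ref)))
    return flips


# Toy logits over a 3-token vocabulary at 3 positions.
oracle = [[2.0, 1.0, 0.0], [0.5, 0.6, 0.0], [0.0, 3.0, 1.0]]
candidate = [[1.9, 1.1, 0.0], [0.6, 0.5, 0.0], [0.0, 2.8, 1.1]]
flips = introduced_flips(oracle, candidate)
```

Under this reading, a flip at a small oracle margin is a near-tie that any reduced-precision run might lose, while flips at large margins indicate that quantization changed a confident prediction.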

Honest bottom line. The sym8 (symmetric-linear int8) Mac bundle is both 3.6× faster and clean: it sits at the fp16 ceiling, matching the shipped int8-linear quality. The kernel itself is bit-exact, so quality depends purely on the expert quantization scheme.

An int4 bundle (4.7 GB) was validated to run on the iPhone 17 Pro at ~32 tok/s, the first MoE on the phone. However, the iPhone needs int4 for size, and non-QAT int4 is a hard quality wall: two independent 4-bit schemes both land at ~12 introduced flips/41 with large margins, and clean int4 would need QAT weights that LiquidAI doesn't ship. So only the clean Mac bundle is shipped; rebuild the int4 variant locally if you want the on-device version.

On a bare prompt the base model itself degenerates into repetition under greedy decoding (this happens in fp16 too), so use the chat template and sampling.
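For readers unfamiliar with symmetric-linear int8, the sketch below shows the general technique: one scale per group of weights, no zero point, values rounded into [-127, 127]. The per-row granularity and the function names are assumptions; the bundle's actual grouping and storage layout are not described here.

```python
def quantize_sym8(row):
    """Symmetric int8: map [-max_abs, max_abs] onto [-127, 127].

    Assumption: one scale per row. Real schemes may use per-tensor or
    per-group scales instead.
    """
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to nearest integer and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def dequantize_sym8(q, scale):
    """Recover approximate floats; there is no zero point to subtract."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_sym8(weights)
restored = dequantize_sym8(q, scale)
```

Each restored value is within half a quantization step of the original. The same construction with 4 bits leaves only a handful of levels per scale, which is the usual intuition for why int4 without QAT degrades more than int8.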

Run

COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/lfm2_5_8b_a1b_decode_sym8_gather -p 128 -g 256 -n 3

The decode graph's input_ids is static [1,1]; prefill runs as S=1 pipelined steps. Convert your own with conversion/export_lfm2_moe_metal_decode_pipelined.py (sym8 = clean Mac; int4km = iPhone-compact, not shipped).
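Because the graph only accepts a `[1,1]` input, a driver loop has to feed the prompt one token at a time before it can start generating. The sketch below shows that control flow only. `decode_step` is a hypothetical callable standing in for one invocation of the decode graph, and the pipelining of steps is not modeled.

```python
def prefill_then_decode(prompt_ids, decode_step, state, max_new_tokens):
    """Run prefill as single-token steps, then generate greedily.

    decode_step(token_id, state) -> (next_token_id, state) is a
    hypothetical stand-in for one [1,1] call into the decode graph.
    """
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")

    next_id = None
    for tok in prompt_ids:
        # Prefill: each prompt token is its own S=1 step; only the
        # prediction after the last prompt token is kept.
        next_id, state = decode_step(tok, state)

    generated = []
    while len(generated) < max_new_tokens:
        generated.append(next_id)
        if len(generated) < max_new_tokens:
            next_id, state = decode_step(next_id, state)
    return generated, state


def toy_step(token_id, steps_so_far):
    """Toy model: predicts token_id + 1 (mod 10) and counts calls."""
    return (token_id + 1) % 10, steps_so_far + 1


tokens, steps = prefill_then_decode([1, 2, 3], toy_step, 0, 3)
```

A prompt of N tokens therefore costs N graph calls before the first generated token, which is worth keeping in mind when reading prefill numbers from the benchmark.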

License

LFM Open License v1.0 (upstream LiquidAI license, shipped as LICENSE). Conversion/kernel: community.
