LFM2.5-8B-A1B: Core AI (the zoo's first MoE on iPhone)

Apple Core AI (.aimodel) conversion of LiquidAI/LFM2.5-8B-A1B: a conv + full-attention MoE hybrid decoder (24 layers = 18 short-conv mixers + 6 GQA attention; hidden 2048, vocab 128k; first 2 layers dense, the rest 32-expert top-4 sparse MoE). 8.3B total / ~1.5B active per token.

Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo (full card: zoo/lfm2.5-8b-a1b-moe.md).

The gather_qmm kernel

MoE decode normally reads all 32 experts' weights every token via the GatherMM composite, even though only the top 4 are routed, which leaves it bandwidth-bound at 39 tok/s. This bundle uses a custom coreai_torch.TorchMetalKernel that takes the routed indices as a kernel input and reads only the 4 routed experts' weight slabs. The result is 3.6× faster decode (141 tok/s) at the same active-param bandwidth.
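The difference between reading every expert and reading only the routed ones can be sketched in plain Python. This is a toy reference, not the Metal kernel: the function names, the tiny 2x3 expert matrices, the uniform gate weights, and the synthetic router logits are all assumptions made for illustration, and nothing here is quantized.

```python
# Toy reference for a gathered expert matmul. All names are hypothetical;
# the real bundle does this inside a Metal kernel on quantized weights.

def route_top_k(router_logits, k):
    """Return the indices of the k largest router logits, highest first."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]

def matvec(weight, x):
    """weight is [out][in], x is [in]; returns [out]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def moe_dense_read(experts, x, routed, gates):
    """Touch every expert's weights, then keep only the routed outputs."""
    outputs = [matvec(w, x) for w in experts]   # reads all expert slabs
    slabs_read = len(experts)
    y = [0.0] * len(outputs[0])
    for e, g in zip(routed, gates):
        y = [acc + g * o for acc, o in zip(y, outputs[e])]
    return y, slabs_read

def moe_gathered_read(experts, x, routed, gates):
    """Index the expert table with the routed ids and touch only those slabs."""
    y = [0.0] * len(experts[0])
    slabs_read = 0
    for e, g in zip(routed, gates):
        out = matvec(experts[e], x)             # reads one routed slab
        slabs_read += 1
        y = [acc + g * o for acc, o in zip(y, out)]
    return y, slabs_read

NUM_EXPERTS, TOP_K = 32, 4
experts = [[[(e + 1) * 0.01 * (r + c + 1) for c in range(3)] for r in range(2)]
           for e in range(NUM_EXPERTS)]
x = [1.0, -2.0, 0.5]
router_logits = [((7 * e) % 32) / 32.0 for e in range(NUM_EXPERTS)]  # synthetic

routed = route_top_k(router_logits, TOP_K)
gates = [1.0 / TOP_K] * TOP_K                   # assumed uniform gate weights
y_dense, dense_reads = moe_dense_read(experts, x, routed, gates)
y_gather, gather_reads = moe_gathered_read(experts, x, routed, gates)

print(dense_reads, gather_reads)
print(y_dense == y_gather)
```

Both paths produce the same output here because they run the same arithmetic in the same order; only the number of expert slabs touched differs (32 versus 4 in this toy). The sketch does not attempt to model the measured 3.6× end-to-end figure.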

Bundles & honest quality

Shipped here (Mac-only):

| dir | size | platform | decode tok/s | quality (fp32-oracle margin gate) |
| --- | --- | --- | --- | --- |
| gpu-pipelined/lfm2_5_8b_a1b_decode_sym8_gather/ | 8.8 GB | Mac | 140 | CLEAN: +1 flip/41 (= fp16 ceiling) ✅ |
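The card does not spell out how the gate is computed, so the following is only one plausible reading: a "flip" is a position where the candidate model's greedy token differs from the fp32 oracle's, and the "margin" is the oracle's gap between its top two logits at that position. The function names and the toy logits are hypothetical.

```python
# Assumed interpretation of a flip/margin check; not the zoo's actual gate.

def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def top2_margin(logits):
    """Gap between the largest and second-largest logit."""
    ordered = sorted(logits, reverse=True)
    return ordered[0] - ordered[1]

def count_flips(oracle_logits, candidate_logits):
    """Return (position, oracle_margin) wherever the greedy tokens disagree."""
    flips = []
    for pos, (ref, cand) in enumerate(zip(oracle_logits, candidate_logits)):
        if argmax(ref) != argmax(cand):
            flips.append((pos, top2_margin(ref)))
    return flips

# Toy data: 3 positions over a 3-token vocabulary.
oracle = [[2.0, 1.0, 0.0], [0.5, 0.625, 0.0], [0.0, 3.0, 1.0]]
candidate = [[1.9, 1.1, 0.0], [0.6, 0.55, 0.0], [0.1, 2.8, 1.2]]

flips = count_flips(oracle, candidate)
print(flips)
```

Under this reading, a flip at a small oracle margin is a near-tie that any reduced-precision run might lose, while a flip at a large margin points to a real quantization error.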

Honest bottom line. The sym8 (symmetric-linear int8) Mac bundle is both 3.6× faster and clean: it sits at the fp16 ceiling, matching the shipped int8-linear quality. The kernel itself is bit-exact; quality depends purely on the expert quantization scheme. An int4 bundle (4.7 GB) was validated to run on the iPhone 17 Pro (~32 tok/s, the first MoE on the phone), but the iPhone needs int4 for size, and non-QAT int4 is a hard quality wall: two independent 4-bit schemes both land at ~12 introduced flips/41 with large margins, and clean int4 would need QAT weights that LiquidAI doesn't ship. So only the clean Mac bundle is shipped; rebuild the int4 variant locally if you want the on-device version. On a bare prompt the base model itself greedy-degenerates into repetition (present in fp16 too), so use the chat template plus sampling.
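For readers unfamiliar with the term, symmetric-linear int8 quantization can be sketched as follows. This is the generic textbook scheme with one scale per weight row; the granularity the bundle actually uses (per tensor, per channel, or per group) is not stated here, so the per-row choice and the function names are assumptions.

```python
# Generic symmetric int8 quantization, one scale per row (assumed granularity).

def quantize_sym8(row):
    """Map floats to integers in [-127, 127] with no zero point."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def dequantize_sym8(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [v * scale for v in q]

row = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_sym8(row)
restored = dequantize_sym8(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))

print(q)
print(max_error <= scale / 2 + 1e-12)
```

Because the grid is symmetric around zero, an exact 0.0 weight stays exactly 0 after the round trip, and the rounding error per weight is bounded by half a quantization step.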

Run

COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/lfm2_5_8b_a1b_decode_sym8_gather -p 128 -g 256 -n 3

The decode graph's input_ids is static [1,1]; prefill runs as S=1 pipelined steps. Convert your own with conversion/export_lfm2_moe_metal_decode_pipelined.py (sym8 = clean Mac; int4km = iPhone-compact, not shipped).
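A static [1,1] input means the prompt is consumed one token per call, and generation continues through the same step function. The loop below is a minimal sketch of that control flow only: `toy_decode_step` is a hypothetical stand-in for a single decode call, its state is a plain list rather than a real cache, and the pipelining itself is not modeled. Greedy selection is used purely to keep the toy deterministic; the card recommends sampling for the real model.

```python
# Control-flow sketch of prefill as S=1 steps. The step function is a toy.

def toy_decode_step(token_id, state):
    """Hypothetical single-token call: returns (logits, new_state)."""
    new_state = state + [token_id]
    vocab = 8
    # Toy rule: favor (sum of all tokens seen so far) mod vocab.
    target = sum(new_state) % vocab
    logits = [1.0 if i == target else 0.0 for i in range(vocab)]
    return logits, new_state

def prefill_then_generate(prompt_ids, num_new_tokens, step=toy_decode_step):
    """Feed the prompt one token at a time, then generate greedily."""
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")
    state, logits = [], None
    for tok in prompt_ids:                 # prefill: one S=1 step per token
        logits, state = step(tok, state)
    generated = []
    for _ in range(num_new_tokens):        # decode: same step, one token each
        next_tok = max(range(len(logits)), key=lambda i: logits[i])
        generated.append(next_tok)
        logits, state = step(next_tok, state)
    return generated, len(state)

generated, steps_run = prefill_then_generate([1, 2, 3], 3)
print(generated, steps_run)
```

With this structure a prompt of P tokens and G generated tokens costs P + G step calls, which is why prefill length shows up directly in the benchmark's `-p` setting.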

License

LFM Open License v1.0 (upstream LiquidAI license, shipped as LICENSE). Conversion/kernel: community.
