Gemma 4 E4B (text) — Apple Core AI (`.aimodel`)

Gemma 4 E4B's text decoder, converted to Apple's Core AI (the Core ML successor announced at WWDC26). It runs on iOS 27 / macOS 27 via Apple's coreai-pipelined GPU engine with zero custom kernels. The greedy oracle is 8/8 exact against the fp32 Hugging Face reference on both the Mac GPU and the iPhone GPU, and the iPhone is 24/24 token-identical to the Mac on the determinism probe.

Converted directly from Google's official QAT release, google/gemma-4-E4B-it-qat-q4_0-unquantized. The checkpoint holds bf16 weights trained for q4_0 rounding, and q4_0 (per-block-32 absmax linear int4) is the quantization class this bundle uses. Google publishes these checkpoints as "preserving similar quality to bfloat16", so the int4 conversion inherits that property by design rather than through post-hoc gating.
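To make "per-block-32 absmax linear int4" concrete, here is a minimal pure-Python sketch of that class of quantizer. It is an illustration only: the symmetric grid (`scale = absmax / 7`, codes in [-7, 7]) is an assumption, and the function names are hypothetical. Real q4_0 implementations and the zoo's exporter may differ in rounding and code range.

```python
from typing import List, Tuple

BLOCK = 32  # block size named in the card ("per-block-32")


def quantize_block(xs: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block to one float scale plus signed 4-bit codes."""
    absmax = max(abs(x) for x in xs)
    if absmax == 0.0:
        return 0.0, [0] * len(xs)
    # Assumption: symmetric linear grid, so the largest magnitude maps to +/-7.
    scale = absmax / 7.0
    codes = [max(-7, min(7, round(x / scale))) for x in xs]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Reconstruct approximate weights from a scale and its codes."""
    return [scale * q for q in codes]


def quantize(weights: List[float]) -> List[Tuple[float, List[int]]]:
    """Split a flat weight list into blocks of 32 and quantize each one."""
    if len(weights) % BLOCK != 0:
        raise ValueError("weight count must be a multiple of the block size")
    return [quantize_block(weights[i:i + BLOCK])
            for i in range(0, len(weights), BLOCK)]
```

Under this scheme the per-weight reconstruction error is at most half a scale step, which is the rounding that QAT weights are trained to tolerate.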

Requires the iOS 27 / macOS 27 beta. The conversion code, knowledge base, and engine patch stack live in coreai-model-zoo; the model card is zoo/gemma4-e4b.md.

Measured (greedy; M4 Max / iPhone 17 Pro, settled device)

| config | files | size | M4 Max decode / prefill | iPhone decode / prefill |
|---|---|---|---|---|
| ★ provider (runs on both platforms) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin/` + `ios-frontend/gemma4_e4b_qat_gather_raw/` | 3.7 + 3.4 GB | 53.2 / 62.6 | 15.1 / 21.3 |
| ★ provider, iPhone-ready AOT | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/` (precompiled `.aimodelc`, h18p = iPhone 17 Pro class only) + the same tables | 3.7 + 3.4 GB | n/a | same as above (skip the AOT step) |
| tbl (Mac-fastest) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/` + the two `embed_per_layer.*` table files | 3.7 + 2.7 GB | 55.8 / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |

On iPhone the working set stays small: the measured peak footprint is 2.2 GB, leaving 4.2 GB of headroom under the ~6.4 GB entitled limit, because the PLE (per-layer embedding) table is served from a clean mmap and the AOT executable pages are evictable. Decode and prefill both match the bandwidth model (~2.1 GB of int4 data read per token).
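The card does not spell out its bandwidth model, so the following is only one plausible reading: a roofline-style estimate in which every generated token streams the ~2.1 GB of int4 data once, making throughput equal to effective memory bandwidth divided by bytes per token. The function names are hypothetical, and treating the table's decode figures as tokens per second is an assumption.

```python
GB_PER_TOKEN = 2.1  # from the card: ~2.1 GB int4 per token


def predicted_tokens_per_s(bandwidth_gb_s: float,
                           gb_per_token: float = GB_PER_TOKEN) -> float:
    """Throughput ceiling if each token must read gb_per_token from memory."""
    return bandwidth_gb_s / gb_per_token


def implied_bandwidth_gb_s(tokens_per_s: float,
                           gb_per_token: float = GB_PER_TOKEN) -> float:
    """Effective bandwidth a measured throughput would correspond to."""
    return tokens_per_s * gb_per_token


if __name__ == "__main__":
    # Decode figures from the table above, assumed to be tokens per second.
    for device, rate in (("M4 Max", 53.2), ("iPhone 17 Pro", 15.1)):
        print(f"{device}: {implied_bandwidth_gb_s(rate):.1f} GB/s effective")
```

The implied figures are derived from the table rather than measured, so they are best read as a sanity check against a device's known memory bandwidth.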

What E4B is (config + checkpoint verified)

E4B is a plain dense model with no MoE: 42 layers (full attention on every 6th), hidden size 2560, a uniform intermediate size of 10240, 8 query heads and 2 KV heads, dual head_dim of 256/512, and 18 KV-shared layers (the engine bundle stacks the 24 non-shared layers into one unified, padded KV pair). It uses per-layer embeddings; the [262144, 10752] int8 table ships in ios-frontend/gemma4_e4b_qat_gather_raw/. The final-logit softcap is 30. The QAT checkpoint prunes the never-used KV projections on the shared layers, and the zoo's loader handles both layouts.
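The numbers above fit together in a few ways that are easy to check. The sketch below collects them in a small dataclass; the class and field names are hypothetical and are not the Hugging Face config keys. The softcap function uses the tanh form familiar from earlier Gemma releases, which is an assumption here rather than something the card states.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class E4BTextConfig:
    """Illustrative container for the figures quoted in the card."""
    num_layers: int = 42
    full_attention_every: int = 6
    hidden_size: int = 2560
    intermediate_size: int = 10240
    num_query_heads: int = 8
    num_kv_heads: int = 2
    num_kv_shared_layers: int = 18
    ple_table_rows: int = 262144
    ple_table_width: int = 10752
    final_logit_softcap: float = 30.0

    @property
    def num_full_attention_layers(self) -> int:
        return self.num_layers // self.full_attention_every

    @property
    def num_kv_owning_layers(self) -> int:
        # Layers that keep their own KV cache (the "non-shared" layers).
        return self.num_layers - self.num_kv_shared_layers

    @property
    def ple_dim_per_layer(self) -> int:
        # Each table row holds one slice per layer.
        return self.ple_table_width // self.num_layers


def softcap(logit: float, cap: float = 30.0) -> float:
    """Assumed tanh soft cap: squashes logits smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)


cfg = E4BTextConfig()
print(cfg.num_full_attention_layers, cfg.num_kv_owning_layers,
      cfg.ple_dim_per_layer)
```

The derived values line up with the rest of the card: 42 - 18 gives the 24 non-shared layers, and 10752 / 42 gives the 256-wide per-layer slice that appears in the `ple_tokens [1,1,42,256]` input.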

Run contract (each item is load-bearing)

For the full story and the known traps, see the pipelined-engine page.

- Swift stack: apple/coreai-models plus the zoo's patch stack (apps/*.patch, applied in order). The ★ provider bundle needs EngineOptions.perTokenInputProvider (coreai-pipelined-per-token-inputs.patch); the tbl bundle needs EngineOptions.staticInputBuffers (coreai-pipelined-static-inputs.patch).
- Provider mode: for each token, fill ple_tokens [1,1,42,256] (fp16) from the table dump, where row = i8[id] * scale[id] * sqrt(256); the mmap gather takes ~0.1 ms. tbl mode: bind ple_table to embed_per_layer.i8 and ple_scale to embed_per_layer.scale.f32 as owned storageModeShared MTLBuffers (the buffer-backing traps are covered on the knowledge page).
- Set COREAI_CHUNK_THRESHOLD=1 before creating the engine, and never call engine.warmup(): the graph is S=1, so a 1-token generate after load serves as the warmup.
- iPhone: AOT is mandatory, because the graph's 3.7 GB of constants crashes the on-device specializer. Use the precompiled _aotc_h18p/ bundle, or run `xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu --architecture h18p --expect-frequent-reshapes` and point metadata.json's assets.main at the resulting .aimodelc. Ship the com.apple.developer.kernel.increased-memory-limit entitlement as headroom insurance, and benchmark on a settled device (a just-unlocked iPhone under-reads by ~35%).
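The provider-mode gather (row = i8[id] * scale[id] * sqrt(256)) can be sketched in Python to show the arithmetic, even though the shipping path is Swift. This is a sketch under stated assumptions: the int8 table is row-major with one 42 x 256 row per token id, the scale file holds one little-endian float32 per row, and the function names are hypothetical.

```python
import math
import mmap
import struct


def open_readonly(path: str) -> mmap.mmap:
    """Map a table file read-only so rows are paged in on demand."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def gather_ple_row(table, scales, token_id: int,
                   num_layers: int = 42, dim: int = 256,
                   gain: float = math.sqrt(256)) -> bytes:
    """Return one dequantized row as packed little-endian fp16 bytes.

    `table` and `scales` may be mmap objects or plain bytes.
    Assumed layout: signed int8 rows of num_layers * dim values,
    plus one float32 scale per row.
    """
    width = num_layers * dim
    (scale,) = struct.unpack_from("<f", scales, 4 * token_id)
    raw = table[token_id * width:(token_id + 1) * width]
    codes = struct.unpack(f"<{width}b", raw)
    mult = scale * gain
    # The "e" format is IEEE half precision; it raises OverflowError
    # if a value does not fit, which flags a bad scale early.
    return struct.pack(f"<{width}e", *(c * mult for c in codes))
```

A caller would map embed_per_layer.i8 and embed_per_layer.scale.f32 once with `open_readonly`, then call `gather_ple_row` per token and copy the 42 x 256 fp16 values into the ple_tokens input. Only the touched row is read, which is consistent with the small footprint the card reports for the mmap path.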

To reproduce from scratch, run conversion/export_gemma4_decode_pipelined.py with --hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized. The oracle and the tables are derived from the checkpoint, so regenerate them for any new weights.

License

Gemma is provided under and subject to the Gemma Terms of Use (https://ai.google.dev/gemma/terms). These .aimodel bundles are Model Derivatives of google/gemma-4-E4B-it-qat-q4_0-unquantized; by downloading or using them you agree to those terms, including the Gemma Prohibited Use Policy.

Sibling repo (E2B, incl. its own official-QAT bundles): gemma-4-E2B-CoreAI.

