Gemma 4 E4B (text) → Apple Core AI (.aimodel)
Gemma 4 E4B's text decoder converted to Apple's Core AI (the Core ML successor announced
at WWDC26), running on iOS 27 / macOS 27 via Apple's coreai-pipelined GPU engine with zero
custom kernels. The greedy oracle is 8/8 exact vs the fp32 Hugging Face reference on the Mac GPU and
the iPhone GPU (iPhone is 24/24 token-identical to the Mac on the determinism probe).
Converted directly from Google's official QAT release google/gemma-4-E4B-it-qat-q4_0-unquantized: bf16 weights trained for q4_0 rounding, and q4_0 is this bundle's quantization class (per-block-32 absmax linear int4). Google publishes these checkpoints as "preserving similar quality to bfloat16", so this int4 conversion is meant to inherit that quality claim by design, not by post-hoc gating.
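To make "per-block-32 absmax linear int4" concrete, here is a minimal sketch in plain Python. It assumes a symmetric mapping where the block's absolute maximum lands on code 7; the exact scale and rounding convention used by the bundle's exporter is not stated here, so treat the details as illustrative only. The function names are hypothetical.

```python
BLOCK = 32  # block size named in the text ("per-block-32")


def quantize_block(values):
    """Quantize one block of floats to signed int4 codes plus one float scale."""
    absmax = max(abs(v) for v in values)
    # Assumption: absmax maps to code 7 (symmetric). Real exporters may differ.
    scale = absmax / 7.0 if absmax > 0.0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Linear dequantization: one multiply per weight."""
    return [scale * c for c in codes]


def quantize(weights):
    """Split a flat weight list into blocks of 32 and quantize each one."""
    if len(weights) % BLOCK != 0:
        raise ValueError("weight count must be a multiple of the block size")
    return [quantize_block(weights[i:i + BLOCK])
            for i in range(0, len(weights), BLOCK)]


if __name__ == "__main__":
    demo = [((i * 37) % 64 - 32) / 10.0 for i in range(64)]
    for scale, codes in quantize(demo):
        print(scale, codes[:8])
```

Under this convention the reconstruction error of any weight is at most half a quantization step (scale / 2), which is the rounding that QAT weights are trained to tolerate.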
Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, engine patch stack: coreai-model-zoo; model card:
zoo/gemma4-e4b.md.
Measured (greedy; M4 Max / iPhone 17 Pro, settled device)
| config | files | size | M4 Max decode / prefill | iPhone decode / prefill |
|---|---|---|---|---|
| provider (runs BOTH platforms) | gpu-pipelined/gemma4_e4b_qat_decode_int4lin/ + ios-frontend/gemma4_e4b_qat_gather_raw/ | 3.7 + 3.4 GB | 53.2 / 62.6 | 15.1 / 21.3 |
| provider, iPhone-ready AOT | gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/ (precompiled .aimodelc, h18p = iPhone 17 Pro class only) + the same tables | 3.7 + 3.4 GB | - | same as above; skip the AOT step |
| tbl (Mac-fastest) | gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/ + the two embed_per_layer.* table files | 3.7 + 2.7 GB | 55.8 / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |
On iPhone the working set stays tiny: measured peak footprint 2.2 GB (4.2 GB headroom). The PLE table rides as a clean mmap and the AOT executable pages are evictable. Both phases land exactly on the bandwidth model (~2.1 GB int4/token).
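The memory figures above reduce to simple subtraction. The sketch below restates them in integer megabytes to avoid floating-point noise; it uses only the rounded numbers quoted in this card, and the helper name is hypothetical.

```python
# All sizes in MB, taken from the rounded GB figures quoted in this card.
ENTITLED_LIMIT_MB = 6400  # the "~6.4 GB entitled limit" (approximate)


def headroom_mb(resident_mb, limit_mb=ENTITLED_LIMIT_MB):
    """Memory left under the limit; zero or less leaves no room to run."""
    return limit_mb - resident_mb


# Provider config on iPhone: measured peak footprint of 2.2 GB.
provider_headroom = headroom_mb(2200)

# tbl config: 3.7 GB graph plus 2.7 GB of owned (non-evictable) tables.
tbl_headroom = headroom_mb(3700 + 2700)

if __name__ == "__main__":
    print("provider headroom (MB):", provider_headroom)
    print("tbl headroom (MB):", tbl_headroom)
```

With the rounded figures, the tbl configuration consumes the whole limit before any KV cache or activations are allocated, which is consistent with the table marking it as not viable on iPhone.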
What E4B is (config + checkpoint verified)
Clean dense model, no MoE. 42 layers (full attention every 6th), hidden 2560,
intermediate 10240 uniform, 8 query heads / 2 KV heads, dual head_dim 256/512, 18
KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded
KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in
ios-frontend/gemma4_e4b_qat_gather_raw/), final-logit softcap 30. The QAT checkpoint prunes
the never-used KV projections on the shared layers; the zoo's loader handles both layouts.
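The numbers in this section are mutually consistent, which the following sketch checks. Two points are assumptions rather than facts from this card: that "every 6th" means layers 6, 12, ..., 42 counted from 1, and that the softcap takes the usual `cap * tanh(x / cap)` form. The per-layer width of 256 comes from the `ple_tokens [1,1,42,256]` shape in the run contract.

```python
import math

NUM_LAYERS = 42
FULL_ATTENTION_EVERY = 6
KV_SHARED_LAYERS = 18
VOCAB_ROWS = 262144
PLE_DIM = 256      # per-layer embedding width
SOFTCAP = 30.0     # final-logit softcap

# Assumption: 1-based layers 6, 12, ..., 42 use full attention.
full_attention_layers = [
    i for i in range(NUM_LAYERS) if (i + 1) % FULL_ATTENTION_EVERY == 0
]

# Layers that own a KV cache (the ones stacked into the unified KV pair).
non_shared_layers = NUM_LAYERS - KV_SHARED_LAYERS

# One PLE table row holds a 256-wide vector for each of the 42 layers.
ple_row_width = NUM_LAYERS * PLE_DIM
ple_table_bytes = VOCAB_ROWS * ple_row_width  # int8: one byte per element


def softcap(logit, cap=SOFTCAP):
    """Squash a logit smoothly into (-cap, cap). Assumed tanh form."""
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    print(len(full_attention_layers), non_shared_layers)
    print(ple_row_width, ple_table_bytes)
    print(softcap(5.0), softcap(500.0))
```

The row width works out to 10752, matching the `[262144, 10752]` table shape, and the int8 table comes to roughly 2.8 billion bytes before scales are counted.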
Run contract (each item is load-bearing)
Full story + traps: pipelined-engine page.
- Swift stack = apple/coreai-models + the zoo's patch stack (apps/*.patch, in order). The provider bundle needs EngineOptions.perTokenInputProvider (coreai-pipelined-per-token-inputs.patch); the tbl bundle needs EngineOptions.staticInputBuffers (coreai-pipelined-static-inputs.patch).
- Provider mode: per token, fill ple_tokens [1,1,42,256] fp16 from the table dump: row = i8[id] * scale[id] * sqrt(256), mmap-gathered (~0.1 ms). tbl mode: bind ple_table → embed_per_layer.i8 and ple_scale → embed_per_layer.scale.f32 as OWNED storageModeShared MTLBuffers (buffer-backing traps in the knowledge page).
- COREAI_CHUNK_THRESHOLD=1 before engine creation; never call engine.warmup() (S=1 graph; a 1-token generate after load is the warmup).
- iPhone: AOT is mandatory (the 3.7 GB-constants graph crashes the on-device specializer). Use the precompiled _aotc_h18p/ bundle, or run xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu --architecture h18p --expect-frequent-reshapes and point metadata.json's assets.main at the .aimodelc. Ship the com.apple.developer.kernel.increased-memory-limit entitlement as headroom insurance, and bench a settled device (a just-unlocked iPhone under-reads ~35%).
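The provider-mode gather in the list above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the Swift code path: it assumes the int8 table is stored row-major as [vocab, 42 * 256], that the scales are one little-endian float32 per vocabulary row, and it builds a tiny throwaway table in a temp directory instead of reading the real dump. All function and file names are hypothetical.

```python
import math
import mmap
import os
import struct
import tempfile

NUM_LAYERS, PLE_DIM = 42, 256
ROW = NUM_LAYERS * PLE_DIM  # 10752 int8 values per token


def open_readonly_mmap(path):
    """Map a file read-only so only the touched pages become resident."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def gather_ple_row(table_mm, scale_mm, token_id):
    """Return one token's row as packed fp16: i8[id] * scale[id] * sqrt(256)."""
    raw = table_mm[token_id * ROW:(token_id + 1) * ROW]
    (scale,) = struct.unpack_from("<f", scale_mm, token_id * 4)
    factor = scale * math.sqrt(PLE_DIM)
    values = [q * factor for q in struct.unpack(f"{ROW}b", raw)]
    # "e" is IEEE half precision; the buffer is logically [1, 1, 42, 256].
    return struct.pack(f"<{ROW}e", *values)


def demo():
    """Build a 4-row toy table, then gather the row for token id 2."""
    vocab = 4
    with tempfile.TemporaryDirectory() as tmp:
        table_path = os.path.join(tmp, "toy_table.i8")
        scale_path = os.path.join(tmp, "toy_scale.f32")
        with open(table_path, "wb") as f:
            for token in range(vocab):
                row = [(token + j) % 5 - 2 for j in range(ROW)]
                f.write(struct.pack(f"{ROW}b", *row))
        with open(scale_path, "wb") as f:
            f.write(struct.pack(f"<{vocab}f", 0.5, 0.25, 0.125, 1.0))
        table_mm = open_readonly_mmap(table_path)
        scale_mm = open_readonly_mmap(scale_path)
        try:
            return gather_ple_row(table_mm, scale_mm, 2)
        finally:
            table_mm.close()
            scale_mm.close()


if __name__ == "__main__":
    print(len(demo()), "bytes of fp16 for one token")
```

Because each token touches a single 10752-byte row plus one 4-byte scale, the mapped table never needs to be fully resident, which fits the small measured footprint reported for provider mode.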
Reproduce from scratch (oracle + tables are checkpoint-derived; regenerate for any new
weights): conversion/export_gemma4_decode_pipelined.py
with --hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized.
License
Gemma is provided under and subject to the Gemma Terms of Use
(https://ai.google.dev/gemma/terms). These .aimodel bundles are Model Derivatives of
google/gemma-4-E4B-it-qat-q4_0-unquantized;
by downloading or using them you agree to those terms, including the
Gemma Prohibited Use Policy.
Sibling repo (E2B, incl. its own official-QAT bundles): gemma-4-E2B-CoreAI.
Model tree for mlboydaisuke/gemma-4-E4B-CoreAI
Base model
google/gemma-4-E4B