Gemma 4 E4B (text): Apple Core AI (.aimodel)

Gemma 4 E4B's text decoder, converted to Apple's Core AI (the Core ML successor announced at WWDC26). It runs on iOS 27 / macOS 27 via Apple's coreai-pipelined GPU engine with zero custom kernels. The greedy oracle is 8/8 exact against the fp32 Hugging Face reference on both the Mac GPU and the iPhone GPU, and the iPhone is 24/24 token-identical to the Mac on the determinism probe.

Converted directly from Google's official QAT release google/gemma-4-E4B-it-qat-q4_0-unquantized: bf16 weights trained for q4_0 rounding. q4_0 (per-block-32 absmax linear int4) is also this bundle's quantization class. Google publishes these checkpoints as "preserving similar quality to bfloat16", so this int4 conversion inherits that quality claim by design rather than through post-hoc gating.
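To make "per-block-32 absmax linear int4" concrete, here is a minimal sketch of that quantization class in plain Python. It is an illustration only: the symmetric mapping onto [-7, 7] and the helper names are assumptions, and the exact rounding convention used by the bundle (or by other q4_0 implementations) may differ.

```python
BLOCK = 32  # block size stated above: one scale per 32 weights


def quantize_block(values):
    """Quantize one block of 32 floats to signed int4 codes plus one scale."""
    if len(values) != BLOCK:
        raise ValueError("expected exactly 32 values per block")
    absmax = max(abs(v) for v in values)
    # Assumption: symmetric linear mapping of [-absmax, absmax] onto [-7, 7].
    scale = absmax / 7.0 if absmax > 0.0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Linear dequantization: weight = code * scale."""
    return [c * scale for c in codes]


def quantize(weights):
    """Split a flat weight list into blocks of 32 and quantize each one."""
    if len(weights) % BLOCK != 0:
        raise ValueError("weight count must be a multiple of 32")
    return [quantize_block(weights[i:i + BLOCK])
            for i in range(0, len(weights), BLOCK)]
```

With this mapping the per-weight reconstruction error is bounded by half a scale step. Weights trained with QAT already sit near the grid, which is why rounding them costs little.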

Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, engine patch stack: coreai-model-zoo (model card: zoo/gemma4-e4b.md).

Measured (greedy; M4 Max / iPhone 17 Pro, settled device)

| config | files | size | M4 Max decode / prefill | iPhone decode / prefill |
| --- | --- | --- | --- | --- |
| ★ provider (runs BOTH platforms) | gpu-pipelined/gemma4_e4b_qat_decode_int4lin/ + ios-frontend/gemma4_e4b_qat_gather_raw/ | 3.7 + 3.4 GB | 53.2 / 62.6 | 15.1 / 21.3 |
| ★ provider, iPhone-ready AOT | gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/ (precompiled .aimodelc, h18p = iPhone 17 Pro class only) + the same tables | 3.7 + 3.4 GB | - | same as above; skip the AOT step |
| tbl (Mac-fastest) | gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/ + the two embed_per_layer.* table files | 3.7 + 2.7 GB | 55.8 / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |

On iPhone the working set stays small: measured peak footprint is 2.2 GB (4.2 GB headroom), because the per-layer embedding (PLE) table rides as a clean mmap and the AOT executable pages are evictable. Both phases land exactly on the bandwidth model (~2.1 GB int4/token).

What E4B is (config + checkpoint verified)

Clean dense model, no MoE. 42 layers (full attention every 6th), hidden 2560, intermediate 10240 uniform, 8 query heads / 2 KV heads, dual head_dim 256/512, 18 KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in ios-frontend/gemma4_e4b_qat_gather_raw/), final-logit softcap 30. The QAT checkpoint prunes the never-used KV projections on the shared layers; the zoo's loader handles both layouts.
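The layer counts above can be cross-checked with a small sketch. Two details are assumptions rather than facts from this card: that "every 6th" means 0-based layers 5, 11, ..., 41, and that the 18 KV-shared layers are the last 18. The "local" label for the remaining layers is a placeholder name.

```python
NUM_LAYERS = 42
FULL_ATTENTION_PERIOD = 6   # "full attention every 6th"
NUM_KV_SHARED = 18
PLE_DIM = 256               # per-layer embedding width per layer


def layer_plan():
    """Return one dict per layer describing its attention kind and KV sharing."""
    first_shared = NUM_LAYERS - NUM_KV_SHARED
    plan = []
    for i in range(NUM_LAYERS):
        plan.append({
            "index": i,
            # Assumption: the 6th, 12th, ... layers (1-based) use full attention.
            "attention": "full" if (i + 1) % FULL_ATTENTION_PERIOD == 0 else "local",
            # Assumption: the KV-shared layers are the final 18 layers.
            "kv_shared": i >= first_shared,
        })
    return plan


def summarize(plan):
    """Count the quantities the card quotes so they can be compared directly."""
    return {
        "full_attention_layers": sum(1 for p in plan if p["attention"] == "full"),
        "non_shared_kv_layers": sum(1 for p in plan if not p["kv_shared"]),
        "ple_row_width": NUM_LAYERS * PLE_DIM,
    }
```

Under these assumptions the summary gives 24 non-shared layers and a per-token PLE row width of 42 * 256 = 10752, matching the table shape [262144, 10752] quoted above.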

Run contract (each item is load-bearing)

Full story + traps: pipelined-engine page.

  1. Swift stack = apple/coreai-models + the zoo's patch stack (apps/*.patch, in order). The ★ provider bundle needs EngineOptions.perTokenInputProvider (coreai-pipelined-per-token-inputs.patch); the tbl bundle needs EngineOptions.staticInputBuffers (coreai-pipelined-static-inputs.patch).
  2. Provider mode: per token, fill ple_tokens [1,1,42,256] fp16 from the table dump: row = i8[id] * scale[id] * sqrt(256), mmap-gathered (~0.1 ms). tbl mode: bind ple_table ← embed_per_layer.i8 and ple_scale ← embed_per_layer.scale.f32 as OWNED storageModeShared MTLBuffers (buffer-backing traps in the knowledge page).
  3. Set COREAI_CHUNK_THRESHOLD=1 before engine creation; never call engine.warmup() (S=1 graph; a 1-token generate after load is the warmup).
  4. iPhone: AOT is mandatory (the 3.7 GB-constants graph crashes the on-device specializer). Use the precompiled _aotc_h18p/ bundle, or run xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu --architecture h18p --expect-frequent-reshapes and point metadata.json's assets.main at the .aimodelc. Ship the com.apple.developer.kernel.increased-memory-limit entitlement as headroom insurance, and bench a settled device (a just-unlocked iPhone under-reads by ~35%).
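The provider-mode gather in item 2 can be sketched in Python to show the arithmetic; the real implementation lives in the Swift stack. The file layout is an assumption: a row-major int8 table of shape [vocab, 42 * 256] and one little-endian f32 scale per token id. The function names and paths are hypothetical.

```python
import math
import mmap
import struct

NUM_LAYERS = 42
PLE_DIM = 256
ROW_WIDTH = NUM_LAYERS * PLE_DIM   # 10752 int8 values per token id
PLE_GAIN = math.sqrt(PLE_DIM)      # sqrt(256) = 16.0


def open_readonly(path):
    """Map a file read-only so only the pages actually touched are paged in."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def gather_ple_tokens(table, scales, token_id):
    """Build the fp16 payload for ple_tokens [1,1,42,256] for one token id.

    Implements row = i8[id] * scale[id] * sqrt(256) and returns the
    42 * 256 values packed as little-endian fp16 bytes.
    """
    row_start = token_id * ROW_WIDTH
    row = struct.unpack(f"{ROW_WIDTH}b", table[row_start:row_start + ROW_WIDTH])
    scale_start = token_id * 4
    (scale,) = struct.unpack("<f", scales[scale_start:scale_start + 4])
    gain = scale * PLE_GAIN
    # "e" is the struct format code for IEEE 754 half precision.
    return struct.pack(f"<{ROW_WIDTH}e", *(v * gain for v in row))
```

Each call reads one 10752-byte row and one 4-byte scale, so the mapped table never has to be resident as a whole. That is consistent with the small iPhone footprint reported above.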

Reproduce from scratch (oracle + tables are checkpoint-derived; regenerate them for any new weights): conversion/export_gemma4_decode_pipelined.py with --hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized.

License

Gemma is provided under and subject to the Gemma Terms of Use (https://ai.google.dev/gemma/terms). These .aimodel bundles are Model Derivatives of google/gemma-4-E4B-it-qat-q4_0-unquantized; by downloading or using them you agree to those terms, including the Gemma Prohibited Use Policy.

Sibling repo (E2B, incl. its own official-QAT bundles): gemma-4-E2B-CoreAI.
