Gemma 4 E2B (text): Apple Core AI (.aimodel)

Gemma 4 E2B's text decoder converted to Apple's Core AI (the Core ML successor announced at WWDC26), ready to run on iOS 27 / macOS 27. Greedy decoding is 8/8 exact vs the Hugging Face reference on the iPhone GPU, the iPhone Neural Engine, and the Mac GPU. The GPU bundles embed custom fused int8/int4 Metal kernels inside the .aimodel (a Core AI feature); the ANE bundles are kernel-free and numerically hardened for fp16 NPU execution.

This repo publishes one set per platform × compute unit: the best verified configuration (each file is the exact artifact behind the published numbers, nothing experimental), plus the gpu-pipelined/ fast path: ONE kernel-free graph that is the fastest decode on BOTH Mac and iPhone (Apple's coreai-pipelined engine + the zoo's engine patch stack).

Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, Swift runner: coreai-model-zoo.

Pick your platform (measured: iPhone 17 Pro / M4 Max, greedy, 8/8 exact vs HF)

| Category | Files | Size | Decode |
| --- | --- | --- | --- |
| iOS GPU | ios-frontend/gemma4_gather_raw/ + ios-gpu/gemma4_e2b_metal_int4km_L35.aimodel + ios-gpu/gemma4_e2b_head_argmax_int4km.aimodel | 2.6 + 1.3 + 0.2 GB | 22 tok/s |
| iOS ANE | ios-frontend/gemma4_gather_raw/ + ios-ane/gemma4_e2b_hostcache_chunk{1..6}_int8.aimodel + ios-ane/gemma4_e2b_head_argmax_int8.aimodel (+ gemma4_chunks_plan.json) | 2.6 + 1.8 + 0.4 GB | 6 tok/s |
| macOS GPU | macos/gemma4_e2b_frontend_int8.aimodel + macos/gemma4_e2b_metal_int8v3_L35.aimodel + macos/gemma4_e2b_head_argmax_kernel.aimodel | 2.6 + 2.0 + 0.4 GB | 56.6–59.0 tok/s (release build) |
| ★ GPU pipelined (Mac + iOS) | gpu-pipelined/gemma4_e2b_decode_int4lin_tbl/ + ios-frontend/gemma4_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32} | 2.0 + 2.4 GB | 77.0 tok/s (M4 Max) · 30.3 tok/s (iPhone 17 Pro, AOT) |
| ★ GPU pipelined, iPhone-ready AOT | gpu-pipelined/gemma4_e2b_decode_int4lin_tbl_aotc_h18p/ (precompiled .aimodelc, h18p = iPhone 17 Pro class only) + the same two gemma4_gather_raw table files | 2.0 + 2.4 GB | same as above on iPhone; skip the AOT step |
| ★★ GPU pipelined, official-QAT int4 | gpu-pipelined/gemma4_e2b_qat_decode_int4lin_tbl/ (+ …_tbl_aotc_h18p/ precompiled) + ios-frontend/gemma4_qat_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32} (QAT bundles need the QAT tables) | 2.0 + 2.4 GB | 78.9 tok/s (M4 Max) · 30.7 tok/s (iPhone); same speed, int4 ≈ bf16 by design (see below) |
| ★★★ VISION (VL): image+text → text | gpu-pipelined/gemma4_e2b_qat_vl_decode_int4linsym_tbl/ (Mac) or …_vl_decode_int4linsym/ + …_aotc_h18p/ (iPhone, provider+AOT) + gpu-pipelined/gemma4_e2b_qat_vl_vision/ + the QAT tables | 2.0 + 0.3 + 2.4 GB | 82.4 tok/s (M4 Max) · 25.5 tok/s (iPhone); the text decoder + a 3-line image splice |

(ios-frontend/ is shared by both iPhone categories; download it once.)

Architecture is a 3-stage flow (Gemma 4's giant embedding/PLE tables stay out of the graph): frontend gather (mmap / int8 gather) → 35-layer decode core → 262k-vocab head (+argmax).

  • iOS GPU core = int4 k-means fused-kernel monolith (16-entry codebook staged in threadgroup memory, packed nibble loads); the head does the 262,144-vocab matvec and argmax in-kernel (returns (value, index) partials, so no logits readback).
  • iOS ANE chunks = 6 fixed-shape chunks (the 35-layer monolith overflows the first-run ANE compile) with the two fp16 hardening fixes baked in: RMSNorm via the LayerNorm([x, −x]) identity (fp32-accumulating LN kernel) and Conv2d 1×1 projections (fp32 conv-engine MACs).
  • macOS core = int8 k-means fused-kernel monolith (uint32-packed index loads).
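The LayerNorm([x, −x]) identity mentioned for the ANE chunks can be checked numerically. Concatenating x with its negation gives a vector with zero mean and variance equal to mean(x²), so the first half of its LayerNorm output equals RMSNorm(x). The sketch below is a plain-Python illustration without learned gains; the function names are hypothetical and this is not the conversion code.

```python
import math

def rms_norm(x, eps=1e-6):
    """Reference RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def layer_norm(x, eps=1e-6):
    """Plain LayerNorm without affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm_via_layer_norm(x, eps=1e-6):
    """RMSNorm expressed through LayerNorm on the concatenation [x, -x].

    The doubled vector has mean 0 and variance mean(x^2), so the first
    len(x) outputs of LayerNorm match RMSNorm(x).
    """
    doubled = list(x) + [-v for v in x]
    return layer_norm(doubled, eps)[: len(x)]

if __name__ == "__main__":
    x = [0.5, -1.25, 3.0, 2.0]
    print(rms_norm(x))
    print(rms_norm_via_layer_norm(x))
```

The two printed vectors should agree up to floating-point rounding; the practical point in the source is that the LayerNorm kernel accumulates in fp32 on the ANE.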

★★★ Vision (Gemma 4 E2B VL): image+text → text

The same QAT checkpoint's vision path, riding the text decoder via the zoo's static-inputs patch. The image span is causal on E2B (verified vs the fp32 HF mask dump), so positions/masks/KV need nothing new:

  • gpu-pipelined/gemma4_e2b_qat_vl_vision/: fixed-grid vision encoder, run once per image: patches [2304,768] f16 → image_embeds [256,1536] (square 768×768 image = 48×48 = 2304 patches → 256 soft tokens; ~100–170 ms).
  • Decoder: Mac = gemma4_e2b_qat_vl_decode_int4linsym_tbl/ (tables in-graph, 95.2 prefill / 82.4 decode tok/s on M4 Max). iPhone = gemma4_e2b_qat_vl_decode_int4linsym{,_aotc_h18p}/ (provider mode, because the tbl gather overflows an iOS per-encode scratch heap on this beta; 41.2 / 25.5 tok/s on iPhone 17 Pro, footprint 1.96 GB) + the ios-frontend/gemma4_qat_gather_raw/ tables.
  • Host contract: rewrite the prompt's 256 <image_soft_token> ids to extension ids V + slot, bind image_embeds [280,1536] as a static buffer (square fills rows 0..255); provider mode maps extension ids → the PLE pad row. Quantization is plain absmax int4 (--lin-sym), i.e. the QAT-q4_0 grid; clipping compounds errors at long contexts.
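The id rewrite in the host contract can be sketched as follows. This is a minimal illustration assuming V is the text vocabulary size and that slots are assigned in prompt order starting at 0; the function name, the image_token_id value, and the toy vocabulary size in the usage lines are hypothetical and not taken from the zoo's runner.

```python
def rewrite_image_tokens(input_ids, image_token_id, vocab_size, max_slots=256):
    """Replace each image soft-token id with an extension id vocab_size + slot.

    Slots count up in prompt order, so the k-th soft token maps to row k of
    the bound image_embeds buffer. Returns (rewritten_ids, slots_used).
    """
    rewritten = []
    slot = 0
    for tok in input_ids:
        if tok == image_token_id:
            if slot >= max_slots:
                raise ValueError("more image soft tokens than available slots")
            rewritten.append(vocab_size + slot)
            slot += 1
        else:
            rewritten.append(tok)
    return rewritten, slot

if __name__ == "__main__":
    # Toy values only: a 1000-entry vocabulary and soft-token id 7.
    ids, used = rewrite_image_tokens([2, 7, 7, 7, 5], image_token_id=7, vocab_size=1000)
    print(ids, used)
```

Ordinary text ids pass through unchanged, so the decoder only sees ids at or above V where an image row should be spliced in.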

Numerics: Mac engine ≡ python gate 24/24 token-for-token; margin-ruled exact vs the fp32 HF oracle (a flip only where the oracle's top-2 gap < 0.1). Details + conversion script: zoo/gemma4-vl.md.
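One way to read the margin rule is sketched below: an engine token may differ from the oracle's top-1 only at steps where the oracle's own top-2 logit gap is under the margin. The sketch assumes teacher-forced oracle logits (one logit vector per step for the same context) and a hypothetical function name; the zoo's actual gate may be stricter, for example by also requiring the flipped token to be the oracle's runner-up.

```python
def margin_ruled_exact(engine_tokens, oracle_logits, margin=0.1):
    """Check engine tokens against oracle logits under a top-2 margin rule.

    Returns (ok, flips): ok is False as soon as a mismatch occurs at a step
    where the oracle's top-1 minus top-2 logit gap is >= margin; flips lists
    the step indices of tolerated mismatches.
    """
    flips = []
    for step, (tok, logits) in enumerate(zip(engine_tokens, oracle_logits)):
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top1, top2 = ranked[0], ranked[1]
        if tok == top1:
            continue
        gap = logits[top1] - logits[top2]
        if gap >= margin:
            return False, flips
        flips.append(step)
    return True, flips

if __name__ == "__main__":
    logits = [[2.0, 0.5, 0.1], [1.00, 0.95, 0.0]]
    print(margin_ruled_exact([0, 1], logits))  # near-tie at step 1 is tolerated
    print(margin_ruled_exact([1, 0], logits))  # clear miss at step 0 fails
```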

Run it

Python (macOS 27): load with coreai.runtime.AIModel on the GPU delegate (SpecializationOptions.from_preferred_compute_unit_kind(ComputeUnitKind.gpu())), drive frontend → core → head per token. Swift/device: push the set into your app sandbox (xcrun devicectl device copy to --domain-type appDataContainer). Walkthroughs + the burned-in gotchas: knowledge base · Swift runtime notes. Tokenizer: use the original google/gemma-4-E2B-it tokenizer files.
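The per-token frontend → core → head drive can be sketched as a plain greedy loop. The three stages are passed in as ordinary callables with assumed signatures (frontend(token_id), core(embedding, position, state), head(hidden) returning the argmax id); these are hypothetical stand-ins, not the coreai.runtime API, whose real bindings are covered in the knowledge base.

```python
def greedy_decode(prompt_ids, frontend, core, head, max_new_tokens, eos_id=None):
    """Greedy generation over three stages, one token per step.

    frontend(token_id) -> embedding inputs for the core
    core(embedding, position, state) -> (hidden, new_state)
    head(hidden) -> next token id (argmax already applied)
    """
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")
    if max_new_tokens <= 0:
        return []

    state = None  # stands in for the KV cache, owned by the caller here
    next_id = None
    # Prefill as S=1 steps: only the last step's head output is kept.
    for pos, tok in enumerate(prompt_ids):
        hidden, state = core(frontend(tok), pos, state)
        next_id = head(hidden)

    generated = []
    pos = len(prompt_ids)
    while True:
        generated.append(next_id)
        if next_id == eos_id or len(generated) >= max_new_tokens:
            break
        hidden, state = core(frontend(next_id), pos, state)
        next_id = head(hidden)
        pos += 1
    return generated

if __name__ == "__main__":
    # Toy stages: the "model" just adds one to the last token id.
    out = greedy_decode([1, 2, 3], lambda t: t,
                        lambda e, p, s: (e + 1, (s or 0) + 1),
                        lambda h: h, max_new_tokens=3)
    print(out)
```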

Two device gotchas (measured on the beta, 2026-06-10):

  1. Verify each multi-GB copy completed (xcrun devicectl device info files …) before the app's first load: loading a partially copied .aimodel poisons the on-device specialization cache for that content hash (later loads fail ENOENT even after the copy finishes).
  2. Optional AOT: xcrun coreai-build compile <m>.aimodel --platform iOS --preferred-compute gpu --architecture h18p → a .aimodelc that skips the on-device compile (first load ~4× faster, decode tok/s identical to the plain .aimodel). The arch name follows the device identifier, not the marketing name: iPhone 17 Pro = iPhone18,1 → h18p (an h17p build fails to load with invalidCompiledModel).

⚠️ Known beta issue affecting all Core AI LLMs (these bundles use the host-cache form that dodges it): the KV-write bug page (FB23024751 / apple/coreai-models#5).

★ GPU-pipelined fast path (zero custom kernels): gpu-pipelined/

One decode-only S=1 LanguageBundle (input_ids [1,1] static, dynamic position/KV, embed + soft-capped head in-graph, and the 2.3 GB per-layer-embedding table as a STATIC graph input gathered in-graph by token id) rides Apple's coreai-pipelined engine: async non-blocking encode, on-GPU argmax, on-device KV growth. Measured in tok/s (greedy; oracle 8/8, iPhone 24/24 token-identical to Mac-GPU): M4 Max 77.0 decode / 87.1 prefill · iPhone 17 Pro 30.3 / 38.9, vs this repo's kernel monoliths (Mac 56.6–59, iPhone 22), with no Metal kernels at all.

Run contract (each item is load-bearing; full story + traps in the zoo's pipelined-engine page):

  1. Swift stack = apple/coreai-models + the zoo's 4-patch stack (apps/*.patch, applied in order); this bundle needs the EngineOptions.staticInputBuffers hook from coreai-pipelined-static-inputs.patch.
  2. Bind the two table files (download from ios-frontend/gemma4_gather_raw/) as static inputs: ple_table ← embed_per_layer.i8, ple_scale ← embed_per_layer.scale.f32, both as OWNED storageModeShared MTLBuffers (read the file in once). A PROT_READ-only mmap under makeBuffer(bytesNoCopy:) silently costs ~65 ms/GB per encode on macOS; a writable COW mmap is fine on the Mac but pays a residency tax on iPhone.
  3. Set COREAI_CHUNK_THRESHOLD=1 before engine creation (prefill = pipelined S=1 steps); never call engine.warmup() (it warms shape 256; the S=1 graph rejects it). A 1-token generate after load serves as the warmup.
  4. iPhone: AOT first. Run xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu --architecture h18p --expect-frequent-reshapes, then point metadata.json's assets.main at the .aimodelc (on this beta the plain bundle passes on-device specialization but the spec'd artifact asserts at first execute), or download the precompiled gpu-pipelined/gemma4_e2b_decode_int4lin_tbl_aotc_h18p/ (iPhone 17 Pro class). Ship the com.apple.developer.kernel.increased-memory-limit entitlement (the owned 2.35 GB table; measured peak footprint 4.4 GB vs a ~6.4 GB entitled limit) and bench a settled device (a just-unlocked iPhone under-reads ~35%).
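For orientation, the in-graph gather that consumes ple_table and ple_scale from step 2 can be mimicked on the host. The sketch assumes a row-major int8 table with one little-endian fp32 scale per token row; the real layout of embed_per_layer.i8 and embed_per_layer.scale.f32 is not specified here, and the function name and toy sizes are hypothetical.

```python
import struct

def gather_ple_row(table_i8, scales_f32, token_id, row_width):
    """Dequantize one per-layer-embedding row for a token id.

    table_i8:   bytes, int8 values, row-major, row_width values per token
    scales_f32: bytes, one little-endian float32 scale per token (assumed)
    """
    start = token_id * row_width
    raw = table_i8[start:start + row_width]
    if token_id < 0 or len(raw) != row_width:
        raise IndexError("token id outside the table")
    ints = struct.unpack(f"{row_width}b", raw)          # signed int8 values
    (scale,) = struct.unpack_from("<f", scales_f32, token_id * 4)
    return [q * scale for q in ints]

if __name__ == "__main__":
    # Toy table: 2 tokens, 3 values per row.
    table = struct.pack("6b", 1, -2, 3, -128, 0, 127)
    scales = struct.pack("<2f", 0.5, 0.25)
    print(gather_ple_row(table, scales, 1, 3))
```

On device the same lookup happens inside the graph against the bound Metal buffers, so host code like this is only useful for spot-checking a downloaded table.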

In-app: the zoo's CoreAIChat ships this config as the Gemma ⚑ engine mode (GPU/ANE/⚑ segment); it downloads the _aotc_h18p bundle plus the two table files and binds them as owned static buffers. Chat-surface on a settled iPhone 17 Pro: decode 32.7 / prefill 44.2 tok/s on a 200-token turn (vs 22 for the kernel-monolith GPU mode). First in-container load pays a one-time 2 GB spec-cache ingest (11 s engine load, ~6 s warm) and can invalidate sibling models' cached specializations once; the app's GEMMA_CLEAR_SPEC_CACHE=1 hook recovers from that.

The per-token-provider variant (PLE rows filled per step by a host callback; iPhone 26.5 decode / 40.5 prefill tok/s, no entitlement, clean mmap) is the lighter alternative; reproduce it from the same conversion script (conversion/export_gemma4_decode_pipelined.py, drop --tbl).

★★ Official QAT weights: int4 quality guaranteed by design

gpu-pipelined/gemma4_e2b_qat_decode_int4lin_tbl/ (+ the _aotc_h18p/ precompile) is the same graph re-exported from Google's official QAT release google/gemma-4-E2B-it-qat-q4_0-unquantized: bf16 weights trained for q4_0 rounding, and q4_0 is this bundle's quantization (per-block-32 absmax-class linear int4). Google publishes these checkpoints as "preserving similar quality to bfloat16", explicitly for custom downstream compilation, so the int4 claim here upgrades from "PTQ that gates 8/8" to int4 ≈ bf16 by design. Measured: same speed as the PTQ bundle (M4 Max 78.9 decode / 89.6 prefill tok/s; iPhone 17 Pro 30.7 / 36.7 settled; oracle 8/8 on python, engine, and device).
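To make "per-block-32 absmax-class linear int4" concrete, here is a simplified symmetric sketch: each block of 32 weights shares one scale derived from the block's largest magnitude, and each weight is rounded to a 4-bit code. It is an illustration under that assumption only; the reference q4_0 format and the conversion script differ in details such as how the scale is derived and how codes are stored, and the function names are hypothetical.

```python
def quantize_block_int4(block):
    """Symmetric absmax int4 for one block: codes in [-8, 7], one scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, [0] * len(block)
    scale = amax / 7.0
    codes = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, codes

def quantize_int4(weights, block_size=32):
    """Split weights into fixed-size blocks and quantize each independently."""
    if len(weights) % block_size:
        raise ValueError("weight count must be a multiple of the block size")
    return [quantize_block_int4(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]

def dequantize_int4(blocks):
    """Reconstruct approximate weights: code * scale per block."""
    out = []
    for scale, codes in blocks:
        out.extend(c * scale for c in codes)
    return out

if __name__ == "__main__":
    weights = [float(i - 16) for i in range(32)]
    blocks = quantize_int4(weights)
    restored = dequantize_int4(blocks)
    print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Because the scale comes from the block's own maximum, rounding error per weight stays within half a quantization step; QAT training lets the bf16 weights settle close to this grid before export.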

⚠️ Pair QAT bundles with the QAT tables: bind ios-frontend/gemma4_qat_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32}. The PLE table is checkpoint-derived, so the original gemma4_gather_raw/ files do NOT match these weights. Everything else (patch stack, chunk threshold, entitlement, AOT) is identical to the PTQ run contract above. Gemma 4 E4B (the bigger sibling, also from official QAT weights) lives in its own repo: gemma-4-E4B-CoreAI.

Parity

All three sets reproduce the HF eager greedy reference 8/8 top-1 exact ("What is the capital of France?" → "The capital of France is Paris."), verified on macOS conversion and re-verified end-to-end on device per compute unit.

License

Gemma is provided under and subject to the Gemma Terms of Use (https://ai.google.dev/gemma/terms). These .aimodel bundles are Model Derivatives of google/gemma-4-E2B-it; by downloading or using them you agree to those terms, including the Gemma Prohibited Use Policy.

CoreML (iOS 18+) variants: gemma-4-E2B-coreml · gemma-4-E2B-stateful-coreml.
