Gemma 4 12B IT - Core AI (.aimodel), 4-bit, multimodal
google/gemma-4-12B-it converted to Apple's
Core AI format (macOS 27 / iOS 27, WWDC 2026) with INT4 (block-32, weight-only)
quantization of the decoder. Runs on Apple Silicon GPU via CoreAI.framework.
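To make "INT4, block-32, weight-only" concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It assumes a symmetric absmax scheme with one float scale per block of 32 weights. The scheme the conversion actually uses (symmetric vs. affine, scale dtype, packing) is not stated here, and the function names are hypothetical.

```python
# Sketch only: symmetric absmax INT4 with one scale per block of 32 weights.
# ASSUMPTION: the real conversion may use a different scheme (e.g. affine with
# zero points); this only illustrates the general idea.

BLOCK = 32          # weights per quantization block ("block-32")
QMIN, QMAX = -8, 7  # signed 4-bit integer range


def quantize_block(block):
    """Return (scale, codes) for one block of float weights."""
    absmax = max(abs(w) for w in block)
    scale = absmax / QMAX if absmax > 0 else 1.0
    codes = [max(QMIN, min(QMAX, round(w / scale))) for w in block]
    return scale, codes


def quantize_row(row):
    """Split a weight row into blocks of 32 and quantize each independently."""
    if len(row) % BLOCK != 0:
        raise ValueError("row length must be a multiple of the block size")
    return [quantize_block(row[i:i + BLOCK]) for i in range(0, len(row), BLOCK)]


def dequantize_row(blocks):
    """Reconstruct approximate float weights; activations stay unquantized."""
    return [scale * q for scale, codes in blocks for q in codes]


# Toy weight row of 64 values in [-1, 1].
row = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
blocks = quantize_row(row)
restored = dequantize_row(blocks)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(blocks), "blocks, max abs error", max_err)
```

Under this scheme the reconstruction error per weight is bounded by half the block's scale, which is why smaller blocks track outlier-heavy weight rows more closely than a single per-row scale.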
This is, to our knowledge, the first Gemma 4 conversion for Core AI. The
gemma4_unified architecture (interleaved sliding/global attention with K==V
global MQA, proportional partial RoPE, encoder-free multimodality) is not yet
covered by Apple's coreai-models
recipes; the conversion recipe used here is a custom extension of that toolkit.
Bundle contents
| File | Contents |
|---|---|
| gemma_4_12b_it_mm_4bit.aimodel | multi-function Core AI asset (~6.4 GB) |
| tokenizer/ | HF tokenizer + chat template |
| metadata.json | bundle metadata (0.2 schema) + multimodal constants |
Functions
| Function | Inputs | Output | Purpose |
|---|---|---|---|
| main | input_ids, position_ids (+ keyCache/valueCache states) | logits | text prefill + decode (causal, built-in sliding window) |
| prefill_multimodal | input_ids, mm_embeds, mm_mask, position_ids, mask_sliding, mask_global (+ KV states) | logits | multimodal prefill; text embedding lookup happens in-graph, multimodal embeddings spliced via mm_mask |
| embed_vision | pixel_values [1,P,6912], image_position_ids [1,P,2] | embeds [1,P,3840] | encoder-free image/video-frame embedder (48×48×3 merged patches) |
| embed_audio | input_features [1,T,640] | embeds [1,T,3840] | raw 16 kHz audio frames (640 samples per token) |
Attention masks are boolean (True = attend); a provided mask fully
overrides the built-in causal/sliding-window behavior, so the multimodal
prefill masks must encode causality + the 1024-token sliding window + the
bidirectional attention within each image/video token block
(mm_token_type_ids 1/2), mirroring HF Gemma4UnifiedModel.
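The mask rules above can be sketched in plain Python as follows. This is an illustration under assumptions, not the reference implementation: the function name is hypothetical, the window boundary is taken as `q - k < window`, and in-block bidirectional attention is assumed to apply to both masks. Check these details against the HF reference before relying on them.

```python
SLIDING_WINDOW = 1024  # sliding-window size stated above


def build_masks(mm_token_type_ids, window=SLIDING_WINDOW):
    """Return (mask_sliding, mask_global) as [query][key] lists of bool.

    mm_token_type_ids: 0 = text, 1 = image, 2 = video. True means "attend".
    """
    # Give each contiguous run of image/video tokens its own block id (-1 = text).
    block_id, current, prev = [], -1, 0
    for t in mm_token_type_ids:
        if t in (1, 2):
            if t != prev:
                current += 1
            block_id.append(current)
        else:
            block_id.append(-1)
        prev = t

    n = len(mm_token_type_ids)
    mask_sliding, mask_global = [], []
    for q in range(n):
        s_row, g_row = [], []
        for k in range(n):
            causal = k <= q
            # ASSUMPTION: a query sees itself plus the previous window-1 keys.
            in_window = causal and (q - k) < window
            # Tokens inside one image/video block attend to each other freely.
            same_block = block_id[q] >= 0 and block_id[q] == block_id[k]
            g_row.append(causal or same_block)
            s_row.append(in_window or same_block)
        mask_sliding.append(s_row)
        mask_global.append(g_row)
    return mask_sliding, mask_global


# Toy sequence: text, three image tokens, two text tokens; tiny window for display.
sliding, global_ = build_masks([0, 1, 1, 1, 0, 0], window=2)
for row in sliding:
    print("".join("x" if v else "." for v in row))
```

Nested lists keep the sketch dependency-free. A real client would build the same pattern with array operations and pass it in whatever boolean tensor layout the prefill_multimodal function expects.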
Multimodal token constants (also in metadata.json): image 258880, audio
258881, video 258884, BOI 255999, EOI 258882; stop tokens 1, 106
(<end_of_turn>).
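As an illustration of how these constants might be used client-side, the sketch below expands one image placeholder per image into a BOI token, a run of image tokens (one per merged patch), and an EOI token, and builds a matching mm_mask. The prompt layout before expansion (a single image token per image) and the helper names are assumptions; the HF processor remains the reference for the exact layout.

```python
# Token ids as listed above.
IMAGE_TOKEN = 258880
BOI_TOKEN = 255999
EOI_TOKEN = 258882
STOP_TOKENS = {1, 106}  # 106 is <end_of_turn>


def expand_image_placeholders(input_ids, patches_per_image):
    """Expand each image placeholder to BOI + P image tokens + EOI.

    ASSUMPTION: the un-expanded prompt holds exactly one IMAGE_TOKEN per image.
    Returns (expanded_ids, mm_mask); mm_mask is True where an embedding from
    embed_vision should be spliced in.
    """
    expanded, mm_mask, image_index = [], [], 0
    for tok in input_ids:
        if tok == IMAGE_TOKEN:
            if image_index >= len(patches_per_image):
                raise ValueError("more image placeholders than images")
            p = patches_per_image[image_index]
            image_index += 1
            expanded += [BOI_TOKEN] + [IMAGE_TOKEN] * p + [EOI_TOKEN]
            mm_mask += [False] + [True] * p + [False]
        else:
            expanded.append(tok)
            mm_mask.append(False)
    if image_index != len(patches_per_image):
        raise ValueError("more images than image placeholders")
    return expanded, mm_mask


def is_stop(token_id):
    """True when decoding should end on this token."""
    return token_id in STOP_TOKENS


# Toy prompt: ids 2, 10 and 11 stand in for arbitrary text tokens.
ids, mask = expand_image_placeholders([2, 10, IMAGE_TOKEN, 11], [3])
print(ids)
print(mask)
```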
Performance (M-series Mac, GPU)
- Decode: ~39–46 tok/s · Prefill: ~290–313 tok/s · Warm load: ~2 s
- Text quality verified with greedy and sampled decoding; multimodal pipeline verified token-exact against the HF reference implementation (bf16 eager).
Usage
Drive with the CoreAILanguageModels Swift package from
apple/coreai-models (text path works
out of the box via main), or CoreAI.framework directly. The multimodal
path requires client-side preprocessing that mirrors the HF
Gemma4UnifiedProcessor: aspect-preserving resize (divisible by 48), 16 px
patchify, 3×3 patch merge to 6912-dim patches with XY position ids, audio
framing into 640-sample tokens, placeholder-token expansion, and the boolean
masks described above.
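The shape arithmetic behind these steps can be sketched as below. It only computes sizes, position ids, and audio frames; it does not resample pixels. The `max_side` budget, the nearest-multiple rounding, the row-major (x, y) ordering, and the zero padding of the last audio frame are all assumptions for illustration and may differ from what Gemma4UnifiedProcessor does.

```python
PATCH_PX = 16                          # base patch size in pixels
MERGE = 3                              # 3x3 patches merged into one token
MERGED_PX = PATCH_PX * MERGE           # 48 px per merged patch side
PATCH_DIM = MERGED_PX * MERGED_PX * 3  # 48 * 48 * 3 = 6912 values per patch
SAMPLES_PER_TOKEN = 640                # 16 kHz audio: 640 samples = 40 ms


def target_size(width, height, max_side=768):
    """Pick a resize target whose sides are multiples of 48.

    ASSUMPTION: max_side is a hypothetical budget; each side is snapped to the
    nearest multiple of 48, which keeps the aspect ratio approximately.
    """
    scale = min(1.0, max_side / max(width, height))

    def snap(v):
        return max(MERGED_PX, round(v * scale / MERGED_PX) * MERGED_PX)

    return snap(width), snap(height)


def patch_position_ids(width, height):
    """(x, y) grid coordinates, one pair per merged patch, row-major order."""
    cols, rows = width // MERGED_PX, height // MERGED_PX
    return [(x, y) for y in range(rows) for x in range(cols)]


def frame_audio(samples):
    """Split raw 16 kHz samples into 640-sample tokens, zero-padding the tail."""
    pad = (-len(samples)) % SAMPLES_PER_TOKEN
    padded = list(samples) + [0.0] * pad
    return [padded[i:i + SAMPLES_PER_TOKEN]
            for i in range(0, len(padded), SAMPLES_PER_TOKEN)]


w, h = target_size(1024, 768)
pos = patch_position_ids(w, h)
frames = frame_audio([0.0] * 16000)  # one second of silence
print((w, h), len(pos), "image tokens;", len(frames), "audio tokens")
```

With these numbers, the patch count P is the length of pixel_values along its second axis, and one second of 16 kHz audio yields 16000 / 640 = 25 audio tokens.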
Provenance & license
- Source weights: google/gemma-4-12B-it, Apache 2.0.
- Modifications: conversion to Core AI .aimodel via torch.export + coreai-torch; INT4 block-32 weight-only quantization of the decoder (embedders kept bf16); multi-function graph packaging described above.
- This distribution remains under Apache 2.0. See LICENSE and NOTICE.