Gemma 4 12B (dense) β Core AI
Apple Core AI (.aimodel) conversion of Google's Gemma 4 12B dense text decoder,
ported directly from the QAT release
google/gemma-4-12B-it-qat-q4_0-unquantized.
Decode-only, runs on the stock pipelined engine on Apple Silicon (M-series Macs).
First Core AI runtime for a β₯16-head Γ head_dim-512 full-attention model. Gemma 4 12B's full (global) attention layers have a 16-head Γ 512 Q tensor (16 KB fp16) that overflows MPSGraph's GPU decode scratch heap β the stock SDPA crashes at the first token (apple/coreai-models#27). These bundles ship a custom Metal flash-decode kernel on the full layers that removes the offending op, so the model runs. (The plain non-kernel bundles still crash β these
_msdpabundles are the runnable ones.)
Bundles (gpu-pipelined/)
| bundle | quant | size | decode (M4 Max) | quality |
|---|---|---|---|---|
gemma4_12b_qat_decode_int8lin_msdpa_g8 |
int8 (per-block-32) | 14 GB | ~23 tok/s (prefill 27.5) | verified-clean: engine greedy == fp32 oracle |
gemma4_12b_qat_decode_int4linsym_msdpa_g8 |
int4 (q4_0-aligned absmax) | 8.2 GB | ~33 tok/s (prefill 43.4) | answers correctly, slightly 4-bit-lossy phrasing |
The _g8 suffix is the higher-occupancy flash-decode kernel (8 SIMD-groups per head split the
global layers' KV scan) β it holds throughput at long context (int8 decode 17.5 β 20.3 tok/s at
1024 generated tokens vs the simple kernel) with identical numerics.
int8 is the verified-clean default (its teacher-forced greedy reproduces the fp32 oracle's "The capital of France is Paris." exactly). int4 is the faster / smaller option (16 GB-Mac accessible) at a small quality cost β the same precision class as MLX 4-bit, not a conversion bug (the int8 graph is exact).
Architecture
Clean dense gemma4_unified text decoder β no PLE / AltUp / Laurel / MoE / KV-sharing
(unlike the on-device E2B/E4B siblings). 48 layers, hidden 3840, 16 heads, vocab 262144, final
logit softcap 30, tied embeddings. 5:1 sliding:full interleave; dual head_dim (sliding 256 / full
global_head_dim 512); full layers use a single global KV head with attention_k_eq_v (value =
raw k_proj). Both attention shapes ride one growing KV pair, so the bundle loads on the stock
CoreAIPipelinedEngine (2 states, no engine patch); the full layers' SDPA runs as a custom Metal
flash-decode kernel.
Usage
Download a bundle and run with Apple's llm-runner / llm-benchmark (the pipelined engine; set
COREAI_CHUNK_THRESHOLD=1):
huggingface-cli download mlboydaisuke/Gemma-4-12B-CoreAI \
--include "gpu-pipelined/gemma4_12b_qat_decode_int8lin_msdpa_g8/*" \
--local-dir ./gemma4-12b-coreai
COREAI_CHUNK_THRESHOLD=1 llm-runner \
--model ./gemma4-12b-coreai/gpu-pipelined/gemma4_12b_qat_decode_int8lin_msdpa_g8 \
--prompt "What is the capital of France?" --max-tokens 64 --chunk-size 1
Each bundle is self-contained: the .aimodel, metadata.json, and the Gemma tokenizer.
Conversion
Community zoo (recipe, overlays, model card):
github.com/john-rocky/coreai-model-zoo β zoo/gemma4-12b.md.
License
Gemma β governed by the Gemma Terms of Use. By using these weights you agree to those terms. The conversion (Core AI bundles, custom Metal kernel) adds no additional restrictions.
Model tree for mlboydaisuke/Gemma-4-12B-CoreAI
Base model
google/gemma-4-12B