Gemma 4 12B (dense) β€” Core AI

Apple Core AI (.aimodel) conversion of Google's Gemma 4 12B dense text decoder, ported directly from the QAT release google/gemma-4-12B-it-qat-q4_0-unquantized. Decode-only, runs on the stock pipelined engine on Apple Silicon (M-series Macs).

First Core AI runtime for a β‰₯16-head Γ— head_dim-512 full-attention model. Gemma 4 12B's full (global) attention layers have a 16-head Γ— 512 Q tensor (16 KB fp16) that overflows MPSGraph's GPU decode scratch heap β€” the stock SDPA crashes at the first token (apple/coreai-models#27). These bundles ship a custom Metal flash-decode kernel on the full layers that removes the offending op, so the model runs. (The plain non-kernel bundles still crash β€” these _msdpa bundles are the runnable ones.)

Bundles (gpu-pipelined/)

bundle quant size decode (M4 Max) quality
gemma4_12b_qat_decode_int8lin_msdpa_g8 int8 (per-block-32) 14 GB ~23 tok/s (prefill 27.5) verified-clean: engine greedy == fp32 oracle
gemma4_12b_qat_decode_int4linsym_msdpa_g8 int4 (q4_0-aligned absmax) 8.2 GB ~33 tok/s (prefill 43.4) answers correctly, slightly 4-bit-lossy phrasing

The _g8 suffix is the higher-occupancy flash-decode kernel (8 SIMD-groups per head split the global layers' KV scan) β€” it holds throughput at long context (int8 decode 17.5 β†’ 20.3 tok/s at 1024 generated tokens vs the simple kernel) with identical numerics.

int8 is the verified-clean default (its teacher-forced greedy reproduces the fp32 oracle's "The capital of France is Paris." exactly). int4 is the faster / smaller option (16 GB-Mac accessible) at a small quality cost β€” the same precision class as MLX 4-bit, not a conversion bug (the int8 graph is exact).

Architecture

Clean dense gemma4_unified text decoder β€” no PLE / AltUp / Laurel / MoE / KV-sharing (unlike the on-device E2B/E4B siblings). 48 layers, hidden 3840, 16 heads, vocab 262144, final logit softcap 30, tied embeddings. 5:1 sliding:full interleave; dual head_dim (sliding 256 / full global_head_dim 512); full layers use a single global KV head with attention_k_eq_v (value = raw k_proj). Both attention shapes ride one growing KV pair, so the bundle loads on the stock CoreAIPipelinedEngine (2 states, no engine patch); the full layers' SDPA runs as a custom Metal flash-decode kernel.

Usage

Download a bundle and run with Apple's llm-runner / llm-benchmark (the pipelined engine; set COREAI_CHUNK_THRESHOLD=1):

huggingface-cli download mlboydaisuke/Gemma-4-12B-CoreAI \
    --include "gpu-pipelined/gemma4_12b_qat_decode_int8lin_msdpa_g8/*" \
    --local-dir ./gemma4-12b-coreai

COREAI_CHUNK_THRESHOLD=1 llm-runner \
    --model ./gemma4-12b-coreai/gpu-pipelined/gemma4_12b_qat_decode_int8lin_msdpa_g8 \
    --prompt "What is the capital of France?" --max-tokens 64 --chunk-size 1

Each bundle is self-contained: the .aimodel, metadata.json, and the Gemma tokenizer.

Conversion

Community zoo (recipe, overlays, model card): github.com/john-rocky/coreai-model-zoo β†’ zoo/gemma4-12b.md.

License

Gemma β€” governed by the Gemma Terms of Use. By using these weights you agree to those terms. The conversion (Core AI bundles, custom Metal kernel) adds no additional restrictions.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for mlboydaisuke/Gemma-4-12B-CoreAI

Finetuned
(14)
this model