Gemma 4 12B IT — Core AI (.aimodel), 4-bit, multimodal

google/gemma-4-12B-it converted to Apple's Core AI format (macOS 27 / iOS 27, WWDC 2026) with INT4 (block-32, weight-only) quantization of the decoder. Runs on Apple Silicon GPU via CoreAI.framework.

This is, to our knowledge, the first Gemma 4 conversion for Core AI. The gemma4_unified architecture (interleaved sliding/global attention with K==V global MQA, proportional partial RoPE, encoder-free multimodality) is not yet covered by Apple's coreai-models recipes; the conversion recipe used here is a custom extension of that toolkit.
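To make the "INT4, block-32, weight-only" description concrete, here is a minimal sketch of per-block 4-bit quantization. It assumes a symmetric scheme with one floating-point scale per block of 32 weights; the card does not say whether the actual conversion uses symmetric or asymmetric (zero-point) quantization, so treat the function names and the scale rule as illustrative assumptions rather than the toolkit's implementation.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # "block-32" from the card: one scale per 32 consecutive weights


def quantize_int4_blockwise(
    weights: List[float], block_size: int = BLOCK_SIZE
) -> Tuple[List[int], List[float]]:
    """Symmetric per-block INT4 quantization (assumed scheme, not confirmed by the card)."""
    if len(weights) % block_size != 0:
        raise ValueError("weight count must be a multiple of the block size")
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to code 7; all-zero blocks get scale 1.0.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in block:
            # Clamp to the signed 4-bit range [-8, 7].
            codes.append(max(-8, min(7, round(w / scale))))
    return codes, scales


def dequantize_int4_blockwise(
    codes: List[int], scales: List[float], block_size: int = BLOCK_SIZE
) -> List[float]:
    """Reconstruct approximate weights: code times the scale of its block."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]
```

"Weight-only" means just the stored weights are compressed this way; activations stay in floating point, and the weights are dequantized for the matrix multiply at run time.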

Bundle contents


| File | Contents |
| --- | --- |
| `gemma_4_12b_it_mm_4bit.aimodel` | multi-function Core AI asset (~6.4 GB) |
| `tokenizer/` | HF tokenizer + chat template |
| `metadata.json` | bundle metadata (0.2 schema) + multimodal constants |

Functions

| Function | Inputs | Output | Purpose |
| --- | --- | --- | --- |
| `main` | `input_ids`, `position_ids` (+ `keyCache`/`valueCache` states) | `logits` | text prefill + decode (causal, built-in sliding window) |
| `prefill_multimodal` | `input_ids`, `mm_embeds`, `mm_mask`, `position_ids`, `mask_sliding`, `mask_global` (+ KV states) | `logits` | multimodal prefill; text embedding lookup happens in-graph, multimodal embeddings spliced via `mm_mask` |
| `embed_vision` | `pixel_values [1,P,6912]`, `image_position_ids [1,P,2]` | `embeds [1,P,3840]` | encoder-free image/video-frame embedder (48×48×3 merged patches) |
| `embed_audio` | `input_features [1,T,640]` | `embeds [1,T,3840]` | raw 16 kHz audio frames (640 samples per token) |

Attention masks are boolean (True = attend). A provided mask fully overrides the built-in causal/sliding-window behavior, so the multimodal prefill masks must themselves encode three things: causality, the 1024-token sliding window, and bidirectional attention within each image or video token block (mm_token_type_ids 1 and 2, respectively), mirroring HF Gemma4UnifiedModel.
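The mask rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the reference implementation: it assumes that each contiguous run of identical nonzero `mm_token_type_ids` forms one image/video block, that the sliding window covers the current token plus the previous 1023, that only `mask_sliding` carries the window restriction, and that in-block bidirectional attention applies to both masks even where the window would otherwise exclude a key. The function name `build_prefill_masks` is hypothetical.

```python
from typing import List, Tuple

SLIDING_WINDOW = 1024  # window size stated in the card


def build_prefill_masks(
    mm_token_type_ids: List[int], sliding_window: int = SLIDING_WINDOW
) -> Tuple[List[List[bool]], List[List[bool]]]:
    """Return (mask_sliding, mask_global) as [query][key] boolean grids, True = attend."""
    n = len(mm_token_type_ids)

    # Give each contiguous run of image (1) or video (2) tokens its own block id.
    # Id 0 means "not inside an image/video block" (assumed grouping rule).
    block_ids = [0] * n
    next_block = 0
    for i, t in enumerate(mm_token_type_ids):
        if t in (1, 2):
            if i > 0 and mm_token_type_ids[i - 1] == t:
                block_ids[i] = block_ids[i - 1]
            else:
                next_block += 1
                block_ids[i] = next_block

    mask_sliding = [[False] * n for _ in range(n)]
    mask_global = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            in_window = q - k < sliding_window
            same_block = block_ids[q] != 0 and block_ids[q] == block_ids[k]
            mask_global[q][k] = causal or same_block
            mask_sliding[q][k] = (causal and in_window) or same_block
    return mask_sliding, mask_global
```

For example, with `mm_token_type_ids = [0, 1, 1, 1, 0]` the three image tokens attend to each other in both directions, while the surrounding text tokens remain strictly causal. The nested loops are quadratic in sequence length; a real client would likely vectorize this.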

Multimodal token constants (also in metadata.json): image 258880, audio 258881, video 258884, BOI 255999, EOI 258882; stop tokens 1, 106 (<end_of_turn>).
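As an illustration of how these constants might be used on the client, the sketch below expands image placeholders, derives a splice mask, and checks for stop tokens. The token ids come from the line above; everything else is an assumption: the layout BOI + N image tokens + EOI per placeholder, the idea that the prompt carries a single image token per image before expansion, and the helper names are hypothetical and should be checked against the HF processor and chat template.

```python
from typing import Iterable, List

# Token ids as listed in the card (also stored in metadata.json).
IMAGE_TOKEN_ID = 258880
AUDIO_TOKEN_ID = 258881
VIDEO_TOKEN_ID = 258884
BOI_TOKEN_ID = 255999
EOI_TOKEN_ID = 258882
STOP_TOKEN_IDS = frozenset({1, 106})  # 106 is <end_of_turn>


def expand_image_placeholders(
    input_ids: List[int], tokens_per_image: Iterable[int]
) -> List[int]:
    """Replace each image placeholder with BOI + N image tokens + EOI (assumed layout)."""
    counts = iter(tokens_per_image)
    out: List[int] = []
    for tok in input_ids:
        if tok != IMAGE_TOKEN_ID:
            out.append(tok)
            continue
        n = next(counts, None)
        if n is None:
            raise ValueError("more image placeholders than token counts")
        out.append(BOI_TOKEN_ID)
        out.extend([IMAGE_TOKEN_ID] * n)
        out.append(EOI_TOKEN_ID)
    return out


def build_mm_mask(input_ids: List[int]) -> List[bool]:
    """True where a multimodal embedding would be spliced in (assumed mm_mask semantics)."""
    soft_tokens = (IMAGE_TOKEN_ID, AUDIO_TOKEN_ID, VIDEO_TOKEN_ID)
    return [tok in soft_tokens for tok in input_ids]


def is_stop_token(token_id: int) -> bool:
    """Generation should end when the sampled token is one of the stop ids."""
    return token_id in STOP_TOKEN_IDS
```

Here `N` would be the number of merged patches `P` that `embed_vision` produces for that image, so the count of True entries in the mask matches the number of embedding rows to splice.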

Performance (M-series Mac, GPU)

Decode: ~39–46 tok/s · Prefill: ~290–313 tok/s · Warm load: ~2 s
Text quality was verified with both greedy and sampled decoding; the multimodal pipeline was verified token-exact against the HF reference implementation (bf16, eager mode).
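For rough budgeting, the throughput figures above translate into an end-to-end latency estimate with simple arithmetic. The sketch below uses the lower ends of the reported ranges as defaults; it assumes prefill and decode rates are constant across prompt lengths and ignores multimodal embedding time, neither of which the card confirms. The function name is hypothetical.

```python
def estimate_latency_seconds(
    prompt_tokens: int,
    new_tokens: int,
    prefill_tok_per_s: float = 290.0,  # lower end of the reported prefill range
    decode_tok_per_s: float = 39.0,    # lower end of the reported decode range
    warm_load_s: float = 2.0,          # reported warm load time
) -> float:
    """Warm load + prefill time + decode time, assuming constant throughput."""
    prefill_s = prompt_tokens / prefill_tok_per_s
    decode_s = new_tokens / decode_tok_per_s
    return warm_load_s + prefill_s + decode_s
```

Under these assumptions, a 290-token prompt with 39 generated tokens comes to about 2 + 1 + 1 = 4 seconds.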

Usage

Drive the model with the CoreAILanguageModels Swift package from apple/coreai-models (the text path works out of the box via main), or use CoreAI.framework directly. The multimodal path requires client-side preprocessing that mirrors the HF Gemma4UnifiedProcessor: aspect-preserving resize (to dimensions divisible by 48), 16 px patchify, 3×3 patch merge into 6912-dim patches with XY position ids, audio framing into 640-sample tokens, placeholder-token expansion, and construction of the boolean masks described above.
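The geometry behind those preprocessing steps can be sketched with a little arithmetic: a 3×3 merge of 16 px patches gives 48 px merged patches, and 48 × 48 × 3 channels gives the 6912-dim input of `embed_vision`. The code below is a sketch under assumptions, not the HF processor: the `max_merged_patches` budget, the floor-based rounding, the row-major (x, y) ordering of position ids, and zero-padding of the final audio frame are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

PATCH_PX = 16                            # base patch size from the card
MERGE = 3                                # 3x3 patch merge
MERGED_PX = PATCH_PX * MERGE             # 48 px per merged patch side
PATCH_DIM = MERGED_PX * MERGED_PX * 3    # 6912 values per merged RGB patch
SAMPLES_PER_AUDIO_TOKEN = 640            # 40 ms of audio at 16 kHz


def target_size(height: int, width: int, max_merged_patches: int) -> Tuple[int, int]:
    """Aspect-preserving size with both sides divisible by 48 (assumed patch budget)."""
    scale = min(1.0, math.sqrt(max_merged_patches * MERGED_PX ** 2 / (height * width)))
    rows = max(1, math.floor(height * scale / MERGED_PX))
    cols = max(1, math.floor(width * scale / MERGED_PX))
    return rows * MERGED_PX, cols * MERGED_PX


def patch_position_ids(rows: int, cols: int) -> List[Tuple[int, int]]:
    """(x, y) grid coordinates per merged patch, row-major (assumed ordering)."""
    return [(x, y) for y in range(rows) for x in range(cols)]


def frame_audio(samples: Sequence[float]) -> List[List[float]]:
    """Split 16 kHz samples into 640-sample frames, zero-padding the tail (assumption)."""
    n_tokens = math.ceil(len(samples) / SAMPLES_PER_AUDIO_TOKEN)
    padded = list(samples) + [0.0] * (n_tokens * SAMPLES_PER_AUDIO_TOKEN - len(samples))
    return [
        padded[i * SAMPLES_PER_AUDIO_TOKEN:(i + 1) * SAMPLES_PER_AUDIO_TOKEN]
        for i in range(n_tokens)
    ]
```

With these helpers, the number of merged patches `P` is `(height // 48) * (width // 48)` after resizing, and the number of audio tokens `T` is the length of the list returned by `frame_audio`; one second of audio yields 25 tokens.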

Provenance & license

Source weights: google/gemma-4-12B-it, Apache 2.0.
Modifications: conversion to Core AI .aimodel via torch.export + coreai-torch; INT4 block-32 weight-only quantization of the decoder (embedders kept bf16); multi-function graph packaging described above.
This distribution remains under Apache 2.0. See LICENSE and NOTICE.
