Gemma 4 12B IT: Core AI (.aimodel), 4-bit, multimodal

google/gemma-4-12B-it converted to Apple's Core AI format (macOS 27 / iOS 27, WWDC 2026) with INT4 (block-32, weight-only) quantization of the decoder. Runs on Apple Silicon GPU via CoreAI.framework.
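
To make "INT4 (block-32, weight-only)" concrete, the sketch below quantizes a flat list of weights in blocks of 32, storing one scale per block and 4-bit integer codes in [-8, 7]. The symmetric rounding scheme and the function names are assumptions for illustration only; the scheme actually applied by the conversion toolkit may differ (for example, it may use zero points or a different scale format).

```python
BLOCK = 32  # weights per quantization block, per the "block-32" description


def quantize_int4(weights):
    """Symmetric per-block 4-bit quantization (assumed scheme, illustration only)."""
    if len(weights) % BLOCK != 0:
        raise ValueError("length must be a multiple of the block size")
    scales, codes = [], []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to code 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in block)
    return scales, codes


def dequantize_int4(scales, codes):
    """Reconstruct approximate weights: each code times the scale of its block."""
    return [codes[i] * scales[i // BLOCK] for i in range(len(codes))]


# Two blocks of synthetic weights in [-1, 1].
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
scales, codes = quantize_int4(weights)
restored = dequantize_int4(scales, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Under this scheme the reconstruction error of any weight is at most half the scale of its block, which is why smaller blocks generally track the weight distribution more closely at the cost of storing more scales.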

This is, to our knowledge, the first Gemma 4 conversion for Core AI. The gemma4_unified architecture (interleaved sliding/global attention with K==V global MQA, proportional partial RoPE, encoder-free multimodality) is not yet covered by Apple's coreai-models recipes; the conversion recipe used here is a custom extension of that toolkit.

Bundle contents

  • gemma_4_12b_it_mm_4bit.aimodel: multi-function Core AI asset (~6.4 GB)
  • tokenizer/: HF tokenizer + chat template
  • metadata.json: bundle metadata (0.2 schema) + multimodal constants

Functions

| Function | Inputs | Output | Purpose |
| --- | --- | --- | --- |
| main | input_ids, position_ids (+ keyCache/valueCache states) | logits | text prefill + decode (causal, built-in sliding window) |
| prefill_multimodal | input_ids, mm_embeds, mm_mask, position_ids, mask_sliding, mask_global (+ KV states) | logits | multimodal prefill; text embedding lookup happens in-graph, multimodal embeddings spliced via mm_mask |
| embed_vision | pixel_values [1,P,6912], image_position_ids [1,P,2] | embeds [1,P,3840] | encoder-free image/video-frame embedder (48×48×3 merged patches) |
| embed_audio | input_features [1,T,640] | embeds [1,T,3840] | raw 16 kHz audio frames (640 samples per token) |
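
As an illustration of the embed_audio input shape, the following sketch frames a mono 16 kHz waveform into T rows of 640 samples, so each audio token covers 40 ms. Zero-padding the final partial frame and the helper name `frame_audio` are assumptions; the HF processor may normalize or pad differently.

```python
SAMPLE_RATE = 16_000      # Hz, from the embed_audio description
SAMPLES_PER_TOKEN = 640   # one audio token covers 640 samples (40 ms at 16 kHz)


def frame_audio(samples):
    """Split a mono waveform into [T, 640] frames, zero-padding the tail (assumed)."""
    frames = []
    for start in range(0, len(samples), SAMPLES_PER_TOKEN):
        frame = list(samples[start:start + SAMPLES_PER_TOKEN])
        # Pad the last frame so every row has exactly 640 samples.
        frame.extend([0.0] * (SAMPLES_PER_TOKEN - len(frame)))
        frames.append(frame)
    return frames


# One second of silence yields 16000 / 640 = 25 audio tokens.
one_second = [0.0] * SAMPLE_RATE
frames = frame_audio(one_second)
```

The resulting list has shape [T, 640]; adding a leading batch dimension of 1 would give the [1,T,640] layout listed for input_features.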

Attention masks are boolean (True = attend). A provided mask fully overrides the built-in causal/sliding-window behavior, so the multimodal prefill masks must themselves encode causality, the 1024-token sliding window (for mask_sliding), and bidirectional attention within each image/video token block (mm_token_type_ids 1/2), mirroring HF Gemma4UnifiedModel.
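
The mask rules above can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes that each contiguous run of type 1 or type 2 tokens forms one bidirectional block, that the window admits keys with q - k < 1024, and that the bidirectional exception applies to both masks. The function names are hypothetical, and the real inputs may need extra batch or head dimensions.

```python
def block_ids(token_types):
    """Give each contiguous run of image (1) or video (2) tokens its own id; -1 elsewhere."""
    ids, current, prev = [], -1, 0
    for t in token_types:
        if t in (1, 2):
            if t != prev:
                current += 1  # a new multimodal block starts here
            ids.append(current)
        else:
            ids.append(-1)
        prev = t
    return ids


def build_masks(token_types, window=1024):
    """Return (mask_sliding, mask_global) as square boolean matrices, True = attend."""
    n = len(token_types)
    blocks = block_ids(token_types)
    mask_sliding = [[False] * n for _ in range(n)]
    mask_global = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            same_block = blocks[q] != -1 and blocks[q] == blocks[k]
            causal = k <= q
            mask_global[q][k] = causal or same_block
            mask_sliding[q][k] = (causal and q - k < window) or same_block
    return mask_sliding, mask_global


# Tiny example: one text token, a three-token image block, two text tokens.
token_types = [0, 1, 1, 1, 0, 0]
mask_sliding, mask_global = build_masks(token_types, window=2)
```

In the example, position 1 may attend to position 3 because both sit in the same image block, while text position 0 cannot look ahead at position 1.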

Multimodal token constants (also in metadata.json): image 258880, audio 258881, video 258884, BOI 255999, EOI 258882; stop tokens 1, 106 (<end_of_turn>).
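
A short sketch of how these constants might be used on the client side. The token ids come from the list above; the helper names, the BOI + placeholders + EOI layout, and the rule that mm_mask is True exactly at image, audio, and video placeholder positions are assumptions for illustration and should be checked against the HF processor.

```python
IMAGE_TOKEN, AUDIO_TOKEN, VIDEO_TOKEN = 258880, 258881, 258884
BOI_TOKEN, EOI_TOKEN = 255999, 258882
STOP_TOKENS = {1, 106}  # 106 is <end_of_turn>


def expand_image_placeholder(input_ids, num_patches):
    """Replace each single image token with BOI, num_patches image tokens, EOI (assumed layout)."""
    out = []
    for tok in input_ids:
        if tok == IMAGE_TOKEN:
            out.append(BOI_TOKEN)
            out.extend([IMAGE_TOKEN] * num_patches)
            out.append(EOI_TOKEN)
        else:
            out.append(tok)
    return out


def make_mm_mask(input_ids):
    """True where a multimodal embedding would replace the text embedding (assumed semantics)."""
    placeholders = {IMAGE_TOKEN, AUDIO_TOKEN, VIDEO_TOKEN}
    return [tok in placeholders for tok in input_ids]


def is_stop(token_id):
    """Decoding would normally end once a stop token is sampled."""
    return token_id in STOP_TOKENS


# Hypothetical prompt ids with one image placeholder, expanded for a 4-patch image.
prompt = [2, 500, IMAGE_TOKEN, 501]
expanded = expand_image_placeholder(prompt, num_patches=4)
mm_mask = make_mm_mask(expanded)
```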

Performance (M-series Mac, GPU)

  • Decode: ~39–46 tok/s · Prefill: ~290–313 tok/s · Warm load: ~2 s
  • Text quality verified with greedy and sampled decoding; multimodal pipeline verified token-exact against the HF reference implementation (bf16 eager).
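
For a rough sense of what these rates mean end to end, the sketch below adds warm-load time, prefill time, and decode time for a hypothetical request of 1000 prompt tokens and 256 generated tokens. It assumes throughput stays constant over the request, which real runs will only approximate.

```python
def estimate_seconds(prompt_tokens, new_tokens, prefill_tps, decode_tps, load_s=2.0):
    """Warm load + prefill + decode, assuming constant throughput (a simplification)."""
    return load_s + prompt_tokens / prefill_tps + new_tokens / decode_tps


# Upper and lower ends of the reported ranges.
best = estimate_seconds(1000, 256, prefill_tps=313, decode_tps=46)
worst = estimate_seconds(1000, 256, prefill_tps=290, decode_tps=39)
```

With these inputs the estimate comes out at roughly 10.8 to 12.0 seconds, most of it spent in decode.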

Usage

Drive the model with the CoreAILanguageModels Swift package from apple/coreai-models (the text path works out of the box via main), or with CoreAI.framework directly. The multimodal path requires client-side preprocessing that mirrors the HF Gemma4UnifiedProcessor: aspect-preserving resize to dimensions divisible by 48, 16 px patchify, 3×3 patch merge into 6912-dim patches with XY position ids, audio framing into 640-sample tokens, placeholder-token expansion, and the boolean masks described above.
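
The image geometry in that preprocessing list can be sketched as below: 16 px patches merged 3×3 give 48 px cells, and each cell flattens to 48 × 48 × 3 = 6912 values. The `max_side` budget, the round-to-nearest policy, and the row-major (x, y) ordering are hypothetical choices for illustration; the HF processor defines the actual values.

```python
PATCH = 16                    # px per base patch
MERGE = 3                     # 3x3 base patches merged into one token
CELL = PATCH * MERGE          # 48 px per merged patch side
PATCH_DIM = CELL * CELL * 3   # 6912 values per merged RGB patch


def resize_dims(width, height, max_side=768):
    """Aspect-preserving target size snapped to multiples of 48 (max_side is hypothetical)."""
    scale = min(1.0, max_side / max(width, height))
    new_w = max(CELL, round(width * scale / CELL) * CELL)
    new_h = max(CELL, round(height * scale / CELL) * CELL)
    return new_w, new_h


def position_ids(width, height):
    """(x, y) grid coordinates of each merged patch, row-major (ordering assumed)."""
    cols, rows = width // CELL, height // CELL
    return [(x, y) for y in range(rows) for x in range(cols)]


# A 1920x1080 frame under the assumed 768 px budget.
w, h = resize_dims(1920, 1080)
ids = position_ids(w, h)
num_patches = len(ids)  # this is P in pixel_values [1,P,6912]
```

Here the frame is resized to 768×432, which yields a 16×9 grid and therefore P = 144 image tokens to expand the placeholder into.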

Provenance & license

  • Source weights: google/gemma-4-12B-it, Apache 2.0.
  • Modifications: conversion to Core AI .aimodel via torch.export + coreai-torch; INT4 block-32 weight-only quantization of the decoder (embedders kept bf16); multi-function graph packaging described above.
  • This distribution remains under Apache 2.0. See LICENSE and NOTICE.