FunctionGemma 270M — CoreML, 8-bit Palettized

A Core ML export of google/functiongemma-270m-it, optimized for the Apple Neural Engine on iOS 18 / macOS 15. The 18-layer transformer is reshaped into Apple's BC1S layout: activations are (B, C, 1, T) tensors with channels on axis 1 and the sequence on the last axis, projections are 1×1 Conv2d, and attention is split per head. The K/V cache lives in MLState slots, so token-by-token decode passes no cache tensors between the model and the host.
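To make the layout concrete, here is a minimal NumPy sketch (not the export code itself) showing that a 1×1 convolution over a (B, C, 1, T) tensor computes the same values as an ordinary linear projection over (B, T, C), and how a per-head split falls out of slicing the channel axis. The hidden size 640, 4 query heads, and head_dim 256 come from this card; the sequence length and random weights are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, C_in = 1, 4, 640            # hidden 640 from the card; T = 4 is a toy length
n_heads, head_dim = 4, 256        # query heads and head_dim from the card
C_out = n_heads * head_dim

x_btc = rng.standard_normal((B, T, C_in)).astype(np.float32)   # usual (batch, time, channels)
w = rng.standard_normal((C_out, C_in)).astype(np.float32)      # projection weight

# Ordinary linear projection on (B, T, C).
y_linear = x_btc @ w.T                                          # (B, T, C_out)

# Same activations in (B, C, 1, T): channels on axis 1, sequence on the last axis.
x_bc1t = x_btc.transpose(0, 2, 1)[:, :, None, :]                # (B, C_in, 1, T)

# A 1x1 Conv2d is a matmul over the channel axis at every position.
y_conv = np.einsum("oc,bcht->boht", w, x_bc1t)                  # (B, C_out, 1, T)

# Per-head split: slice the channel axis into one chunk per head.
heads = np.split(y_conv, n_heads, axis=1)                       # 4 x (B, 256, 1, T)

# Converting back to (B, T, C) recovers the linear layer's output.
y_back = y_conv[:, :, 0, :].transpose(0, 2, 1)
assert np.allclose(y_linear, y_back, atol=1e-3)
```

The two forms are numerically equivalent; the layout only changes which axis the hardware treats as channels.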

Weights are palettized to 8 bits with k-means and lowered to constexpr_lut_to_dense ops, which the Neural Engine reads directly without runtime dequantization.
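As an illustration of what 8-bit palettization means, the sketch below fits a 256-entry lookup table to a weight tensor with a plain 1-D k-means and stores one `uint8` index per weight. It is a toy under stated assumptions (quantile initialization, 10 Lloyd iterations, random weights), not the procedure used to produce this export; the function name `palettize` is made up for the example.

```python
import numpy as np

def palettize(weights, n_bits=8, n_iters=10):
    """Fit a per-tensor LUT of 2**n_bits centroids; return (lut, uint8 indices)."""
    flat = weights.ravel().astype(np.float32)
    k = 2 ** n_bits
    # Start from evenly spaced quantiles of the weight distribution.
    lut = np.quantile(flat, np.linspace(0.0, 1.0, k)).astype(np.float32)
    for _ in range(n_iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = flat[idx == c]
            if members.size:
                lut[c] = members.mean()
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.astype(np.uint8).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)   # toy weight matrix
lut, indices = palettize(w)

# "LUT to dense": reconstruction is a single table lookup per weight.
w_hat = lut[indices]
mean_abs_err = float(np.abs(w - w_hat).mean())
```

Storage drops to one byte per weight plus a 256-entry table per tensor, and reconstruction is a gather, which is the operation a LUT-to-dense op describes.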

Model


Property	Value
Parameters	270M
Architecture	Gemma 3 (18 layers, 4 query heads, 1 KV head, head_dim 256, hidden 640, MLP 2048)
Quantization	8-bit k-means palettization (per-tensor codebook)
Format	Core ML `.mlmodelc` (ML Program, fp16 compute)
Cache layout	BC1S `MLState`, fixed cache length 128
Shapes	T_q ∈ {1, 128} via `EnumeratedShapes`
File size	257 MB model + 33 MB tokenizer ≈ 289 MB total
Min target	iOS 18 / macOS 15
Compute units	`cpuAndNeuralEngine` (required — CPU-only emulation diverges)
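The figures in the table are enough for a back-of-the-envelope estimate of the K/V state held in `MLState`. The sketch assumes every layer keeps one key and one value tensor of the full cache length in fp16; neither assumption is stated on this card, so treat the result as an order-of-magnitude check.

```python
# Values from the table above.
layers, kv_heads, head_dim, cache_len = 18, 1, 256, 128

bytes_per_value = 2   # assumption: fp16 state tensors
kv_tensors = 2        # assumption: one key and one value tensor per layer

cache_bytes = layers * kv_tensors * kv_heads * head_dim * cache_len * bytes_per_value
cache_mib = cache_bytes / 2**20
print(f"{cache_bytes} bytes = {cache_mib:.2f} MiB")   # 2359296 bytes = 2.25 MiB
```

Under those assumptions the whole cache is a couple of megabytes, small next to the 257 MB of weights.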

Files

File	Size	Description
`FunctionGemmaANEUnifiedStateful.mlmodelc/`	257 MB	Compiled Core ML model. Load with `MLModel(contentsOf:)`.
`config.json`	~2 KB	Architecture metadata (state names, input/output names, deployment target).
`chat_template.jinja`	~1 KB	Jinja chat template used by `tokenizer.apply_chat_template`.
`tokenizer.json`	~33 MB	Hugging Face `tokenizers` fast SentencePiece model.
`tokenizer_config.json`	~1 KB	Tokenizer settings.

Performance

Measured on an Apple M-series Mac with cpuAndNeuralEngine, warmed, on the canonical "Convert 23 USD to EUR" tool-call prompt (91-token prompt → 31-token function call).

Metric	Value
Prefill (128 tokens)	5.5 ms
Decode	3.98 ms/token (252 tok/s)
End-to-end (32 tokens)	~130 ms
Swift peak RSS (warm)	~37 MB private + ~510 MB mmap'd from disk (evictable)
Compute-plan device	96 %+ of ops prefer `neuralEngine`
Output parity vs fp16	Byte-identical on the tool-call grammar
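The latency figures can be cross-checked with simple arithmetic. The sketch below derives throughput from the per-token decode latency and estimates end-to-end time as one prefill plus N decode steps. Whether the 32-token run takes 31 or 32 decode steps after prefill is an assumption, so both are shown.

```python
prefill_ms = 5.5             # from the table above
decode_ms_per_token = 3.98   # from the table above

tokens_per_second = 1000.0 / decode_ms_per_token

def end_to_end_ms(decode_steps):
    """One prefill pass plus `decode_steps` single-token decode passes."""
    return prefill_ms + decode_steps * decode_ms_per_token

print(round(tokens_per_second, 1))    # 251.3
print(round(end_to_end_ms(31), 1))    # 128.9
print(round(end_to_end_ms(32), 1))    # 132.9
```

Both estimates bracket the reported ~130 ms, and the computed throughput sits close to the reported 252 tok/s; the small gap is presumably rounding in the published per-token latency.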

Function-call quality on a diverse 7-prompt validation suite: the model produces syntactically valid <start_function_call> output on all 7 cases and matches the fp16/fp32 reference on 5 of 7. The 2 divergences are stylistic: it picks "fr" over "french" and rolls the year into the query string.

Usage

Swift (iOS 18 / macOS 15)

import CoreML

let url = URL(fileURLWithPath: "FunctionGemmaANEUnifiedStateful.mlmodelc")
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let model = try MLModel(contentsOf: url, configuration: config)
let state = model.makeState()

// Build prefill inputs (input_ids, cos/sin tables, attention mask,
// write_mask=ones, logits_mask one-hot at the last prompt position),
// then for decode call repeatedly with T_q=1 inputs and a one-hot
// write_mask at the current cache slot.
let output = try await model.prediction(from: prefillInputs, using: state)
let logits = output.featureValue(for: "logits")!.multiArrayValue!

The full prefill + decode driver is published as part of the speech-swift SDK.
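The mask inputs described in the comments above can be sketched as follows, in Python for brevity. The tensor shapes, dtypes, right-padding, and the pad id are all assumptions made for illustration; the authoritative input names and shapes are in `config.json`, and the helper names `pad_prompt` and `one_hot` are hypothetical.

```python
import numpy as np

CACHE_LEN = 128     # fixed cache length from the model card
PREFILL_LEN = 128   # the T_q = 128 enumerated shape

def pad_prompt(token_ids, pad_id=0):
    """Right-pad a prompt to the prefill length (pad_id and padding side are assumptions)."""
    if len(token_ids) > PREFILL_LEN:
        raise ValueError("prompt is longer than the prefill shape")
    padded = np.full(PREFILL_LEN, pad_id, dtype=np.int32)
    padded[: len(token_ids)] = token_ids
    return padded

def one_hot(index, length):
    """Vector of zeros with a single 1.0 at `index`."""
    mask = np.zeros(length, dtype=np.float16)
    mask[index] = 1.0
    return mask

prompt_len = 91                                          # as in the benchmark prompt
input_ids = pad_prompt(list(range(1, prompt_len + 1)))   # dummy token ids

# Prefill: write every cache slot, read logits at the last prompt position.
prefill_write_mask = np.ones(CACHE_LEN, dtype=np.float16)
prefill_logits_mask = one_hot(prompt_len - 1, PREFILL_LEN)

# Decode (T_q = 1): write only the slot right after the prompt, then advance by one per step.
decode_write_mask = one_hot(prompt_len, CACHE_LEN)
```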

Python (coremltools, macOS only)

import coremltools as ct
import numpy as np

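# ct.models.MLModel loads an uncompiled .mlpackage, which is not in the file
# list above; the compiled .mlmodelc would need ct.models.CompiledMLModel instead.
model = ct.models.MLModel(
    "FunctionGemmaANEUnifiedStateful.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)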
state = model.make_state()
# Build inputs as described above, then:
out = model.predict(prefill_inputs, state=state)
next_id = int(out["logits"][0].argmax())
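A greedy decode loop around that call might look like the sketch below. `step_fn` is a hypothetical wrapper that would build the T_q = 1 inputs for one token at cache slot `pos`, call `model.predict(..., state=state)`, and return the logits. The stop conditions (an end-of-sequence id and the 128-slot cache) are assumptions about how a driver would bound generation, not the published driver's behavior.

```python
import numpy as np

def greedy_decode(step_fn, first_logits, start_pos, cache_len=128, eos_id=None, max_new=32):
    """Pick the argmax token each step until EOS, a full cache, or max_new tokens.

    step_fn(token_id, pos) -> logits is a hypothetical single-token predict wrapper.
    """
    tokens = []
    logits = first_logits
    pos = start_pos                      # next free cache slot
    while len(tokens) < max_new:
        next_id = int(np.argmax(logits))
        tokens.append(next_id)
        if next_id == eos_id or pos >= cache_len:
            break                        # finished, or no slot left to write into
        logits = step_fn(next_id, pos)
        pos += 1
    return tokens

# Toy stand-in for the model: always predicts (token + 1) mod 5.
def fake_step(token_id, pos):
    logits = np.zeros(5)
    logits[(token_id + 1) % 5] = 1.0
    return logits

first = np.zeros(5)
first[1] = 1.0
demo_tokens = greedy_decode(fake_step, first, start_pos=91, eos_id=4)
```

With the toy `fake_step`, generation starts at token 1 and stops when it reaches the assumed end id 4.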

Source

Upstream model: google/functiongemma-270m-it — Gemma 3 270M instruction-tuned for structured function calls.
