FunctionGemma 270M — CoreML, 8-bit Palettized
A Core ML export of google/functiongemma-270m-it,
optimized for the Apple Neural Engine on iOS 18 / macOS 15. The 18-layer
transformer is reshaped into Apple's BC1S layout ((B, C, 1, T):
channels on axis 1 and the sequence on the last axis, with 1×1 Conv2d
projections and per-head split attention), and the K/V cache lives in
MLState slots, so token-by-token decode never copies cache tensors to
or from the host.
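To make the layout concrete, here is a minimal NumPy sketch of the reshaping idea: a linear projection on (B, T, C) activations gives the same result as a 1×1 convolution on (B, C, 1, T) activations. The hidden size 640 comes from the model table; the sequence length, output width, and variable names are illustrative assumptions, and this is not the export code itself.

```python
import numpy as np

# hidden=640 is from the model table; T and C_out are illustrative assumptions.
B, T, C_in, C_out = 1, 4, 640, 8

rng = np.random.default_rng(0)
x_btc = rng.standard_normal((B, T, C_in)).astype(np.float32)  # usual (batch, seq, channels)
w = rng.standard_normal((C_out, C_in)).astype(np.float32)     # linear weight

# Reference: ordinary linear projection over the channel axis.
y_ref = x_btc @ w.T                                           # (B, T, C_out)

# BC1S: move channels to axis 1 and add a singleton axis -> (B, C, 1, T).
x_bc1s = x_btc.transpose(0, 2, 1)[:, :, None, :]

# A 1x1 Conv2d kernel has shape (C_out, C_in, 1, 1); with a 1x1 window it
# reduces to a matrix multiply over the channel axis at every position.
kernel = w[:, :, None, None]
y_bc1s = np.einsum("oc,bcht->boht", kernel[:, :, 0, 0], x_bc1s)  # (B, C_out, 1, T)

# Convert back to (B, T, C_out) to compare with the reference.
y_back = y_bc1s[:, :, 0, :].transpose(0, 2, 1)
print(x_bc1s.shape, np.allclose(y_back, y_ref, atol=1e-3))
```

The two paths agree up to floating-point rounding, which is why the projections can be swapped for convolutions without retraining.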
Weights are compressed with 8-bit k-means palettization and lowered to
constexpr_lut_to_dense ops, which the Neural Engine reads directly
without a runtime dequantization pass.
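As an illustration of what 8-bit palettization stores, the sketch below clusters the values of one tensor into a 256-entry lookup table and keeps one uint8 index per weight. It is a toy 1-D k-means under assumed settings (quantile initialisation, 10 iterations, a hypothetical `palettize_8bit` helper), not the pipeline used for this export; coremltools offers its own implementation through `coremltools.optimize.coreml.palettize_weights`.

```python
import numpy as np

def palettize_8bit(weights, iters=10):
    """Toy per-tensor k-means: returns (lut with 256 floats, uint8 indices)."""
    flat = weights.reshape(-1).astype(np.float32)
    # Initialise the 256 centroids on evenly spaced quantiles of the data.
    lut = np.quantile(flat, np.linspace(0.0, 1.0, 256)).astype(np.float32)
    for _ in range(iters):
        # Assignment step: nearest centroid for every weight.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        # Update step: each centroid moves to the mean of its members.
        for k in range(256):
            members = flat[idx == k]
            if members.size:
                lut[k] = members.mean()
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1).astype(np.uint8)
    return lut, idx.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)

lut, idx = palettize_8bit(w)
w_hat = lut[idx]  # the dense tensor a LUT-to-dense op would reconstruct
mean_abs_err = float(np.abs(w - w_hat).mean())
print(idx.dtype, lut.shape, mean_abs_err)
```

Storage per tensor drops to one byte per weight plus the 256-entry table, compared with two bytes per weight in fp16.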
Model
| Property | Value |
|---|---|
| Parameters | 270M |
| Architecture | Gemma 3 (18 layers, 4 query heads, 1 KV head, head_dim 256, hidden 640, MLP 2048) |
| Quantization | 8-bit k-means palettization (per-tensor codebook) |
| Format | Core ML .mlmodelc (ML Program, fp16 compute) |
| Cache layout | BC1S MLState, fixed cache length 128 |
| Shapes | T_q ∈ {1, 128} via EnumeratedShapes |
| File size | 257 MB model + 33 MB tokenizer ≈ 289 MB total |
| Min target | iOS 18 / macOS 15 |
| Compute units | cpuAndNeuralEngine (required — CPU-only emulation diverges) |
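The figures in the table allow a rough estimate of the K/V state footprint. The calculation below assumes that every layer keeps full-length K and V buffers and that the state is stored in fp16; neither assumption is confirmed by the table, so treat the result as a back-of-the-envelope number.

```python
# Values from the model table.
layers, kv_heads, head_dim, cache_len = 18, 1, 256, 128
bytes_per_value = 2  # assumption: fp16 state buffers

# One K and one V buffer per layer.
kv_values = layers * 2 * kv_heads * head_dim * cache_len
kv_bytes = kv_values * bytes_per_value
kv_mib = kv_bytes / 2**20

print(kv_values, kv_bytes, kv_mib)  # 1179648 2359296 2.25
```

Under these assumptions the whole cache is about 2.25 MiB, small enough that keeping it resident in MLState is cheap.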
Files
| File | Size | Description |
|---|---|---|
| FunctionGemmaANEUnifiedStateful.mlmodelc/ | 257 MB | Compiled Core ML model. Load with MLModel(contentsOf:). |
| config.json | ~2 KB | Architecture metadata (state names, input/output names, deployment target). |
| chat_template.jinja | ~1 KB | Jinja chat template used by tokenizer.apply_chat_template. |
| tokenizer.json | ~33 MB | Hugging Face tokenizers fast SentencePiece model. |
| tokenizer_config.json | ~1 KB | Tokenizer settings. |
Performance
Measured on an Apple M-series Mac via cpuAndNeuralEngine, on the canonical
"Convert 23 USD to EUR" tool-call prompt (91-token prompt → 31-token
function call), after warm-up.
| Metric | Value |
|---|---|
| Prefill (128 tokens) | 5.5 ms |
| Decode | 3.98 ms/token (252 tok/s) |
| End-to-end (32 tokens) | ~130 ms |
| Swift peak RSS (warm) | ~37 MB private + ~510 MB mmap'd from disk (evictable) |
| Compute-plan device | 96 %+ of ops prefer neuralEngine |
| Output parity vs fp16 | Byte-identical on the tool-call grammar |
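The throughput and end-to-end rows can be cross-checked from the per-step timings. The sketch below assumes one prefill pass followed by 31 decode steps (the length of the function call quoted above) and ignores tokenizer and host overhead, so it is a sanity check rather than a measurement.

```python
prefill_ms = 5.5            # one 128-token prefill pass
decode_ms_per_token = 3.98  # one T_q=1 decode step
generated_tokens = 31       # assumption: one decode step per generated token

tokens_per_second = 1000.0 / decode_ms_per_token
end_to_end_ms = prefill_ms + generated_tokens * decode_ms_per_token

print(round(tokens_per_second, 2), round(end_to_end_ms, 2))  # 251.26 128.88
```

Both results sit within rounding of the 252 tok/s and ~130 ms figures in the table.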
Function-call quality on a diverse 7-prompt validation suite: the model
produces syntactically valid <start_function_call> output on all 7 cases
and matches the fp16/fp32 reference on 5/7 (the 2 divergences are
stylistic: it picks "fr" over "french" and rolls the year into the query
string).
Usage
Swift (iOS 18 / macOS 15)
import CoreML
let url = URL(fileURLWithPath: "FunctionGemmaANEUnifiedStateful.mlmodelc")
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let model = try MLModel(contentsOf: url, configuration: config)
let state = model.makeState()
// Build prefill inputs (input_ids, cos/sin tables, attention mask,
// write_mask=ones, logits_mask one-hot at the last prompt position),
// then for decode call repeatedly with T_q=1 inputs and a one-hot
// write_mask at the current cache slot.
let output = try await model.prediction(from: prefillInputs, using: state)
let logits = output.featureValue(for: "logits")!.multiArrayValue!
The full prefill + decode driver is published as part of the speech-swift SDK.
Python (coremltools, macOS only)
import coremltools as ct
import numpy as np
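# The repository ships a compiled .mlmodelc, which coremltools loads
# through CompiledMLModel rather than MLModel.
model = ct.models.CompiledMLModel(
    "FunctionGemmaANEUnifiedStateful.mlmodelc",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)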
state = model.make_state()
# Build inputs as described above, then:
out = model.predict(prefill_inputs, state=state)
next_id = int(out["logits"][0].argmax())
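The mask handling described in the comments can be sketched as follows. The mask shapes (flat vectors of length 128), the fp16 dtype, and the `step` callable that wraps `model.predict` are all assumptions made for illustration; the real input names and shapes are recorded in config.json, and the full driver lives in the speech-swift SDK.

```python
import numpy as np

CACHE_LEN = 128  # fixed cache length from the model table

def prefill_masks(prompt_len):
    """write_mask of ones, logits_mask one-hot at the last prompt position."""
    write_mask = np.ones(CACHE_LEN, dtype=np.float16)
    logits_mask = np.zeros(CACHE_LEN, dtype=np.float16)
    logits_mask[prompt_len - 1] = 1.0
    return write_mask, logits_mask

def decode_write_mask(position):
    """One-hot write_mask selecting the cache slot for the current token."""
    write_mask = np.zeros(CACHE_LEN, dtype=np.float16)
    write_mask[position] = 1.0
    return write_mask

def greedy_decode(step, prompt_len, first_id, stop_id, max_new_tokens):
    """Greedy loop. `step(token_id, write_mask)` is a hypothetical callable
    that runs one T_q=1 prediction and returns a 1-D logits array."""
    ids = [first_id]       # first_id is the argmax of the prefill logits
    position = prompt_len  # next free cache slot
    while ids[-1] != stop_id and len(ids) < max_new_tokens and position < CACHE_LEN:
        logits = step(ids[-1], decode_write_mask(position))
        ids.append(int(np.argmax(logits)))
        position += 1
    return ids
```

The loop stops on the stop token, on the token budget, or when the fixed 128-slot cache is full, whichever comes first.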
Source
Upstream model: google/functiongemma-270m-it — Gemma 3 270M instruction-tuned for structured function calls.
Links
- speech-swift — Apple SDK
- soniqo.audio — website
- blog