FunctionGemma 270M Mobile Exports

FunctionGemma is a Gemma 3 270M variant trained for local function calling. It is intended to translate user text into structured tool calls, then optionally turn the tool result into a short user-facing response.

Setup

Accept the google/functiongemma-270m-it license on Hugging Face, then authenticate before running conversion:

export HF_TOKEN=hf_...
cd models/functiongemma/export
poetry env use /opt/homebrew/bin/python3.11
poetry install --with convert

LiteRT-LM Export

poetry run python convert.py --output-dir ./functiongemma-litert

The default export uses LiteRT Torch's dynamic_wi8_afp32 quantization recipe, prefill lengths of 128, 512, and 1024 tokens, and a 1024-token KV cache. For a larger mobile prompt budget:

poetry run python convert.py \
  --output-dir ./functiongemma-litert \
  --cache-length 2048 \
  --prefill-lengths 128,512,1024,2048

Use --quantize none only for debugging.

The default quantized bundle is about 283 MB for model.litertlm; LiteRT may also create a local XNNPACK cache file next to it.

Validate

poetry run pytest test_function_calls.py test_litert.py -q
poetry run python smoke_litert.py

CoreML Export

poetry run python convert_coreml.py \
  --output-dir ./functiongemma-coreml \
  --compute-precision float32 \
  --quantize int8

The CoreML artifact is a fixed 128-token last-logits model. It uses int8 weights with float32 compute because the float16 compute export produced NaN logits in local validation. Because this CoreML path recomputes the full context for each generated token, LiteRT-LM remains the preferred production path for tool-calling latency.
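To make the cost of full-context recompute concrete, here is a minimal Python sketch of greedy decoding against a fixed-window last-logits model. The `last_logits` callable, its `(padded_ids, length)` signature, and the `pad_id` default are assumptions for illustration; they are not the API of the exported .mlpackage or of this repo's test code.

```python
from typing import Callable, List, Sequence

WINDOW = 128  # fixed input length of the CoreML artifact, per the text above


def greedy_decode(
    last_logits: Callable[[Sequence[int], int], Sequence[float]],  # hypothetical wrapper
    prompt_ids: List[int],
    eos_id: int,
    pad_id: int = 0,  # assumed padding token id
    max_new_tokens: int = 16,
) -> List[int]:
    """Generate tokens by re-running the whole padded context at every step."""
    ids = list(prompt_ids)
    generated: List[int] = []
    while len(generated) < max_new_tokens and len(ids) < WINDOW:
        # No KV cache: pad to the fixed window and run the full sequence again.
        padded = ids + [pad_id] * (WINDOW - len(ids))
        logits = last_logits(padded, len(ids))
        # Greedy choice: index of the largest logit.
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        if next_id == eos_id:
            break
        ids.append(next_id)
        generated.append(next_id)
    return generated
```

In this sketch every step feeds all 128 positions to the model no matter how short the prompt is. That matches the description above of per-token recompute and is one plausible reason for the lower tok/s reported for CoreML in the benchmarks.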

poetry run pytest test_coreml.py -q

Validated local bundle:

functiongemma-coreml/FunctionGemmaLastLogits.mlpackage
functiongemma-coreml/config.json
tokenizer files in functiongemma-coreml/

Benchmarks

poetry run python benchmark.py --backend litert --runs 5 --warmup 1
poetry run python benchmark.py --backend coreml --coreml-compute-units cpu --runs 5 --warmup 1
poetry run python benchmark.py --backend coreml --coreml-compute-units cpu_and_ne --runs 5 --warmup 1

Local results on this machine:

Backend	Quantization	Load RSS Δ	Peak RSS Δ	Mean tok/s
LiteRT-LM CPU	dynamic int8	551.1 MB	865.3 MB	148.54
CoreML CPU	int8 weights, fp32 compute	658.0 MB	1690.4 MB	31.49
CoreML CPU+NE	int8 weights, fp32 compute	86.7 MB	1129.8 MB	32.82

Runtime Loop

The model should be used in two passes, with the app executing the tool call in between:

Build a prompt with format_tool_call_prompt(...) and stop on <end_function_call> or <start_function_response>.
Parse the returned call with parse_function_calls(...), validate it against an allowlist, and execute the tool.
Build a second prompt with format_final_response_prompt(...) and stop on <end_of_turn> to get the final user-facing answer.

For command-only actions, the app can skip the second pass and present its own deterministic UI response after the tool succeeds.
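The loop described above can be sketched in Python as follows. The helper signatures are assumptions: the repo's format_tool_call_prompt(...), parse_function_calls(...), and format_final_response_prompt(...) are passed in as callables because their real arguments are not shown here, and the `generate` callable and the `{"name", "arguments"}` call shape are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

CALL_STOPS = ["<end_function_call>", "<start_function_response>"]
FINAL_STOPS = ["<end_of_turn>"]

Call = Dict[str, Any]  # assumed shape: {"name": str, "arguments": dict}


def run_turn(
    user_text: str,
    *,
    generate: Callable[[str, List[str]], str],  # hypothetical: (prompt, stops) -> text
    format_call_prompt: Callable[[str], str],
    parse_calls: Callable[[str], List[Call]],
    format_final_prompt: Callable[[str, List[Tuple[Call, Any]]], str],
    tools: Dict[str, Callable[..., Any]],  # allowlist: tool name -> implementation
    skip_second_pass: bool = False,
) -> Optional[str]:
    # Pass 1: ask the model for a tool call.
    raw = generate(format_call_prompt(user_text), CALL_STOPS)
    calls = parse_calls(raw)
    if not calls:
        return raw  # assumption: with no call, treat the text as the answer

    # Validate every call against the allowlist before executing any of them.
    for call in calls:
        if call["name"] not in tools:
            raise PermissionError(f"tool not in allowlist: {call['name']}")
    results = [(call, tools[call["name"]](**call["arguments"])) for call in calls]

    if skip_second_pass:
        return None  # command-only action: the app shows its own UI response

    # Pass 2: turn the tool results into a user-facing answer.
    return generate(format_final_prompt(user_text, results), FINAL_STOPS)
```

Rejecting unknown tool names before running anything keeps a malformed or unexpected call from partially executing a batch of parallel calls.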

FunctionGemma is trained for single-turn and parallel tool calls. Do not rely on it for multi-step dependency chains without app-side orchestration or fine-tuning.

The LiteRT-LM Python runtime currently returns FunctionGemma calls as raw text, for example:

<start_function_call>call:get_current_weather{location:<escape>Tokyo<escape>}<end_function_call>

Use parse_function_calls(...) to parse this text, then validate and dispatch the call. After the tool response is sent back as a tool_response turn, the same exported model can produce the final user-facing answer.
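For readers who want to see the shape of that raw format, here is a minimal regex-based sketch that handles the example above. It is not the repo's parse_function_calls(...); the name `parse_calls_sketch` and the returned dictionary shape are assumptions, and it only extracts string arguments wrapped in <escape> markers.

```python
import re
from typing import Dict, List

# One call: <start_function_call>call:NAME{BODY}<end_function_call>
CALL_RE = re.compile(
    r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.DOTALL
)
# One string argument inside BODY: KEY:<escape>VALUE<escape>
ARG_RE = re.compile(r"(\w+):<escape>(.*?)<escape>", re.DOTALL)


def parse_calls_sketch(text: str) -> List[Dict[str, object]]:
    """Extract function calls and their <escape>-wrapped string arguments."""
    calls = []
    for name, body in CALL_RE.findall(text):
        arguments = {key: value for key, value in ARG_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


raw = (
    "<start_function_call>call:get_current_weather"
    "{location:<escape>Tokyo<escape>}<end_function_call>"
)
print(parse_calls_sketch(raw))
```

Numeric, boolean, or nested arguments are silently ignored by this sketch, so a production parser would need to cover those cases and reject malformed input explicitly.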

Mobile Artifacts

Ship these files:

functiongemma-litert/model.litertlm
functiongemma-litert/config.json

Do not ship local runtime caches such as model.litertlm.xnnpack_cache_*; they are regenerated by LiteRT.
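A packaging step can enforce this with an explicit allowlist of the two file names listed above. The following Python sketch assumes a flat export directory; the function name `files_to_ship` is hypothetical and not part of this repo.

```python
from pathlib import Path
from typing import List

# The two artifacts named in the list above.
SHIP_NAMES = {"model.litertlm", "config.json"}


def files_to_ship(export_dir: Path) -> List[Path]:
    """Return only the allowlisted artifacts, sorted by name."""
    return sorted(
        path
        for path in export_dir.iterdir()
        if path.is_file() and path.name in SHIP_NAMES
    )
```

Using an allowlist instead of excluding model.litertlm.xnnpack_cache_* by pattern means any other stray local file is left out of the bundle as well.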
