Gemma-4 Coder β€” tool-call recovery shim

A tiny, framework-free post-processor that recovers structured tool_calls from gemma-4 12B coder GGUFs served on llama.cpp.

Why you need it

llama.cpp --jinja does not recognise gemma-4's native tool-call format, so the bare parser under-reports tool calls. The model is fine β€” the parser is blind to the format. The model emits a call as text in content:

<|tool_call>call:get_weather{"city": "Paris", "units": "celsius"}

This shim re-parses that (and the leaked variants different builds produce β€” <|tool>NAME{…}, <call:NAME{…}>, NAME(k=v)<|/tool|>) back into the OpenAI shape your client expects. It never touches the weights and adds nothing beyond a regex scan.

Files

File What
gemma_tool_parse.py the pure-stdlib core: call detection, tolerant arg parsing, content cleanup, and the serve-side tools-in-prompt block. No dependencies.
standalone.py use it without any framework β€” call recovered_tool_calls(text) on a completion. Runnable demo.
litellm_shim.py a drop-in litellm CustomLogger callback (pre-call folds tool defs + stop-strings; post-call recovers tool_calls).

The algorithm (re-implement in any language)

  1. Scan the assistant content for the native marker <|tool_call> and the leaked text forms (<|tool>NAME{…}, <call:NAME{…}>, a bare NAME{…}/NAME(k=v) anchored on a tool marker). Names may be dotted (weather.get_weather).
  2. Extract per match the tool name and the arguments object (balanced-brace scan from the first {; tolerate trailing prose, fancy quotes, bare keys, = or :).
  3. Coerce scalar arg values ("3"β†’3, "true"β†’true) so the downstream tool validates; leave unknown fields untouched.
  4. Emit OpenAI tool_calls [{"id","type":"function","function":{"name","arguments":<json-string>}}], set finish_reason: "tool_calls", and strip the recovered markup from content.
  5. Pass through unchanged when no marker is found (plain answers are unaffected).

A second concern, also handled: the model was fine-tuned with tool definitions folded into the prompt, so it only behaves if it sees the same block at serve time. Call fold_tools_prompt(messages, tools) (in gemma_tool_parse.py) before sending.

Usage

Standalone (no litellm)

from standalone import recovered_tool_calls

tool_calls, content = recovered_tool_calls(assistant_text)

litellm

# config.yaml
litellm_settings:
  callbacks: ["litellm_shim.shim"]

Set GEMMA_SHIM_MODELS=my-deploy-name to gate the shim to specific deployments (default: all). litellm hook signatures vary across releases β€” litellm_shim.py is a reference adapter; adjust the method names to your version if needed.

Models that use this

Apache-2.0 β€” use it freely.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support