Chat template may re-inject prior-turn reasoning during multi-turn tool use → repetition loops
Summary
The chat_template shipped with this model may be vulnerable to a multi-turn
reasoning re-injection issue that can cause verbatim repetition loops during
agentic / tool-calling use (e.g. OpenCode, llama.cpp --jinja, any harness that sends
back reasoning_content on prior assistant tool-call steps).
It is harmless for single-turn chat — it only triggers in a multi-step tool-calling
sequence — which is why it does not show up in standard single-turn benchmarks.
Mechanism
The template renders each prior assistant turn's thinking back into the prompt:
{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
{{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
{%- endif -%}
The guard thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls')
restricts this to assistant tool-call steps that carry reasoning and occur after the final
user message, i.e. exactly the in-request agentic loop. Each new step is therefore prompted
with all of its own previous private thoughts re-injected as <|channel>thought blocks. As
the agentic context grows, the model is fed an accumulating echo of its own reasoning and
can collapse into a repetition loop.
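The accumulation can be illustrated with a minimal Python sketch that mirrors only the guard condition quoted above. It is not the real Jinja template: the helper names (last_user_index, reinjected_thoughts, agentic_history) and the message contents are hypothetical, chosen for illustration.

```python
# Simplified stand-in for the template guard; NOT the shipped Jinja template.
def last_user_index(messages):
    """Index of the final user message, or -1 if there is none."""
    idx = -1
    for i, msg in enumerate(messages):
        if msg.get("role") == "user":
            idx = i
    return idx


def reinjected_thoughts(messages):
    """Reasoning strings that the quoted guard would render back into the prompt."""
    last_user = last_user_index(messages)
    out = []
    for i, msg in enumerate(messages):
        thinking = msg.get("reasoning") or msg.get("reasoning_content")
        # Same three conditions as the guard: reasoning present,
        # position after the last user turn, and tool calls present.
        if thinking and i > last_user and msg.get("tool_calls"):
            out.append(thinking)
    return out


def agentic_history(steps):
    """One user request followed by `steps` tool-call / tool-result pairs (made-up data)."""
    messages = [{"role": "user", "content": "Build a small site."}]
    for n in range(steps):
        messages.append({
            "role": "assistant",
            "content": "",
            "reasoning_content": f"thought {n}",
            "tool_calls": [{"type": "function",
                            "function": {"name": "write_file", "arguments": {}}}],
        })
        messages.append({"role": "tool", "content": f"step {n} done"})
    return messages


counts = [len(reinjected_thoughts(agentic_history(n))) for n in range(5)]
print(counts)  # [0, 1, 2, 3, 4]: one re-injected block per completed tool-call step
```

Under these assumptions the number of replayed thought blocks grows linearly with the number of tool-call steps, and a new user message resets it to zero, which matches the multi-turn-only behavior described above.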
Scope
We detected this by static inspection of the chat_template across the Gemma 4 instruct
family and several third-party requants; the re-injection block is present in all of them.
We have dynamically reproduced and fixed the loop on one pruned-MoE derivative. On other
sizes and quants the same code path is present, but we have not measured loop severity
directly, hence "may be vulnerable."
Tests we ran (on our derivative)
Same engine / weights / seeds — only the rendered prompt varied:
| Condition | Multi-turn agentic loop rate | HumanEval+ (Q6_K) | MultiPL-E-100 (Q6_K) |
|---|---|---|---|
| Stock template (re-injection on) | 33% (4/12 seeds) | — | — |
| Fixed template (re-injection off) | 0% | 92.07% | 0.66 |
Single-turn code/instruction scores are unchanged by the fix (the re-injection path is
multi-turn-only), so the fix carries no quality cost.
Fix
Disable the historical-reasoning re-injection (keep the thinking channel only for the
current generation). The minimal change is to make that {%- if ... -%} guard never fire:
{%- if false and thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
A drop-in corrected template (validated as above) is here:
https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/chat_template.fixed.jinja
For GGUF files the template can be rewritten in place without re-quantizing, via gguf_new_metadata.py --chat-template-file chat_template.fixed.jinja (llama.cpp gguf-py).
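The minimal guard change described above could also be applied to a template string programmatically. The following is a sketch under assumptions: it expects the guard line to appear exactly as quoted in this post (a real template may differ in whitespace), and the function name disable_reinjection and the in-memory sample are hypothetical.

```python
# Guard line as quoted in this post; a real template may differ in whitespace.
GUARD = ("{%- if thinking_text and loop.index0 > ns_turn.last_user_idx "
         "and message.get('tool_calls') -%}")
# Same line with "false and" prepended to the condition so it never fires.
PATCHED = GUARD.replace("{%- if ", "{%- if false and ", 1)


def disable_reinjection(template_text: str) -> str:
    """Return template_text with the history-reasoning guard forced to false."""
    if PATCHED in template_text:
        return template_text  # already patched, nothing to do
    found = template_text.count(GUARD)
    if found != 1:
        raise ValueError(f"expected exactly one guard line, found {found}")
    return template_text.replace(GUARD, PATCHED)


# Tiny in-memory stand-in for a template file (hypothetical content).
sample = "\n".join([
    "{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}",
    GUARD,
    "{{- '<|channel>thought\\n' + thinking_text + '\\n<channel|>' -}}",
    "{%- endif -%}",
])
patched = disable_reinjection(sample)
print(PATCHED in patched)  # True
```

Refusing to patch unless exactly one guard line is found is a deliberate choice: if the template text has drifted from the quoted form, a silent no-op would be easy to mistake for a successful fix.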
A 1-minute unit test to check any repo / template
Runnable version: https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/template_loop_unittest.py
from transformers import AutoTokenizer

SENTINEL = "ZZ_HISTORY_THOUGHT_SENTINEL_ZZ"

# One user request, then a multi-step tool-calling sequence (real agentic shape).
conv = [
    {"role": "user", "content": "Build a small Three.js website with a rotating cube."},
    {"role": "assistant", "content": "", "reasoning_content": SENTINEL,
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
         "arguments": {"path": "index.html", "content": "<html></html>"}}}]},
    {"role": "tool", "content": "index.html written"},
    {"role": "assistant", "content": "", "reasoning_content": "Now wire up the JS.",
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
         "arguments": {"path": "main.js", "content": "//"}}}]},
    {"role": "tool", "content": "main.js written"},
]

tok = AutoTokenizer.from_pretrained("<this-repo-or-local-dir>")
rendered = tok.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
leaks = rendered.count(SENTINEL)
assert leaks == 0, f"MAY BE VULNERABLE: prior-turn reasoning re-injected {leaks}x as a thinking channel"
print("OK: no history-reasoning re-injection")
On the current template the assertion fails (the sentinel is re-injected); on the corrected
template the script prints OK.
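Because the guard only fires when a prior assistant message carries reasoning or reasoning_content, a harness that cannot change the template could in principle avoid the path by dropping those fields before rendering. This is a sketch of that idea only; it was not part of the validation above, and the helper name strip_prior_reasoning and the sample history are hypothetical.

```python
import copy

# The two keys the quoted template reads its thinking text from.
REASONING_KEYS = ("reasoning", "reasoning_content")


def strip_prior_reasoning(messages):
    """Return a deep copy of messages with reasoning fields removed from assistant turns."""
    cleaned = copy.deepcopy(messages)
    for msg in cleaned:
        if msg.get("role") == "assistant":
            for key in REASONING_KEYS:
                msg.pop(key, None)
    return cleaned


# Made-up history in the same shape as the unit test above.
history = [
    {"role": "user", "content": "Build a small Three.js website with a rotating cube."},
    {"role": "assistant", "content": "", "reasoning_content": "private thought",
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
         "arguments": {"path": "index.html", "content": "<html></html>"}}}]},
    {"role": "tool", "content": "index.html written"},
]
safe_history = strip_prior_reasoning(history)
print(any("reasoning_content" in m for m in safe_history))  # False
```

The copy keeps the caller's original history intact, so the harness can still log or display the reasoning while sending only the stripped version to the model.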
Hope this is useful — feel free to ignore if your harness never replays reasoning_content.
This is an automated message generated from a template audit across the Gemma 4 family
and its requants. It is shared for your awareness; no action is required on our part.
I'm trying to fix that with this PR, @ManniX-ITA. Can you check whether it fixes your scenario? https://huggingface.co/google/gemma-4-12B-it/discussions/35
@lucianommartins
Nope, it doesn't seem to fix it.
I did a quick re-run on the same automated test harness with opencode: with preserve_thinking = true it is only slightly better.
With preserve_thinking = false it is the same as the original template (expected, same behavior).
Is there a reason for re-emitting the prior-turn reasoning_content as <|channel>thought back into the multi-turn prompt?
I'm running a longer validation now but the first results are already telling.
Completed the validation:
Results (failure = verbatim loop or runaway):
- upstream main: 37.5% (9/24)
- PR#35 as shipped (preserve_thinking=true): 29.2% (7/24) — still loops
- PR#35 preserve_thinking=false: 37.5% (9/24) — identical to main, same failing seeds
- re-injection fully disabled: 0% (0/24)
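For reference, the percentages follow directly from the failure counts listed above. A short sketch of the arithmetic (the dictionary layout and names are illustrative only):

```python
# (failures, runs) per condition, copied from the list above.
results = {
    "upstream main": (9, 24),
    "PR#35 preserve_thinking=true": (7, 24),
    "PR#35 preserve_thinking=false": (9, 24),
    "re-injection fully disabled": (0, 24),
}

# Failure rate in percent, rounded to one decimal place.
rates = {name: round(100 * fails / runs, 1) for name, (fails, runs) in results.items()}

for name, rate in rates.items():
    fails, runs = results[name]
    print(f"{name}: {rate}% ({fails}/{runs})")
```

With 24 runs per condition, the gap between upstream main and PR#35 with preserve_thinking=true is two runs.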