Chat template may re-inject prior-turn reasoning during multi-turn tool use → repetition loops

#48
by ManniX-ITA - opened

Summary

The chat_template shipped with this model may be vulnerable to a multi-turn
reasoning re-injection issue that can cause verbatim repetition loops during
agentic / tool-calling use (e.g. OpenCode, llama.cpp --jinja, any harness that sends
back reasoning_content on prior assistant tool-call steps).

It is harmless for single-turn chat — it only triggers in a multi-step tool-calling
sequence — which is why it does not show up in standard single-turn benchmarks.

Mechanism

The template renders each prior assistant turn's thinking back into the prompt:

{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
    {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
{%- endif -%}

The guard loop.index0 > last_user_idx and message.tool_calls restricts this to the
assistant tool-call steps that occur after the final user message — i.e. exactly the
in-request agentic loop. Each new step is therefore prompted with all of its own previous
private thoughts re-injected as <|channel>thought blocks. As the agentic context grows,
the model is fed an accumulating echo of its own reasoning and can collapse into a
repetition loop.
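To make the guard's reach concrete, here is a minimal pure-Python sketch that mirrors the quoted condition outside of Jinja. The function name `reinjected_indices` is hypothetical, and the starting value of -1 for the last-user index is an assumption (the template's initialisation of `ns_turn.last_user_idx` is not shown above). It only reproduces the three tests in the guard, not the rest of the template.

```python
from typing import Any


def reinjected_indices(messages: list[dict[str, Any]]) -> list[int]:
    """Return the indices of messages whose reasoning the quoted guard would render."""
    # Position of the final user message. Starting at -1 is an assumption.
    last_user_idx = -1
    for i, message in enumerate(messages):
        if message.get("role") == "user":
            last_user_idx = i

    hits = []
    for i, message in enumerate(messages):
        # Same three tests as the Jinja guard: thinking text present,
        # message located after the final user turn, and tool calls present.
        thinking_text = message.get("reasoning") or message.get("reasoning_content")
        if thinking_text and i > last_user_idx and message.get("tool_calls"):
            hits.append(i)
    return hits


conv = [
    {"role": "user", "content": "hi"},
    # Before the final user message and no tool call: not re-injected.
    {"role": "assistant", "content": "hello", "reasoning_content": "r0"},
    {"role": "user", "content": "build it"},
    {"role": "assistant", "content": "", "reasoning_content": "r1",
     "tool_calls": [{"type": "function", "function": {"name": "write_file"}}]},
    {"role": "tool", "content": "ok"},
    {"role": "assistant", "content": "", "reasoning_content": "r2",
     "tool_calls": [{"type": "function", "function": {"name": "write_file"}}]},
    {"role": "tool", "content": "ok"},
]

print(reinjected_indices(conv))  # prints [3, 5]
```

Every tool-call step after the last user message is selected, so the number of re-injected thought blocks grows by one with each agentic step.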

Scope

Detected by static inspection of the chat_template across the Gemma 4 instruct family
and several third-party requants — the re-injection block is present in all of them. We
have dynamically reproduced and fixed the loop on one pruned-MoE derivative; on other
sizes/quants the same code path is present but we have not measured loop severity directly,
hence "may be vulnerable."
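A static check of this kind can be approximated with a short text scan. The sketch below is a hypothetical heuristic, not the audit tooling actually used: it assumes the guard is written with the same variable names as the block quoted under Mechanism (`thinking_text`, `last_user_idx`, `tool_calls`) and would miss templates that phrase the condition differently.

```python
import re

# Matches the condition of a Jinja {% if ... %} tag, with or without
# whitespace-control dashes. [^%] keeps the match inside a single tag.
IF_TAG_RE = re.compile(r"\{%-?\s*if\s+(?P<cond>[^%]*?)-?%\}")


def has_reinjection_guard(template: str) -> bool:
    """Heuristic: True if an active `if` tag tests thinking text,
    position after the last user turn, and tool calls together."""
    for match in IF_TAG_RE.finditer(template):
        cond = match.group("cond").strip()
        mentions_all = all(
            name in cond for name in ("thinking_text", "last_user_idx", "tool_calls")
        )
        # A guard that starts with `false and` can never fire.
        disabled = cond.startswith("false and")
        if mentions_all and not disabled:
            return True
    return False


stock = ("{%- if thinking_text and loop.index0 > ns_turn.last_user_idx "
         "and message.get('tool_calls') -%}")
fixed = ("{%- if false and thinking_text and loop.index0 > ns_turn.last_user_idx "
         "and message.get('tool_calls') -%}")

print(has_reinjection_guard(stock))  # prints True
print(has_reinjection_guard(fixed))  # prints False
```

A positive result only shows that the code path is present; it says nothing about how often a given model actually loops.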

Tests we ran (on our derivative)

Same engine / weights / seeds — only the rendered prompt varied:

| Condition | Multi-turn agentic loop rate | HumanEval+ (Q6_K) | MultiPL-E-100 (Q6_K) |
|---|---|---|---|
| Stock template (re-injection on) | 33% (4/12 seeds) | | |
| Fixed template (re-injection off) | 0% | 92.07% | 0.66 |

Single-turn code/instruction scores are unchanged by the fix (the re-injection path is
multi-turn-only), so the fix carries no quality cost.

Fix

Disable the historical-reasoning re-injection (keep the thinking channel only for the
current generation). The minimal change is to make that {%- if ... -%} guard never fire:

{%- if false and thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
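The same one-line change can be applied programmatically to a template string. This is a minimal sketch, assuming the guard appears exactly once and character-for-character as quoted in this post; the names `STOCK_GUARD`, `FIXED_GUARD`, and `disable_reinjection` are hypothetical. A template with different spacing would need the guard string adjusted.

```python
STOCK_GUARD = (
    "{%- if thinking_text and loop.index0 > ns_turn.last_user_idx "
    "and message.get('tool_calls') -%}"
)
FIXED_GUARD = (
    "{%- if false and thinking_text and loop.index0 > ns_turn.last_user_idx "
    "and message.get('tool_calls') -%}"
)


def disable_reinjection(template: str) -> str:
    """Return the template with the re-injection guard disabled.

    Raises ValueError unless the stock guard occurs exactly once, so an
    already-patched or differently formatted template is never silently
    passed through.
    """
    count = template.count(STOCK_GUARD)
    if count != 1:
        raise ValueError(f"expected exactly one stock guard, found {count}")
    return template.replace(STOCK_GUARD, FIXED_GUARD)


# Example on a fragment shaped like the block quoted under Mechanism.
fragment = (
    STOCK_GUARD
    + "\n    {{- '<|channel>thought\\n' + thinking_text + '\\n<channel|>' -}}"
    + "\n{%- endif -%}"
)
patched = disable_reinjection(fragment)
print(patched.splitlines()[0] == FIXED_GUARD)  # prints True
```

Failing loudly when the guard is missing is deliberate: a silent no-op would leave the template unpatched while appearing fixed.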

A drop-in corrected template (validated as above) is here:
https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/chat_template.fixed.jinja

For GGUF files the template can be replaced without re-quantizing: gguf_new_metadata.py
(llama.cpp gguf-py) takes an input and an output GGUF path plus
--chat-template-file chat_template.fixed.jinja, and writes a copy of the file with the new template.

A 1-minute unit test to check any repo / template

Runnable version: https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/template_loop_unittest.py

from transformers import AutoTokenizer

SENTINEL = "ZZ_HISTORY_THOUGHT_SENTINEL_ZZ"
# One user request, then a multi-step tool-calling sequence (real agentic shape).
conv = [
    {"role": "user", "content": "Build a small Three.js website with a rotating cube."},
    {"role": "assistant", "content": "", "reasoning_content": SENTINEL,
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
       "arguments": {"path": "index.html", "content": "<html></html>"}}}]},
    {"role": "tool", "content": "index.html written"},
    {"role": "assistant", "content": "", "reasoning_content": "Now wire up the JS.",
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
       "arguments": {"path": "main.js", "content": "//"}}}]},
    {"role": "tool", "content": "main.js written"},
]

tok = AutoTokenizer.from_pretrained("<this-repo-or-local-dir>")
rendered = tok.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
leaks = rendered.count(SENTINEL)
assert leaks == 0, f"MAY BE VULNERABLE: prior-turn reasoning re-injected {leaks}x as a thinking channel"
print("OK: no history-reasoning re-injection")

On the current template the assertion fails (the sentinel is re-injected into the rendered
prompt); on the corrected template the script prints OK.

Hope this is useful — feel free to ignore if your harness never replays reasoning_content.


This is an automated message generated from a template audit across the Gemma 4 family
and its requants. It is shared for your awareness; no action is required on your part.

Massive thanks! I've been hitting these repetition loops for weeks, and this fix for the Jinja template is the first one that actually works. This should definitely be included in the official chat template. Much appreciated!
