Chat template may re-inject prior-turn reasoning during multi-turn tool use → repetition loops

#38

by ManniX-ITA - opened 3 days ago

Summary

The chat_template shipped with this model may be vulnerable to a multi-turn
reasoning re-injection issue that can cause verbatim repetition loops during
agentic / tool-calling use (e.g. OpenCode, llama.cpp --jinja, any harness that sends
back reasoning_content on prior assistant tool-call steps).

It is harmless for single-turn chat — it only triggers in a multi-step tool-calling
sequence — which is why it does not show up in standard single-turn benchmarks.

Mechanism

The template renders each prior assistant turn's thinking back into the prompt:

{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
    {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
{%- endif -%}

The guard loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') restricts this to the
assistant tool-call steps that occur after the final user message, i.e. exactly the
in-request agentic loop. Each new step is therefore prompted with the reasoning from all of
its previous steps, re-injected as <|channel>thought blocks. As the agentic context grows,
the model is fed an accumulating echo of its own reasoning and can collapse into a
repetition loop.
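To make the guard's reach concrete, here is a minimal pure-Python sketch of the condition quoted above. It is not the template itself: reinjected_indices and agentic_conversation are hypothetical helper names, and the sketch assumes ns_turn.last_user_idx holds the index of the final user message, which matches the description here but is not shown in the quoted snippet.

```python
from typing import Any

Message = dict[str, Any]


def reinjected_indices(messages: list[Message]) -> list[int]:
    """Indices of messages whose reasoning the quoted guard would render again."""
    # Assumption: ns_turn.last_user_idx is the index of the final user message.
    last_user_idx = max(
        (i for i, m in enumerate(messages) if m.get("role") == "user"),
        default=-1,
    )
    hits = []
    for i, message in enumerate(messages):
        thinking_text = message.get("reasoning") or message.get("reasoning_content")
        # Same three conditions as the Jinja guard.
        if thinking_text and i > last_user_idx and message.get("tool_calls"):
            hits.append(i)
    return hits


def agentic_conversation(steps: int) -> list[Message]:
    """One user request followed by `steps` assistant tool-call / tool-result pairs."""
    conv: list[Message] = [{"role": "user", "content": "Build a site."}]
    for k in range(steps):
        conv.append({
            "role": "assistant",
            "content": "",
            "reasoning_content": f"thought {k}",
            "tool_calls": [{"type": "function",
                            "function": {"name": "write_file", "arguments": {}}}],
        })
        conv.append({"role": "tool", "content": "ok"})
    return conv


for steps in (1, 3, 5):
    print(steps, "tool steps ->", len(reinjected_indices(agentic_conversation(steps))),
          "re-injected thought blocks")
```

Under these assumptions, the number of re-injected blocks grows by one with every tool step, and a new user message resets it to zero because every earlier assistant turn then falls at or before last_user_idx.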

Scope

Detected by static inspection of the chat_template across the Gemma 4 instruct family
and several third-party requants — the re-injection block is present in all of them. We
have dynamically reproduced and fixed the loop on one pruned-MoE derivative; on other
sizes/quants the same code path is present but we have not measured loop severity directly,
hence "may be vulnerable."

Tests we ran (on our derivative)

Same engine / weights / seeds — only the rendered prompt varied:

Condition                         | Multi-turn agentic loop rate | HumanEval+ (Q6_K) | MultiPL-E-100 (Q6_K)
Stock template (re-injection on)  | 33% (4/12 seeds)             | not reported      | not reported
Fixed template (re-injection off) | 0%                           | 92.07%            | 0.66

Single-turn code/instruction scores are unchanged by the fix, because the re-injection path
is only reached in multi-turn tool-calling prompts, so the fix carries no single-turn quality cost.

Fix

Disable the historical-reasoning re-injection (keep the thinking channel only for the
current generation). The minimal change is to make that {%- if ... -%} guard never fire:

{%- if false and thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
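The one-line change above can also be applied programmatically. The following is a minimal sketch, assuming the guard appears in the template exactly as quoted in this post (same spacing and quoting); disable_reinjection is a hypothetical helper name, not part of any library.

```python
# The guard as quoted in this post; a template that formats it differently
# will not match and must be patched by hand.
GUARD = (
    "{%- if thinking_text and loop.index0 > ns_turn.last_user_idx "
    "and message.get('tool_calls') -%}"
)
PATCHED = GUARD.replace("{%- if thinking_text", "{%- if false and thinking_text", 1)


def disable_reinjection(template: str) -> str:
    """Return the template with the history re-injection guard forced to false."""
    if PATCHED in template and GUARD not in template:
        return template  # already patched
    if GUARD not in template:
        raise ValueError("guard not found: template differs from the quoted one")
    return template.replace(GUARD, PATCHED)


# Hypothetical usage (file names are placeholders):
#   src = open("chat_template.jinja", encoding="utf-8").read()
#   open("chat_template.fixed.jinja", "w", encoding="utf-8").write(disable_reinjection(src))
```

Raising on a missing guard is deliberate: a silent no-op would leave a template that looks patched but still re-injects. The sentinel test further down is the way to confirm the result.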

A drop-in corrected template (validated as above) is here:
https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/chat_template.fixed.jinja

For GGUF files the template can be replaced without re-quantizing, via
gguf_new_metadata.py --chat-template-file chat_template.fixed.jinja (llama.cpp gguf-py),
which writes a new GGUF file with the updated metadata.
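Where the template cannot be swapped, a harness-side alternative is to stop replaying reasoning in the first place. This is a sketch only and was not part of the validation in this post: strip_prior_reasoning is a hypothetical helper, and it assumes the harness holds the conversation as a list of OpenAI-style message dicts before sending the request.

```python
from typing import Any

# The two fields the template reads as thinking_text.
REASONING_KEYS = ("reasoning", "reasoning_content")


def strip_prior_reasoning(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of the history with reasoning fields removed from assistant turns."""
    cleaned = []
    for message in messages:
        if message.get("role") == "assistant":
            # Build a new dict so the caller's history is left untouched.
            message = {k: v for k, v in message.items() if k not in REASONING_KEYS}
        cleaned.append(message)
    return cleaned
```

With both fields absent, thinking_text is empty and the guard cannot fire, so the rendered prompt should match what the fixed template produces for the same history.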

A 1-minute unit test to check any repo / template

Runnable version: https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/template_loop_unittest.py

from transformers import AutoTokenizer

SENTINEL = "ZZ_HISTORY_THOUGHT_SENTINEL_ZZ"
# One user request, then a multi-step tool-calling sequence (real agentic shape).
conv = [
    {"role": "user", "content": "Build a small Three.js website with a rotating cube."},
    {"role": "assistant", "content": "", "reasoning_content": SENTINEL,
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
       "arguments": {"path": "index.html", "content": "<html></html>"}}}]},
    {"role": "tool", "content": "index.html written"},
    {"role": "assistant", "content": "", "reasoning_content": "Now wire up the JS.",
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
       "arguments": {"path": "main.js", "content": "//"}}}]},
    {"role": "tool", "content": "main.js written"},
]

tok = AutoTokenizer.from_pretrained("<this-repo-or-local-dir>")
rendered = tok.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
leaks = rendered.count(SENTINEL)
assert leaks == 0, f"MAY BE VULNERABLE: prior-turn reasoning re-injected {leaks}x as a thinking channel"
print("OK: no history-reasoning re-injection")

On the current template the assertion fails (the sentinel is re-injected); on the corrected
template the script prints OK.

Hope this is useful — feel free to ignore if your harness never replays reasoning_content.

This is an automated message generated from a template audit across the Gemma 4 family
and its requants. It is shared for your awareness; no action is required on your part.

lucianommartins

Google org 3 days ago

I'm trying to fix that with this PR, @ManniX-ITA. Can you check whether it fixes your scenario? https://huggingface.co/google/gemma-4-12B-it/discussions/35

ManniX-ITA

3 days ago

@lucianommartins
I'll have a look and let you know, thanks!

ManniX-ITA

2 days ago

@lucianommartins
Nope, it doesn't seem to fix it.
I did a quick re-run on the same automated test harness with opencode: with preserve_thinking = true it is only slightly better.
With preserve_thinking = false it behaves the same as the original template (expected, since the rendered prompt is the same).

Is there a reason for re-emitting the prior-turn reasoning_content as <|channel>thought back into the multi-turn prompt?

I'm running a longer validation now but the first results are already telling.

ManniX-ITA

2 days ago

Completed the validation:

Results (failure = verbatim loop or runaway):

upstream main: 37.5% (9/24)
PR#35 as shipped (preserve_thinking=true): 29.2% (7/24) — still loops
PR#35 preserve_thinking=false: 37.5% (9/24) — identical to main, same failing seeds
re-injection fully disabled: 0% (0/24)
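The percentages above can be recomputed from the raw counts. This is only a small arithmetic check; the labels are shortened from the list above and failure_rate is a hypothetical helper name.

```python
def failure_rate(failures: int, runs: int) -> float:
    """Failure rate in percent, rounded to one decimal place."""
    return round(100 * failures / runs, 1)


# (failures, runs) as reported in the list above.
results = {
    "upstream main": (9, 24),
    "PR#35 preserve_thinking=true": (7, 24),
    "PR#35 preserve_thinking=false": (9, 24),
    "re-injection fully disabled": (0, 24),
}

for name, (failures, runs) in results.items():
    print(f"{name}: {failure_rate(failures, runs)}% ({failures}/{runs})")
```

The gap between main and PR#35 with preserve_thinking=true is two seeds out of 24, while fully disabling re-injection removes all nine failures.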
