gemma-4-12B-it: deterministic "thought\n thought\n …" degenerate loop on long agent prompts (~60% reproducer, 4-bit, repros across temperatures)

#41

by Raullen - opened 8 days ago

Hi Gemma team — wanted to surface a reliable model-behavior reproducer in case it's useful for the next instruction-tune iteration. This is a heads-up / data report, not a bug report against any specific runtime.

Symptom. When gemma-4-12B-it receives a long agent-style system prompt (~7,800 tokens) with multiple tool definitions (10 tools, OpenAI function-calling JSON schemas — exec_command, write_stdin, update_plan, etc.) and a user message asking the model to use one of those tools, the model enters a degenerate generation loop in roughly 3 of 5 trials:

content (first 200 chars): 'thought\nthought\nthought\nthought\n…'  (× ~1000)
finish_reason: 'length'
tool_calls: []

The remaining 2/5 trials emit a clean tool call (exec_command({"cmd": "head -n 1 pyproject.toml"})) within ~1 s, so the model is clearly capable of solving the task — it just falls into the thought\n attractor often enough to be a practical reliability blocker.
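A minimal sketch of how a client could flag this failure automatically, assuming an OpenAI-style choice dict with `finish_reason`, `message.content` and `message.tool_calls`. The function names and the repeat threshold are illustrative and not taken from any runtime.

```python
from typing import Optional


def repeating_line(content: str, min_repeats: int = 20) -> Optional[str]:
    """Return the first non-empty line that occurs at least `min_repeats`
    times in a row in `content`, or None if there is no such run."""
    run_line, run_len = None, 0
    for line in content.split("\n"):
        if line == "":
            continue  # ignore blank lines between repeats
        if line == run_line:
            run_len += 1
        else:
            run_line, run_len = line, 1
        if run_len >= min_repeats:
            return run_line
    return None


def classify(choice: dict) -> str:
    """Classify one chat-completion choice as 'tool_call', 'loop' or 'other'."""
    message = choice.get("message", {})
    if message.get("tool_calls"):
        return "tool_call"
    content = message.get("content") or ""
    # The reported symptom: token budget exhausted on one repeated line.
    if choice.get("finish_reason") == "length" and repeating_line(content):
        return "loop"
    return "other"
```

Counting `classify(...) == "loop"` over N trials gives the kind of "3 of 5" figure quoted above.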

Reproduction. Baseline runs use the model card's recommended sampling (T=1.0, top_p=0.95, top_k=64, per generation_config.json). The attractor is essentially deterministic; sampling variance only controls how often the model escapes.

Tested quant: mlx-community/gemma-4-12B-it-4bit (Apple-Silicon MLX, 4-bit)
Control model: Qwen3.5-9B-4bit on the exact same payload → 5/5 clean tool calls in 1-8 s

Investigated and ruled out:

Temperature: looped at T=0.0, 0.7, 1.0
top_k: looped at 0, 40, 64, and None
Prompt cache: failures occur with both cold and warm cache
Prompt phrasing: identical Codex-CLI system prompt and tool schemas as Qwen3.5 (which passes 5/5)

Why it looks like an attractor, not noise. The first few tokens after the user's task are consistently <|channel>thought-prefixed reasoning, and once the model commits to a reasoning chunk it tends to emit the literal token thought\n over and over instead of completing the chunk and switching to a function call. The 2 successful trials skip the reasoning channel entirely and go straight to <|tool_call>call:exec_command{…}<tool_call|>.
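The split between reasoning-first and direct-tool-call openings can be tallied with a few lines. This is a sketch that assumes the runtime exposes the raw generation with special tokens intact (many strip them) and that the markers appear exactly as quoted above; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

REASONING_PREFIX = "<|channel>thought"  # marker as quoted in the post
TOOL_CALL_PREFIX = "<|tool_call>"       # marker as quoted in the post


def opening_kind(raw: str) -> str:
    """Label a raw generation by what it starts with."""
    text = raw.lstrip()
    if text.startswith(TOOL_CALL_PREFIX):
        return "direct_tool_call"
    if text.startswith(REASONING_PREFIX):
        return "reasoning_first"
    return "other"


def tally(raw_outputs: Iterable[str]) -> Dict[str, int]:
    """Count opening kinds across a batch of raw generations."""
    return dict(Counter(opening_kind(raw) for raw in raw_outputs))
```

If the attractor reading is right, the "loop" outcomes should sit almost entirely inside the `reasoning_first` bucket.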

Smoking-gun snippet (non-streaming, /v1/chat/completions-equivalent generation):

content first 200: 'thought\nthought\nthought\nthought\n…' (× ~1000)
finish_reason: 'length'

What I'd be curious about: is the thought\n loop reproducible at fp16 / bf16 on the original release weights too, or is this attractor introduced/sharpened by 4-bit quantization? Happy to share the captured 47 KB request payload and the per-token logits trace from the looping segment if useful.

For reference / so this isn't duplicated: I've also tracked this downstream as a known-issue note in our runtime — https://github.com/raullenchai/Rapid-MLX/issues/686 — but the fix surface is in the model, not the runtime (the same 4-bit weights with simpler prompts work fine).

Thanks for the model — looking forward to v4.5/5 :)

thnamratha

Google org 5 days ago

Hi @Raullen ,

Thanks for raising this issue and providing a detailed report. To help us reproduce and investigate further, could you please share:

The 47 KB request payload
The per-token logit traces
The exact prompt-format configuration

ManniX-ITA

4 days ago

Hey @thnamratha
With Claude's help, I'll share my agentic-loop test harness.
gemma-4-26B-A4B passes it with zero loops or runaways.

ManniX-ITA

4 days ago

Hi @thnamratha — here's the agentic-loop test harness I mentioned, packaged as a self-contained, pip-installable tool in our public repo so you can reproduce this on your side. It installs and runs independently of the rest of the project — its own requirements.txt / install.sh, no shared imports; the only third-party dependency is PyYAML, the core is pure standard library. Everything to download, install, and run it is below.

What it does (one line): it replays a frozen agentic coding conversation — the exact messages/tools captured right before a model looped — against a chat server across seeds × a sampler matrix, and reports a per-cell fail rate (fails = thinking-channel loops + answer-channel runaways). You vary only the chat template between cells, so the difference in fail rate is that template's effect — which directly answers "does template X stop the loop?".
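The seeds × sampler matrix idea reduces to a small loop. The following is a sketch of the concept only, not the harness's actual code: `replay` is a hypothetical callable that replays the frozen conversation once and returns `"ok"`, `"thinking_loop"` or `"answer_runaway"`.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple

# replay(template, sampler, seed) -> "ok" | "thinking_loop" | "answer_runaway"
Replay = Callable[[str, dict, int], str]


def fail_rates(
    templates: List[str],
    samplers: List[dict],
    seeds: Iterable[int],
    replay: Replay,
) -> Dict[Tuple[str, int], float]:
    """Return the fail rate for every (template, sampler index) cell."""
    seed_list = list(seeds)
    rates = {}
    for template, (idx, sampler) in product(templates, enumerate(samplers)):
        # A seed fails if either channel degenerated.
        fails = sum(replay(template, sampler, seed) != "ok" for seed in seed_list)
        rates[(template, idx)] = fails / len(seed_list)
    return rates
```

Because the fixture, seeds and sampler are held fixed within a row of cells, any difference in rate between two templates is attributable to the template, up to sampling noise.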

1. Get it

git clone https://github.com/mann1x/omnimergekit.git
cd omnimergekit/tools/agentic-loop-harness

2. Install — three modes, pick one

# A) build a pinned CUDA llama-server from source (default)
./install.sh --mode build --cuda-arch 120        # 120=Blackwell, 90=Hopper, 89=Ada, 86=Ampere/3090

# B) use a llama-server binary you already have
./install.sh --mode byo-binary --llama-server-bin /path/to/llama-server

# C) drive an already-running OpenAI-compatible endpoint (vLLM, a gateway, …)
./install.sh --mode byo-endpoint --endpoint http://127.0.0.1:8000

All three create a local .venv and pip install -e .. Then:

source .env        # exports LLAMA_SERVER_BIN (modes A/B) or the endpoint (mode C)

3. Point it at a model (Gemma-4-12B, F16 GGUF)

hf download google/gemma-4-12B-it --local-dir gemma-4-12B-it
python .llama.cpp/convert_hf_to_gguf.py gemma-4-12B-it \
    --outfile gemma-4-12B-it-F16.gguf --outtype f16

Then set model.gguf: in the profile to that file. (We use F16 to remove quantization as a confound when studying the template; any quant works too.)

4. Run

agentic-loop-harness --profile profiles/gemma4.example.yaml
# or:  python -m agentic_loop_harness --profile profiles/gemma4.example.yaml

Everything under test lives in that one profile YAML:

model.chat_template — a list of templates to compare, one cell each. The shipped example is the 4-cell comparison: embedded (the GGUF's own template), reinject-off (history-turn reasoning re-injection disabled), pr35 (your PR #35, default mode), pr35-ptfalse (your PR #35, pass-thinking=false).
server.reasoning_format / reasoning_budget, sampling.matrix, run.seeds, run.fixtures — set these to your own deployment settings. Drop in your own .jinja templates and your own sampler; the harness does the rest.
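Before a long run it can help to confirm that a profile defines every setting named in this post. The check below is a sketch under one assumption: that the dotted names used here (`model.gguf`, `run.seeds`, and so on) correspond to nested mappings in the YAML. The shipped profiles/gemma4.example.yaml is the authority on the real layout, and `missing_keys` is a hypothetical helper, not part of the harness.

```python
from typing import List, Sequence

# Dotted names exactly as they are mentioned in this post.
EXPECTED_KEYS = (
    "model.gguf",
    "model.chat_template",
    "server.reasoning_format",
    "server.reasoning_budget",
    "sampling.matrix",
    "run.seeds",
    "run.fixtures",
    "run.out_dir",
)


def missing_keys(profile: dict, expected: Sequence[str] = EXPECTED_KEYS) -> List[str]:
    """Return the dotted keys that cannot be resolved in a nested dict."""
    missing = []
    for dotted in expected:
        node = profile
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing
```

With PyYAML installed, `missing_keys(yaml.safe_load(open(path)))` should return an empty list for a complete profile, if the nesting assumption holds.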

A per-template fail-rate table prints at the end, and per-seed detail (which channel looped, the repeating unit, generated lengths, finish_reason) lands as JSON in run.out_dir.

It's backend-agnostic: replay.py speaks only OpenAI /v1/chat/completions (streaming), so mode C works against vLLM or any compatible gateway (one endpoint per template, since the template is then fixed server-side).
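For readers who want to consume the same streaming endpoint themselves, here is a minimal sketch of accumulating an OpenAI-style server-sent-event stream into separate thinking and answer channels. It is not replay.py. It assumes the server puts thinking text in a `reasoning_content` delta field (as llama-server does with `--reasoning-format deepseek`); other backends may name it differently or omit it.

```python
import json
from typing import Iterable, Tuple


def accumulate_stream(lines: Iterable[str]) -> Tuple[str, str, str]:
    """Fold SSE lines into (reasoning_text, answer_text, finish_reason)."""
    reasoning, answer, finish = [], [], ""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            # Assumed field name for the thinking channel.
            reasoning.append(delta.get("reasoning_content") or "")
            answer.append(delta.get("content") or "")
            finish = choice.get("finish_reason") or finish
    return "".join(reasoning), "".join(answer), finish
```

Keeping the two channels apart is what lets a checker say which one looped.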

I'll follow up in this thread with our first measurement round — the full methodology, the exact sampler and llama.cpp flags, the build, and the hardware/software details — as soon as our runs complete.

ManniX-ITA

4 days ago

Hi @thnamratha — here's the first measurement round I promised, run with the harness above.

I deliberately tested gemma-4-12B-it at F16 to take quantization off the table: @Raullen 's original reproducer was 4-bit MLX, and the open question was whether the thought\n attractor is introduced/sharpened by 4-bit or already present in the release weights.

TL;DR — the loop reproduces at F16. It is not a quantization artifact: the full-precision release weights fall into the same thought\n… attractor on long agent prompts. The chat template modulates the rate but no template eliminates it.

Setup (the methodology you asked for)

Model: google/gemma-4-12B-it → F16 GGUF (convert_hf_to_gguf.py --outtype f16). No quantization.
Backend: llama.cpp llama-server (commit 9724f664e), 1× RTX PRO 6000 (Blackwell).
Server flags: -ngl 99 -c 600000 --parallel 5 -fa on -ctk q8_0 -ctv q8_0 --reasoning-format deepseek --reasoning-budget 48000 --jinja --chat-template-file <cell> — the 48k thinking budget is deliberately generous so a loop reflects a genuine attractor, not a clipped thinking phase.
Sampler (your model card / generation_config.json): temperature 1.0, top_p 0.95, top_k 64, min_p 0.0, repeat_penalty 1.0 — identical to @Raullen 's.
Generation: max_tokens 32768; 48 seeds (1000–1047) per cell.
Fixture: one frozen agent-style conversation captured right before a loop — system + user task + 89 tool definitions (a real media-automation agent; ~23k-token prompt). Heavier than the original 10-tool reproducer, same failure mode.
What varies between cells: only the chat template. fails = thinking-channel loop OR answer-channel runaway; the separate thinking-loop column is the pure thought\n… symptom @Raullen reported.
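The setup above can be written down as one server command line per template cell plus one request body per seed. This is a sketch for reproduction, not the harness's own code. The `-m` and `--port` flags, the `messages`/`tools` fixture keys, and passing `seed`, `top_k`, `min_p` and `repeat_penalty` as request fields are my assumptions about a typical llama-server setup; every other value is copied from the list above.

```python
from typing import List


def server_argv(server_bin: str, gguf: str, template_file: str, port: int = 8080) -> List[str]:
    """Build the llama-server command line for one template cell."""
    return [
        server_bin, "-m", gguf, "--port", str(port),  # assumed additions
        "-ngl", "99", "-c", "600000", "--parallel", "5",
        "-fa", "on", "-ctk", "q8_0", "-ctv", "q8_0",
        "--reasoning-format", "deepseek", "--reasoning-budget", "48000",
        "--jinja", "--chat-template-file", template_file,
    ]


def request_body(fixture: dict, seed: int) -> dict:
    """Build one /v1/chat/completions request with the model-card sampler."""
    return {
        "messages": fixture["messages"],  # assumed fixture layout
        "tools": fixture["tools"],        # assumed fixture layout
        "temperature": 1.0, "top_p": 0.95, "top_k": 64,
        "min_p": 0.0, "repeat_penalty": 1.0,
        "max_tokens": 32768, "seed": seed, "stream": True,
    }


SEEDS = range(1000, 1048)  # 48 seeds, 1000 through 1047
```

The argv list can be handed to `subprocess.Popen` directly, which avoids shell quoting problems with the template path.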

Results — 48 seeds each, F16, card sampler

| chat template | fails / 48 | fail rate | thinking-loop rate |
|---|---|---|---|
| `embedded` (the GGUF's own template) | 21 / 48 | 43.8% | 37.5% |
| `reinject-off` (history-turn reasoning re-injection disabled) | 20 / 48 | 41.7% | 39.6% |
| `pr35` (your PR #35, default) | 17 / 48 | 35.4% | 25.0% |
| `pr35-ptfalse` (PR #35, `pass-thinking = false`) | 25 / 48 | 52.1% | 45.8% |
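With 48 seeds per cell, the sampling uncertainty on each rate is worth keeping in view. The sketch below computes a 95% Wilson score interval for the fail counts in the table; the choice of the Wilson interval is mine, not part of the harness output.

```python
import math
from typing import Tuple


def wilson_interval(fails: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion fails / n."""
    p = fails / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Fail counts copied from the table above (48 seeds per cell).
FAILS = {"embedded": 21, "reinject-off": 20, "pr35": 17, "pr35-ptfalse": 25}

for name, fails in FAILS.items():
    low, high = wilson_interval(fails, 48)
    print(f"{name:13s} {fails / 48:6.1%}  95% CI [{low:.1%}, {high:.1%}]")
```

Each interval comes out roughly 13 percentage points wide on either side, and all four overlap, so the ordering of the templates is suggestive rather than settled at this sample size.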

Reading

Not quantization. At F16 the embedded-template thinking-loop rate is ~38%, lower than @Raullen 's "3 of 5" but compatible with it given that sample was only five trials on a lighter prompt. The attractor therefore lives in the release weights; the 4-bit observation reflects it rather than causing it.
The template helps but doesn't cure. PR #35 is the best of the four (44% → 35% fails), a noticeable reduction in this sample, yet about a third of seeds still fail and a quarter still hit the pure thinking loop. Disabling history-turn reasoning re-injection barely moves the needle (42%), and pass-thinking=false makes it worse (52%). On the 12B this reads as a weight-level attractor that the template can mitigate but not remove.
It looks capacity-related, not a template defect. As I noted earlier in this thread, the same harness + same four templates run clean (0 loops / 0 runaways) on the larger gemma-4-26B-A4B sibling. Same scaffolding, no attractor — the 12B just falls into it.

Per-seed detail (which channel looped, the repeating unit, generated length, finish_reason) is written as JSON in the harness out_dir. Happy to share the captured fixture (solar_build_start.json) and any per-seed traces, and to re-run any sampler / template / flag combination you'd like to see for the next IT iteration. 🙏
