gemma-4-12B-it: deterministic "thought\n thought\n …" degenerate loop on long agent prompts (~60% reproducer, 4-bit, repros across temperatures)

#41
by Raullen - opened

Hi Gemma team — wanted to surface a reliable model-behavior reproducer in case it's useful for the next instruction-tune iteration. This is a heads-up / data report, not a bug report against any specific runtime.

Symptom. When gemma-4-12B-it receives a long agent-style system prompt (~7,800 tokens) with multiple tool definitions (10 tools, OpenAI function-calling JSON schemas — exec_command, write_stdin, update_plan, etc.) and a user message asking the model to use one of those tools, the model enters a degenerate generation loop in roughly 3 of 5 trials:

content (first 200 chars): 'thought\nthought\nthought\nthought\n…'  (× ~1000)
finish_reason: 'length'
tool_calls: []

The remaining 2/5 trials emit a clean tool call (exec_command({"cmd": "head -n 1 pyproject.toml"})) within ~1 s, so the model is clearly capable of solving the task — it just falls into the thought\n attractor often enough to be a practical reliability blocker.
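For anyone triaging similar outputs, here is a minimal sketch of how the symptom above could be flagged automatically. The function name and both thresholds are illustrative assumptions, not part of any runtime mentioned in this thread.

```python
def find_repeating_unit(text, max_unit_len=64, min_repeats=20):
    """Return (unit, count) if `text` opens with a short unit repeated
    back-to-back at least `min_repeats` times, otherwise None.

    Both thresholds are arbitrary illustrative defaults.
    """
    for unit_len in range(1, min(max_unit_len, len(text)) + 1):
        unit = text[:unit_len]
        count = 0
        pos = 0
        # Count consecutive copies of `unit` starting at the beginning.
        while text.startswith(unit, pos):
            count += 1
            pos += unit_len
        if count >= min_repeats:
            return unit, count
    return None


if __name__ == "__main__":
    looping = "thought\n" * 1000
    print(find_repeating_unit(looping))
```

Shorter candidate units are tried first, so the smallest repeating unit is the one reported.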

Reproduction. The 5 baseline trials use the model card's recommended sampling (T=1.0, top_p=0.95, top_k=64, per generation_config.json). The attractor itself looks deterministic (it also appears at T=0.0); sampling variance only controls how often the model escapes.

  • Tested quant: mlx-community/gemma-4-12B-it-4bit (Apple-Silicon MLX, 4-bit)
  • Control model: Qwen3.5-9B-4bit on the exact same payload → 5/5 clean tool calls in 1-8 s

Investigated and ruled out:

  • Temperature: looped at T=0.0, 0.7, 1.0
  • top_k: looped at 0 / 40 / 64 / None
  • Prompt cache: failures occur with both cold and warm cache
  • Prompt phrasing: the same Codex-CLI system prompt and tool schemas that Qwen3.5 receives (and passes 5/5)
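A sketch of how the sampler values listed above could be enumerated for a replay sweep. It assumes a full cross product of temperature and top_k, which the post does not state, and it assumes that None means "omit top_k and let the server use its default". The request field names follow the common OpenAI-compatible convention; top_k is an extension that not every server accepts.

```python
from itertools import product

# Values taken from the "ruled out" list; the cross product is an assumption.
TEMPERATURES = [0.0, 0.7, 1.0]
TOP_KS = [0, 40, 64, None]


def sampler_grid(top_p=0.95):
    """Build one dict of sampling parameters per (temperature, top_k) cell."""
    cells = []
    for temperature, top_k in product(TEMPERATURES, TOP_KS):
        cell = {"temperature": temperature, "top_p": top_p}
        if top_k is not None:
            # None is treated as "do not send top_k at all" (assumption).
            cell["top_k"] = top_k
        cells.append(cell)
    return cells


if __name__ == "__main__":
    for cell in sampler_grid():
        print(cell)
```

Each cell can be merged into the frozen request payload before it is replayed.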

Why it looks like an attractor, not noise. The first few tokens after the user's task are consistently <|channel>thought-prefixed reasoning, and once the model commits to a reasoning chunk it tends to emit the literal token thought\n over and over instead of completing the chunk and switching to a function call. The 2 successful trials skip the reasoning channel entirely and go straight to <|tool_call>call:exec_command{…}<tool_call|>.

Smoking-gun snippet (non-streaming, /v1/chat/completions-equivalent generation):

content first 200: 'thought\nthought\nthought\nthought\n…' (× ~1000)
finish_reason: 'length'

What I'd be curious about: is the thought\n loop reproducible at fp16 / bf16 on the original release weights too, or is this attractor introduced/sharpened by 4-bit quantization? Happy to share the captured 47 KB request payload and the per-token logits trace from the looping segment if useful.

For reference / so this isn't duplicated: I've also tracked this downstream as a known-issue note in our runtime — https://github.com/raullenchai/Rapid-MLX/issues/686 — but the fix surface is in the model, not the runtime (the same 4-bit weights with simpler prompts work fine).

Thanks for the model — looking forward to v4.5/5 :)

Google org

Hi @Raullen ,

Thanks for raising the issue and providing a detailed report. To help us reproduce it and investigate further, could you please share the following:

  1. The 47 KB request payload
  2. The per-token logit traces
  3. The exact prompt-format configuration

Hey @thnamratha
I'll share my agentic-loop test harness (put together with Claude).
Gemma4 A4B passes it with zero loops or runaways.

Hi @thnamratha — here's the agentic-loop test harness I mentioned, packaged as a self-contained, pip-installable tool in our public repo so you can reproduce this on your side. It installs and runs independently of the rest of the project — its own requirements.txt / install.sh, no shared imports; the only third-party dependency is PyYAML, the core is pure standard library. Everything to download, install, and run it is below.

What it does (one line): it replays a frozen agentic coding conversation — the exact messages/tools captured right before a model looped — against a chat server across seeds × a sampler matrix, and reports a per-cell fail rate (fails = thinking-channel loops + answer-channel runaways). You vary only the chat template between cells, so the difference in fail rate is that template's effect — which directly answers "does template X stop the loop?".
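To make the "per-cell fail rate" concrete, a minimal sketch of the tally described above. The record layout (template, seed, outcome) and the outcome labels are hypothetical and chosen for illustration; they are not the harness's actual output format.

```python
from collections import defaultdict

# Hypothetical outcome labels; a run counts as a fail if it is either kind.
FAIL_OUTCOMES = {"thinking_loop", "answer_runaway"}


def fail_rates(records):
    """Aggregate (template, seed, outcome) records into per-template
    (fails, total, fail_rate) tuples."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for template, _seed, outcome in records:
        totals[template] += 1
        if outcome in FAIL_OUTCOMES:
            fails[template] += 1
    return {t: (fails[t], totals[t], fails[t] / totals[t]) for t in totals}


if __name__ == "__main__":
    demo = [
        ("embedded", 1000, "thinking_loop"),
        ("embedded", 1001, "ok"),
        ("pr35", 1000, "ok"),
        ("pr35", 1001, "answer_runaway"),
    ]
    for template, (f, n, rate) in fail_rates(demo).items():
        print(f"{template}: {f}/{n} = {rate:.1%}")
```

Because only the template differs between cells, comparing these rates across templates isolates the template's contribution.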

1. Get it

git clone https://github.com/mann1x/omnimergekit.git
cd omnimergekit/tools/agentic-loop-harness

2. Install — three modes, pick one

# A) build a pinned CUDA llama-server from source (default)
./install.sh --mode build --cuda-arch 120        # 120=Blackwell, 90=Hopper, 89=Ada, 86=Ampere/3090

# B) use a llama-server binary you already have
./install.sh --mode byo-binary --llama-server-bin /path/to/llama-server

# C) drive an already-running OpenAI-compatible endpoint (vLLM, a gateway, …)
./install.sh --mode byo-endpoint --endpoint http://127.0.0.1:8000

All three create a local .venv and run pip install -e . inside it. Then:

source .env        # exports LLAMA_SERVER_BIN (modes A/B) or the endpoint (mode C)

3. Point it at a model (Gemma-4-12B, F16 GGUF)

hf download google/gemma-4-12B-it --local-dir gemma-4-12B-it
python .llama.cpp/convert_hf_to_gguf.py gemma-4-12B-it \
    --outfile gemma-4-12B-it-F16.gguf --outtype f16

Then set model.gguf: in the profile to that file. (We use F16 to remove quantization as a confound when studying the template; any quant works too.)

4. Run

agentic-loop-harness --profile profiles/gemma4.example.yaml
# or:  python -m agentic_loop_harness --profile profiles/gemma4.example.yaml

Everything under test lives in that one profile YAML:

  • model.chat_template — a list of templates to compare, one cell each. The shipped example is the 4-cell comparison: embedded (the GGUF's own template), reinject-off (history-turn reasoning re-injection disabled), pr35 (your PR #35, default mode), pr35-ptfalse (your PR #35, pass-thinking=false).
  • server.reasoning_format / reasoning_budget, sampling.matrix, run.seeds, run.fixtures: set these to your own deployment settings. Drop in your own .jinja templates and your own sampler; the harness does the rest.

A per-template fail-rate table prints at the end, and per-seed detail (which channel looped, the repeating unit, generated lengths, finish_reason) lands as JSON in run.out_dir.

It's backend-agnostic: replay.py speaks only OpenAI /v1/chat/completions (streaming), so mode C works against vLLM or any compatible gateway (one endpoint per template, since the template is then fixed server-side).
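For readers unfamiliar with the streaming format, a minimal sketch of how streamed /v1/chat/completions chunks can be folded back into a single result. This is not taken from replay.py. It assumes the usual server-sent-events framing ("data: {json}" lines ending with "data: [DONE]") and assumes the server reports thinking text in a delta field named reasoning_content, which varies by backend and configuration.

```python
import json


def accumulate_stream(lines):
    """Fold streamed chat-completion chunks into one dict with the answer
    text, the reasoning text, and the final finish_reason."""
    content, reasoning, finish_reason = [], [], None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content.append(delta["content"])
            # Field name is an assumption; some servers use a different key.
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return {
        "content": "".join(content),
        "reasoning_content": "".join(reasoning),
        "finish_reason": finish_reason,
    }
```

Keeping the two channels separate is what allows a loop to be attributed to the thinking channel or to the answer channel.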

I'll follow up in this thread with our first measurement round — the full methodology, the exact sampler and llama.cpp flags, the build, and the hardware/software details — as soon as our runs complete.

Hi @thnamratha — here's the first measurement round I promised, run with the harness above.

I deliberately tested gemma-4-12B-it at F16 to take quantization off the table: @Raullen 's original reproducer was 4-bit MLX, and the open question was whether the thought\n attractor is introduced/sharpened by 4-bit or already present in the release weights.

TL;DR — the loop reproduces at F16. It is not a quantization artifact: the full-precision release weights fall into the same thought\n… attractor on long agent prompts. The chat template modulates the rate but no template eliminates it.

Setup (the methodology you asked for)

  • Model: google/gemma-4-12B-it, F16 GGUF (convert_hf_to_gguf.py --outtype f16). No quantization.
  • Backend: llama.cpp llama-server (commit 9724f664e), 1× RTX PRO 6000 (Blackwell).
  • Server flags: -ngl 99 -c 600000 --parallel 5 -fa on -ctk q8_0 -ctv q8_0 --reasoning-format deepseek --reasoning-budget 48000 --jinja --chat-template-file <cell> — the 48k thinking budget is deliberately generous so a loop reflects a genuine attractor, not a clipped thinking phase.
  • Sampler (your model card / generation_config.json): temperature 1.0, top_p 0.95, top_k 64, min_p 0.0, repeat_penalty 1.0 — identical to @Raullen 's.
  • Generation: max_tokens 32768; 48 seeds (1000–1047) per cell.
  • Fixture: one frozen agent-style conversation captured right before a loop — system + user task + 89 tool definitions (a real media-automation agent; ~23k-token prompt). Heavier than the original 10-tool reproducer, same failure mode.
  • What varies between cells: only the chat template. fails = thinking-channel loop OR answer-channel runaway; the separate thinking-loop column is the pure thought\n… symptom @Raullen reported.

Results — 48 seeds each, F16, card sampler

chat template                                                | fails / 48 | fail rate | thinking-loop rate
embedded (the GGUF's own template)                           | 21 / 48    | 43.8%     | 37.5%
reinject-off (history-turn reasoning re-injection disabled)  | 20 / 48    | 41.7%     | 39.6%
pr35 (your PR #35, default)                                  | 17 / 48    | 35.4%     | 25.0%
pr35-ptfalse (PR #35, pass-thinking = false)                 | 25 / 48    | 52.1%     | 45.8%

Reading

  1. Not quantization. At F16 the embedded-template thinking-loop rate is ~38% (18 of 48; the same failure as @Raullen 's "3 of 5", which came from only five trials on a smaller prompt), so the attractor lives in the release weights; the 4-bit observation reflects it rather than causing it.
  2. The template helps but doesn't cure. PR #35 is the best of the four (44% → 35% fails), a real reduction, yet about a third of seeds still fail and a quarter still hit the thinking loop. Disabling history-turn reasoning re-injection barely moves the needle (42%), and pass-thinking=false makes it worse (52%). On the 12B this reads as a weight-level attractor that the template can mitigate but not remove.
  3. It looks capacity-related, not a template defect. As I noted earlier in this thread, the same harness + same four templates run clean (0 loops / 0 runaways) on the larger gemma-4-26B-A4B sibling. Same scaffolding, no attractor — the 12B just falls into it.
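As a rough aid for reading the table, a sketch that puts 95% Wilson score intervals around the four fail counts. This is an added illustration, not part of the harness output; the choice of the Wilson interval and z = 1.96 are assumptions made here.

```python
from math import sqrt


def wilson_interval(fails, n, z=1.96):
    """Wilson score interval for a binomial proportion fails / n."""
    p = fails / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Fail counts out of 48 seeds, copied from the results table.
RESULTS = {"embedded": 21, "reinject-off": 20, "pr35": 17, "pr35-ptfalse": 25}

if __name__ == "__main__":
    for name, fails in RESULTS.items():
        lo, hi = wilson_interval(fails, 48)
        print(f"{name:13s} {fails}/48  95% CI {lo:.1%} to {hi:.1%}")
```

At 48 seeds per cell each interval spans roughly ±13 percentage points, so gaps of a few points between cells (for example embedded vs. reinject-off) sit within sampling noise, and more seeds would be needed to size the larger gaps precisely.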

Per-seed detail (which channel looped, the repeating unit, generated length, finish_reason) is written as JSON in the harness out_dir. Happy to share the captured fixture (solar_build_start.json) and any per-seed traces, and to re-run any sampler / template / flag combination you'd like to see for the next IT iteration. 🙏
