Instructions to use google/gemma-4-12B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-4-12B-it with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForMultimodalLM processor = AutoProcessor.from_pretrained("google/gemma-4-12B-it") model = AutoModelForMultimodalLM.from_pretrained("google/gemma-4-12B-it") - Notebooks
- Google Colab
- Kaggle
gemma-4-12B-it: deterministic "thought\n thought\n …" degenerate loop on long agent prompts (~60% reproducer, 4-bit, repros across temperatures)
Hi Gemma team — wanted to surface a reliable model-behavior reproducer in case it's useful for the next instruction-tune iteration. This is a heads-up / data report, not a bug report against any specific runtime.
Symptom. When gemma-4-12B-it receives a long agent-style system prompt (~7,800 tokens) with multiple tool definitions (10 tools, OpenAI function-calling JSON schemas — exec_command, write_stdin, update_plan, etc.) and a user message asking the model to use one of those tools, the model enters a degenerate generation loop in roughly 3 of 5 trials:
content (first 200 chars): 'thought\nthought\nthought\nthought\n…' (× ~1000)
finish_reason: 'length'
tool_calls: []
The remaining 2/5 trials emit a clean tool call (exec_command({"cmd": "head -n 1 pyproject.toml"})) within ~1 s, so the model is clearly capable of solving the task — it just falls into the thought\n attractor often enough to be a practical reliability blocker.
Reproduction. All from the model card's recommended sampling (T=1.0, top_p=0.95, top_k=64, per generation_config.json). The attractor is essentially deterministic; sampling variance only controls how often the model escapes.
- Tested quant:
mlx-community/gemma-4-12B-it-4bit(Apple-Silicon MLX, 4-bit) - Control model: Qwen3.5-9B-4bit on the exact same payload → 5/5 clean tool calls in 1-8 s
Investigated and ruled out:
- Temperature: looped at T=0.0, 0.7, 1.0
- top_k: 0 / 40 / 64 / None
- Prompt cache: failures occur with both cold and warm cache
- Prompt phrasing: identical Codex-CLI system prompt and tool schemas as Qwen3.5 (which passes 5/5)
Why it looks like an attractor, not noise. The first few tokens after the user's task are consistently <|channel>thought-prefixed reasoning, and once the model commits to a reasoning chunk it tends to emit the literal token thought\n over and over instead of completing the chunk and switching to a function call. The 2 successful trials skip the reasoning channel entirely and go straight to <|tool_call>call:exec_command{…}<tool_call|>.
Smoking-gun snippet (non-streaming, /v1/chat/completions-equivalent generation):
content first 200: 'thought\nthought\nthought\nthought\n…' (× ~1000)
finish_reason: 'length'
What I'd be curious about: is the thought\n loop reproducible at fp16 / bf16 on the original release weights too, or is this attractor introduced/sharpened by 4-bit quantization? Happy to share the captured 47 KB request payload and the per-token logits trace from the looping segment if useful.
For reference / so this isn't duplicated: I've also tracked this downstream as a known-issue note in our runtime — https://github.com/raullenchai/Rapid-MLX/issues/686 — but the fix surface is in the model, not the runtime (the same 4-bit weights with simpler prompts work fine).
Thanks for the model — looking forward to v4.5/5 :)
Hi @Raullen ,
Thanks for addressing the issue and providing in detailed report. To help us reproduce and for further investigation, could you please share us:
- 47KB Payload request
- Per-token logit traces
- Exact-Prompt Format configuration
Hey @thnamratha
Will share with Claude my agentic loop test harness.
Gemma4 A4B passes it with zero loops or runaways.
Hi @thnamratha — here's the agentic-loop test harness I mentioned, packaged as a self-contained, pip-installable tool in our public repo so you can reproduce this on your side. It installs and runs independently of the rest of the project — its own requirements.txt / install.sh, no shared imports; the only third-party dependency is PyYAML, the core is pure standard library. Everything to download, install, and run it is below.
What it does (one line): it replays a frozen agentic coding conversation — the exact messages/tools captured right before a model looped — against a chat server across seeds × a sampler matrix, and reports a per-cell fail rate (fails = thinking-channel loops + answer-channel runaways). You vary only the chat template between cells, so the difference in fail rate is that template's effect — which directly answers "does template X stop the loop?".
1. Get it
git clone https://github.com/mann1x/omnimergekit.git
cd omnimergekit/tools/agentic-loop-harness
2. Install — three modes, pick one
# A) build a pinned CUDA llama-server from source (default)
./install.sh --mode build --cuda-arch 120 # 120=Blackwell, 90=Hopper, 89=Ada, 86=Ampere/3090
# B) use a llama-server binary you already have
./install.sh --mode byo-binary --llama-server-bin /path/to/llama-server
# C) drive an already-running OpenAI-compatible endpoint (vLLM, a gateway, …)
./install.sh --mode byo-endpoint --endpoint http://127.0.0.1:8000
All three create a local .venv and pip install -e .. Then:
source .env # exports LLAMA_SERVER_BIN (modes A/B) or the endpoint (mode C)
3. Point it at a model (Gemma-4-12B, F16 GGUF)
hf download google/gemma-4-12B-it --local-dir gemma-4-12B-it
python .llama.cpp/convert_hf_to_gguf.py gemma-4-12B-it \
--outfile gemma-4-12B-it-F16.gguf --outtype f16
Then set model.gguf: in the profile to that file. (We use F16 to remove quantization as a confound when studying the template; any quant works too.)
4. Run
agentic-loop-harness --profile profiles/gemma4.example.yaml
# or: python -m agentic_loop_harness --profile profiles/gemma4.example.yaml
Everything under test lives in that one profile YAML:
model.chat_template— a list of templates to compare, one cell each. The shipped example is the 4-cell comparison:embedded(the GGUF's own template),reinject-off(history-turn reasoning re-injection disabled),pr35(your PR #35, default mode),pr35-ptfalse(your PR #35, pass-thinking=false).server.reasoning_format/reasoning_budget,sampling.matrix,run.seeds,run.fixtures— set these to your own deployment settings. Drop in your own.jinjatemplates and your own sampler; the harness does the rest.
A per-template fail-rate table prints at the end, and per-seed detail (which channel looped, the repeating unit, generated lengths, finish_reason) lands as JSON in run.out_dir.
It's backend-agnostic: replay.py speaks only OpenAI /v1/chat/completions (streaming), so mode C works against vLLM or any compatible gateway (one endpoint per template, since the template is then fixed server-side).
I'll follow up in this thread with our first measurement round — the full methodology, the exact sampler and llama.cpp flags, the build, and the hardware/software details — as soon as our runs complete.
Hi @thnamratha — here's the first measurement round I promised, run with the harness above.
I deliberately tested gemma-4-12B-it at F16 to take quantization off the table: @Raullen 's original reproducer was 4-bit MLX, and the open question was whether the thought\n attractor is introduced/sharpened by 4-bit or already present in the release weights.
TL;DR — the loop reproduces at F16. It is not a quantization artifact: the full-precision release weights fall into the same thought\n… attractor on long agent prompts. The chat template modulates the rate but no template eliminates it.
Setup (the methodology you asked for)
- Model:
google/gemma-4-12B-it→ F16 GGUF (convert_hf_to_gguf.py --outtype f16). No quantization. - Backend: llama.cpp
llama-server(commit9724f664e), 1× RTX PRO 6000 (Blackwell). - Server flags:
-ngl 99 -c 600000 --parallel 5 -fa on -ctk q8_0 -ctv q8_0 --reasoning-format deepseek --reasoning-budget 48000 --jinja --chat-template-file <cell>— the 48k thinking budget is deliberately generous so a loop reflects a genuine attractor, not a clipped thinking phase. - Sampler (your model card /
generation_config.json):temperature 1.0, top_p 0.95, top_k 64, min_p 0.0, repeat_penalty 1.0— identical to @Raullen 's. - Generation:
max_tokens 32768; 48 seeds (1000–1047) per cell. - Fixture: one frozen agent-style conversation captured right before a loop — system + user task + 89 tool definitions (a real media-automation agent; ~23k-token prompt). Heavier than the original 10-tool reproducer, same failure mode.
- What varies between cells: only the chat template.
fails = thinking-channel loop OR answer-channel runaway; the separatethinking-loopcolumn is the purethought\n…symptom @Raullen reported.
Results — 48 seeds each, F16, card sampler
| chat template | fails / 48 | fail rate | thinking-loop rate |
|---|---|---|---|
embedded (the GGUF's own template) |
21 / 48 | 43.8% | 37.5% |
reinject-off (history-turn reasoning re-injection disabled) |
20 / 48 | 41.7% | 39.6% |
pr35 (your PR #35, default) |
17 / 48 | 35.4% | 25.0% |
pr35-ptfalse (PR #35, pass-thinking = false) |
25 / 48 | 52.1% | 45.8% |
Reading
- Not quantization. At F16 the embedded-template thinking-loop rate is ~38% (≈ @Raullen 's "3 of 5") — so the attractor lives in the release weights; the 4-bit observation reflects it rather than causing it.
- The template helps but doesn't cure. PR #35 is the best of the four (44% → 35% fails) — a real reduction — yet a third of seeds still loop. Disabling history-turn reasoning re-injection barely moves the needle (42%), and
pass-thinking=falsemakes it worse (52%). On the 12B this reads as a weight-level attractor that the template can mitigate but not remove. - It looks capacity-related, not a template defect. As I noted earlier in this thread, the same harness + same four templates run clean (0 loops / 0 runaways) on the larger
gemma-4-26B-A4Bsibling. Same scaffolding, no attractor — the 12B just falls into it.
Per-seed detail (which channel looped, the repeating unit, generated length, finish_reason) is written as JSON in the harness out_dir. Happy to share the captured fixture (solar_build_start.json) and any per-seed traces, and to re-run any sampler / template / flag combination you'd like to see for the next IT iteration. 🙏