Qwen 3.6 Agentic Chat Template for llama.cpp
A Jinja chat template for Qwen 3.6 tuned for llama.cpp / ik_llama servers driving agentic coding harnesses (OpenCode, Claude Code-style loops, aider, etc.).
Built on froggeric/Qwen-Fixed-Chat-Templates v21.3 — all of froggeric's fixes are inherited (KV-cache-stable history, thinking toggles, minja compatibility, two-tier error escalation, preserve_thinking, payload truncation). This repo adds the agentic hardening layer on top and tracks upstream releases.
Quick install (llama.cpp)
llama-server ... --jinja --chat-template-file chat_template.jinja
Recent llama.cpp builds auto-derive the tool-call parser from the template — the default XML tool format parses out of the box, including large multiline write payloads (the server does the JSON escaping mechanically; the model never escapes anything).
Older builds (pre-auto-parser, e.g. older ik_llama) only parse Hermes-JSON tool calls. For those, add:
--chat-template-kwargs '{"tool_call_format":"json"}'
Symptom that you need this: tool calls silently appear as plain <function=...> text instead of executing.
What this adds over upstream
1. Agentic loop detection (template-level, engine-agnostic)
Samplers (DRY, repetition penalties) only catch verbatim token repeats. Agentic loops repeat actions, not tokens — same tool called with varying arguments, or an act/verify pair cycling forever because the verification can never succeed. Two detectors run over the rendered history (deterministic per message position, so KV cache prefixes stay stable):
- Identical-response detector: compares each tool response against the previous two. Results matching at period 1 or 2 for repeat_nudge_after consecutive tool turns (default 6) mean no new information is entering the loop; a system warning is injected telling the model that the approach, not the arguments, must change. Productive loops (edit → test → edit → test with changing test output) are provably untouched, because changing results reset the counter. The no-progress detector skips mutating tools (the mutating_tools kwarg): writing 6 files returns 6 identical "success" strings, which is productive and not a loop, while a query/read tool returning identical results is the stuck signal. Failure loops on any tool are still caught by the consecutive-error detector.
Measured on 381 real agentic sessions, this design fires ~90% less often than naive same-tool / any-result counting while preserving every genuine stuck-loop catch. (An earlier same-tool-streak detector was removed in v5: 90% of its firings were productive bash/read/webfetch sequences, and its real catches were already covered by the no-progress detector.)
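To make the period-1/period-2 idea concrete, here is a minimal Python sketch of a no-progress counter. It is not the template's Jinja logic: the function names, the (tool_name, result) tuple shape, the three-name mutating list, and the choice to count only repeats (so the first occurrence of a result does not count) are all assumptions made for illustration.

```python
def no_progress_streak(turns, mutating_tools="write edit bash"):
    """Count trailing tool turns whose result repeats one of the previous two.

    turns: list of (tool_name, result_text) pairs, oldest first (hypothetical shape).
    mutating_tools: space-delimited names, matched case-insensitively.
    """
    mutating = set(mutating_tools.lower().split())
    streak = 0
    seen = []  # trimmed results of all earlier tool turns
    for name, result in turns:
        text = result.strip()
        # Period 1 = same as the previous result, period 2 = same as the one before.
        repeats = text in seen[-2:]
        if name.lower() in mutating or not repeats:
            streak = 0  # new information, or a mutating tool: not a stuck signal
        else:
            streak += 1
        seen.append(text)
    return streak


def should_nudge(turns, repeat_nudge_after=6, mutating_tools="write edit bash"):
    """True when the warning would be due; a threshold of 0 disables it."""
    if repeat_nudge_after <= 0:
        return False
    return no_progress_streak(turns, mutating_tools) >= repeat_nudge_after
```

In this sketch an A/B/A/B cycle of read-type results trips the counter just like a run of identical results does, while a run of identical "success" strings from a write tool never does.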
2. Harness-aware error markers
Upstream's tool-failure detector targets CLI output (traceback, command not found, ...). Agentic harnesses wrap errors in their own envelopes, which all evaded it. Added (verified against OpenCode's actual error strings): invalid input for tool, json parsing failed, unknown tool, error repairing, tool execution failed, err:.
Deliberately not matched: The user rejected permission ... and Tool execution aborted — these are user actions, and telling the model to "retry with corrected arguments" after a human said no is harmful. If you add markers, preserve this property.
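If you extend the marker list, a small check like the following can guard the "never match user actions" property. This is a hedged sketch, not code from the template: the function names are hypothetical, the case-insensitive substring match is an assumption about how the markers are applied, and the user-action samples are only the string prefixes quoted above.

```python
# Markers listed in this README for harness-wrapped errors.
HARNESS_ERROR_MARKERS = (
    "invalid input for tool",
    "json parsing failed",
    "unknown tool",
    "error repairing",
    "tool execution failed",
    "err:",
)

# User actions that must never be treated as tool failures.
USER_ACTION_SAMPLES = (
    "The user rejected permission",
    "Tool execution aborted",
)


def looks_like_harness_error(response, markers=HARNESS_ERROR_MARKERS):
    """Case-insensitive substring match (an assumption of this sketch)."""
    text = response.lower()
    return any(marker in text for marker in markers)


def markers_respect_user_actions(markers):
    """True if no marker fires on any of the user-action samples."""
    return not any(
        looks_like_harness_error(sample, markers) for sample in USER_ACTION_SAMPLES
    )
```

Running markers_respect_user_actions over a candidate list before committing it catches an overly broad marker such as "aborted" or "rejected".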
3. Everything is a kwarg
| kwarg | default | purpose |
|---|---|---|
| tool_call_format | xml | json for engines whose parser only reads Hermes JSON (older ik_llama) |
| repeat_nudge_after | 6 | identical-result turns before the no-progress warning; 0 disables it (0 tokens if never hit) |
| mutating_tools | cross-harness union (~55 names) | space-delimited tool names excluded from no-progress detection, matched case-insensitively. Covers OpenCode, Claude Code, Hermes, Cline, Roo, OpenHands, Gemini CLI, Codex, Goose, Continue, Cursor, Windsurf, Amazon Q, SWE-agent, pi, MCP-filesystem out of the box. Override for a custom harness. MCP-prefixed tools (mcp_x_write_file, puppeteer_puppeteer_navigate) aren't matched by design: they rarely return identical results, so they don't false-fire; add the exact prefixed name if one ever does. |
| think_on_tool_failure | false | true keeps the think block open during error escalation (upstream empties it at 2+ consecutive failures); recommended for RL-trained reasoning models that plan recovery in-think |
| enable_thinking / preserve_thinking / max_tool_arg_chars / max_tool_response_chars | upstream | inherited from froggeric v21.3 |
Set via --chat-template-kwargs '{"repeat_nudge_after":4}' etc.
Set kwargs on the llama-server command line, not in your client config. Some harnesses (OpenCode's @ai-sdk/openai-compatible provider, #26233) silently drop client-side chat_template_kwargs / extraBody, so a kwarg set in opencode.json never reaches the server. --chat-template-kwargs '{...}' is applied server-side at render time and always works, regardless of client.
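When the server is started from a script, building the kwargs argument programmatically avoids hand-quoting mistakes in the JSON. The sketch below uses only the Python standard library and the flags shown in this README; the helper name template_args is hypothetical, and the rest of your llama-server command line (model path, port, and so on) is left for you to supply.

```python
import json
import shlex


def template_args(kwargs, template_path="chat_template.jinja"):
    """Return the template-related llama-server arguments as an argv list."""
    return [
        "--jinja",
        "--chat-template-file",
        template_path,
        "--chat-template-kwargs",
        # Compact JSON so the argument matches the style used in this README.
        json.dumps(kwargs, separators=(",", ":")),
    ]


args = template_args({"repeat_nudge_after": 4, "tool_call_format": "json"})

# Pass `args` straight to subprocess (no shell quoting needed), or print a
# copy-pasteable shell form with shlex.join:
print(shlex.join(args))
```

Passing the list form to a process launcher sidesteps shell quoting entirely; shlex.join is only needed when you want a string to paste into a terminal.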
Files
- chat_template.jinja: the template
- chat_template_oneline.txt: single-line variant for engines that require it
- tests/test_template.py: render-test suite (Python jinja2; run after any edit)
- scripts/measure_loop_nudges.py: reproduce the loop-nudge-reduction metric on your own OpenCode sessions
llama.cpp integration notes (verified)
- Grammar enforcement is active when llama.cpp's auto-parser (b8227+) can analyze the template; this template is auto-parser-legible (tool_mode: TAG_WITH_TAGGED). Tool calls are constrained by a GBNF grammar compiled from your tool schemas: malformed calls and invalid argument JSON are token-level impossible, verified up to 163 simultaneous tool schemas.
- Grammar cannot prevent max_tokens truncation. If a tool call is cut off mid-arguments, the client sees "JSON parsing failed" on a truncated (not mis-escaped) string. Rule: your client's output-token limit must exceed reasoning budget + largest expected tool call.
- Note for automated code review. The template strips llama.cpp's media placeholder token (the word media in double-underscores inside angle brackets) from assistant/tool history to prevent a tokenizer 400 on some builds. HTML-interpreting fetch tools often eat that angle-bracket token, making the guard lines look like if '' in content / split(''), an empty-string comparison that would "strip everything." That is a rendering artifact of the fetcher, not the code. Always review the RAW .jinja file; the guards test for the token's presence first and never touch normal content.
- Vision/multimodal: llama-server replaces media with <__media__> markers inside content parts; this template passes them through untouched (supports_typed_content: false + supports_string_content: true is the correct, preferred caps state for llama.cpp, since the server flattens for you). Verified end-to-end with image input.
- supports_preserve_reasoning: true means llama.cpp's native preserve_thinking context variable is honored, so server-side reasoning-history controls work.
- XML vs JSON for string-argument harnesses: many harnesses (OpenCode/Vercel AI SDK, etc.) send tool-call arguments as JSON strings, not objects. In XML mode the template can't decompose a JSON string into <parameter> blocks (minja has no JSON parser), so such history renders as raw JSON inside <function=>, a poor in-context example. If your harness sends string arguments, prefer JSON mode (tool_call_format: json), which renders them cleanly.
- Testing pitfall: with --reasoning-format none, thinking streams into content. When testing tool_choice: required or grammar behavior, disable thinking (<|think_off|>) or budget generously, and remember the grammar permits content before a tool call; prose output under a small max_tokens is not evidence the grammar is absent.
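For reviewers who cannot easily read the raw Jinja, the presence-first media-marker guard described in the code-review note can be pictured with this Python sketch. It mirrors the described behavior only (check for the marker, then strip it from assistant and tool history, leave user messages alone); the function names and the message-dict shape are assumptions, and the real template also covers reasoning content.

```python
MEDIA_MARKER = "<__media__>"


def strip_media_marker(content):
    """Remove the marker only when it is actually present."""
    if MEDIA_MARKER in content:  # presence check first: normal text is never touched
        return "".join(content.split(MEDIA_MARKER))
    return content


def sanitize_history(messages):
    """Return a copy of the history with the marker stripped from
    assistant/tool string content; user messages pass through unchanged."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if msg.get("role") in ("assistant", "tool") and isinstance(content, str):
            msg = {**msg, "content": strip_media_marker(content)}
        cleaned.append(msg)
    return cleaned
```

If a fetch tool shows you the constant as an empty string, that is the rendering artifact the note warns about; in the raw source the comparison is against the full marker.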
Changelog
v7 (2026-07-05) — Runtime-hardening pass (from an external infrastructure-grade review; each item verified against real captured traffic and live on ik/minja). (1) repeat_nudge_after=0 now disables the no-progress warning (was a footgun: 0 >= 0 fired it on every tool turn). (2) Tool-output fidelity — the <tool_response> body is now rendered untrimmed, preserving leading/trailing whitespace in diffs, Makefile tabs, heredocs and exact compiler output (detection still runs on the trimmed copy). (3) Long-error detection — a response starting with Traceback/error:/fatal:/exception: counts as a failure regardless of length, catching pytest tracebacks and compiler dumps that the <500-char gate missed, without widening the substring false-positive surface. (4) Head+tail truncation for max_tool_response_chars — keeps both ends so the actionable failure summary (usually at the end of tool output) survives. Added scripts/measure_loop_nudges.py (reproduce the loop-nudge reduction metric on your own opencode.db) and expanded tests/test_template.py with the review's cases.
v6 (2026-07-05) — Cross-harness mutating_tools default. v5's default was OpenCode-centric; v6 ships a ~55-name union covering the native write/shell/browser-action tools of 16+ agentic harnesses (OpenCode, Claude Code, Hermes, Cline, Roo Code, OpenHands, Gemini CLI, Codex CLI, Goose, Continue, Cursor, Windsurf, Amazon Q, SWE-agent, pi, standard MCP filesystem), so the no-progress detector's mutating-exclusion works correctly regardless of harness. The match is now case-insensitive (Claude Code sends Write/Edit/Bash capitalized). The list is a template-internal comparison table — never rendered into the prompt, so it costs zero tokens at runtime regardless of length. MCP-prefixed tool variants are intentionally not matched (they rarely return identical results, so they don't false-fire); override per-harness if needed. Verified: all 18 tested harness write-tools go silent on identical-success loops while query/read-tool loops still fire.
v5 (2026-07-05) — Loop-detector rework, grounded in 381 real agentic sessions. (1) Removed the same-tool-streak detector — measurement showed ~90% of its firings were productive sequences (bash 105, webfetch 61, read 41, edit 27 out of 324 total), and its genuine catches were already covered by the no-progress and consecutive-error detectors. (2) No-progress detector now skips mutating tools via the mutating_tools kwarg (union default, overridable): identical "success" results from write/edit/etc. are productive, not a loop; only query/read tools returning identical results signal a stuck loop. Net: total loop-nudge events drop ~90% (427 → 42 across the sample) with every real stuck-loop catch preserved. The streak_nudge_after kwarg is retired.
v4 (2026-07-05) — (1) OpenAI tool-envelope unwrapping: <tools> now renders the inner function object instead of the {"type":"function","function":{...}} wrapper — ~40 tokens saved per tool definition, per request (≈6.5k tokens at 163 tools); callers sending bare function objects are unaffected. (2) Self-healing spilled tool calls: an assistant history message with an unclosed think block and a newline-preceded <tool_call> in its text gets repaired — text before the call becomes reasoning instead of leaking as a bad in-context example (inline-quoted mentions untouched). (3) think_on_tool_failure kwarg (default false = upstream behavior): set true to keep reasoning open during error escalation instead of prefilling an empty think block — models RL-trained to plan in-think may recover better when allowed to reason about failures. (4) Removed the ⚠️ glyph from all system warnings (plain "SYSTEM WARNING:"), avoiding decorative emoji in prompts for models that are unstable in emoji token space.
v3 (2026-07-04) — Media-marker injection guard. The literal string <__media__> inside assistant replies, reasoning, or tool responses poisons tokenization on some llama.cpp-family builds (verified: ik_llama returns 400 Failed to tokenize prompt for every subsequent turn — the conversation is bricked; mainline b9781 tolerates it). Sources in the wild: models hallucinating the marker into a reply, or fetched web/file content that mentions it (e.g. llama.cpp docs). v3 strips the marker from assistant content, reasoning, and tool responses at render time; user-message media parts are untouched, so real vision input is unaffected (render-tested + live-verified on ik_llama with image input).
v2 (2026-07-04) — Date awareness: when the runtime provides strftime_now (llama.cpp does; fed from real server time), the system block ends with "Today is YYYY-MM-DD." — guarded with is defined, so portable runtimes render unchanged. Added the llama.cpp integration notes above.
v1 (2026-07-04) — Initial release: froggeric v21.3 base + identical-response loop detector, same-tool-streak detector, harness-aware error markers, threshold kwargs. Template version string: qwen3.6-llamacpp-agentic-v1.
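The head+tail truncation mentioned in the v7 entry can be sketched in a few lines of Python. This is an illustration of the general technique, not the template's implementation: the function name, the marker text, and the even head/tail split are assumptions.

```python
def truncate_head_tail(text, max_chars, marker="\n[... truncated ...]\n"):
    """Keep the start and the end of `text` within a `max_chars` budget.

    The end of tool output usually carries the actionable failure summary,
    so it is preserved instead of being cut off.
    """
    if len(text) <= max_chars:
        return text
    budget = max_chars - len(marker)
    if budget <= 0:
        # Limit too small to fit the marker: fall back to a plain head cut.
        return text[:max_chars]
    head = budget // 2
    tail = budget - head
    # Slice from len(text) - tail so that tail == 0 yields an empty suffix.
    return text[:head] + marker + text[len(text) - tail:]
```

Compared with a plain prefix cut, the final lines of a long compiler or pytest dump survive, at the cost of losing the middle.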
Credits
The heavy lifting — years of Qwen template fixes, minja compatibility work, the thinking/caching machinery — is froggeric's. Report template-core issues there; report agentic-layer issues here.