Qwen 3.6 Agentic Chat Template for llama.cpp

A Jinja chat template for Qwen 3.6 tuned for llama.cpp / ik_llama servers driving agentic coding harnesses (OpenCode, Claude Code-style loops, aider, etc.).

Built on froggeric/Qwen-Fixed-Chat-Templates v21.3 — all of froggeric's fixes are inherited (KV-cache-stable history, thinking toggles, minja compatibility, two-tier error escalation, preserve_thinking, payload truncation). This repo adds the agentic hardening layer on top and tracks upstream releases.

Quick install (llama.cpp)

llama-server ... --jinja --chat-template-file chat_template.jinja

Recent llama.cpp builds auto-derive the tool-call parser from the template — the default XML tool format parses out of the box, including large multiline write payloads (the server does the JSON escaping mechanically; the model never escapes anything).

Older builds (pre-auto-parser, e.g. older ik_llama) only parse Hermes-JSON tool calls. For those, add:

--chat-template-kwargs '{"tool_call_format":"json"}'

Symptom that you need this: tool calls silently appear as plain <function=...> text instead of executing.

What this adds over upstream

1. Agentic loop detection (template-level, engine-agnostic)

Samplers (DRY, repetition penalties) only catch verbatim token repeats. Agentic loops repeat actions, not tokens — same tool called with varying arguments, or an act/verify pair cycling forever because the verification can never succeed. Two detectors run over the rendered history (deterministic per message position, so KV cache prefixes stay stable):

Identical-response detector: compares each tool response against the previous two. When results match at period 1 or 2 for repeat_nudge_after consecutive tool turns (default 6), no new information is entering the loop, and a system warning is injected telling the model that the approach, not the arguments, must change. Productive loops (edit → test → edit → test with changing test output) are untouched, because changing results reset the counter. This no-progress detector skips mutating tools (mutating_tools kwarg): writing 6 files returns 6 identical "success" strings, which is productive rather than a loop, whereas a query/read tool returning identical results is the stuck signal. Failure loops on any tool are still caught by the consecutive-error detector.

Measured on 381 real agentic sessions, this design fires ~90% less often than naive same-tool / any-result counting while preserving every genuine stuck-loop catch. (An earlier same-tool-streak detector was removed in v5: 90% of its firings were productive bash/read/webfetch sequences, and its real catches were already covered by the no-progress detector.)
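
The no-progress logic described above can be sketched in Python. This is a minimal illustration, not the template's Jinja code: the function name, the three-name mutating set, the exact point at which the counter reaches the threshold, and the choice to reset the streak on a mutating tool are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical subset; the template's real default is a ~55-name union.
DEFAULT_MUTATING = frozenset({"write", "edit", "bash"})


def no_progress_nudges(
    tool_turns: Iterable[Tuple[str, str]],
    repeat_nudge_after: int = 6,
    mutating_tools: frozenset = DEFAULT_MUTATING,
) -> List[int]:
    """Return the indexes of tool turns at which a warning would fire."""
    if repeat_nudge_after <= 0:  # 0 disables the detector
        return []
    fired: List[int] = []
    previous: List[str] = []  # trimmed bodies of earlier tool responses
    streak = 0
    for index, (tool_name, response) in enumerate(tool_turns):
        body = response.strip()
        if tool_name.lower() in mutating_tools:
            # Assumption: a mutating tool counts as progress and resets the streak.
            streak = 0
        elif body in previous[-2:]:
            # Same as the last response (period 1) or the one before it (period 2).
            streak += 1
        else:
            streak = 0  # new information entered the loop
        previous.append(body)
        if streak >= repeat_nudge_after:
            fired.append(index)
    return fired
```

Because the result at each index depends only on the turns before it, appending new turns never changes earlier decisions, which is the property that keeps a rendered prefix stable.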

2. Harness-aware error markers

Upstream's tool-failure detector targets CLI output (traceback, command not found, ...). Agentic harnesses wrap errors in their own envelopes, all of which evaded it. This template adds the following markers (verified against OpenCode's actual error strings): invalid input for tool, json parsing failed, unknown tool, error repairing, tool execution failed, err:.

Deliberately not matched: The user rejected permission ... and Tool execution aborted — these are user actions, and telling the model to "retry with corrected arguments" after a human said no is harmful. If you add markers, preserve this property.
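
As a rough Python sketch of the marker check, assuming plain case-insensitive substring matching (the template's actual matching rules, including its length gate, may differ; the function and constant names are hypothetical):

```python
# Markers listed above; matching is assumed to be a lowercase substring test.
HARNESS_ERROR_MARKERS = (
    "invalid input for tool",
    "json parsing failed",
    "unknown tool",
    "error repairing",
    "tool execution failed",
    "err:",
)


def looks_like_harness_error(tool_response: str) -> bool:
    """True if the response contains one of the harness error markers."""
    text = tool_response.lower()
    return any(marker in text for marker in HARNESS_ERROR_MARKERS)
```

The two user-action strings stay unmatched simply because no marker is a substring of them, so any marker added later should be checked against both.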

3. Everything is a kwarg

| kwarg | default | purpose |
| --- | --- | --- |
| `tool_call_format` | `xml` | `json` for engines whose parser only reads Hermes JSON (older ik_llama) |
| `repeat_nudge_after` | `6` | identical-result turns before the no-progress warning; `0` disables it (0 tokens if never hit) |
| `mutating_tools` | cross-harness union (~55 names) | space-delimited tool names excluded from no-progress detection, matched case-insensitively. Covers OpenCode, Claude Code, Hermes, Cline, Roo, OpenHands, Gemini CLI, Codex, Goose, Continue, Cursor, Windsurf, Amazon Q, SWE-agent, pi, MCP-filesystem out of the box. Override for a custom harness. MCP-prefixed tools (`mcp_x_write_file`, `puppeteer_puppeteer_navigate`) aren't matched by design: they rarely return identical results, so they don't false-fire; add the exact prefixed name if one ever does. |
| `think_on_tool_failure` | `false` | `true` keeps the think block open during error escalation (upstream empties it at 2+ consecutive failures); recommended for RL-trained reasoning models that plan recovery in-think |
| `enable_thinking` / `preserve_thinking` / `max_tool_arg_chars` / `max_tool_response_chars` | upstream | inherited from froggeric v21.3 |

Set via --chat-template-kwargs '{"repeat_nudge_after":4}' etc.

Set kwargs on the llama-server command line, not in your client config. Some harnesses (OpenCode's @ai-sdk/openai-compatible provider, #26233) silently drop client-side chat_template_kwargs / extraBody, so a kwarg set in opencode.json never reaches the server. --chat-template-kwargs '{...}' is applied server-side at render time and always works, regardless of client.
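
For launch scripts, the kwargs JSON can be generated rather than hand-quoted. The following is a small sketch using only the flags shown in this README; the helper name is hypothetical, and model and other server flags are left out just as the "..." above leaves them out.

```python
import json
import shlex


def llama_server_args(template_path: str, template_kwargs: dict) -> list:
    """Build the template-related part of a llama-server command line."""
    args = ["llama-server", "--jinja", "--chat-template-file", template_path]
    if template_kwargs:
        # Compact JSON keeps the argument short and free of stray spaces.
        payload = json.dumps(template_kwargs, separators=(",", ":"))
        args += ["--chat-template-kwargs", payload]
    return args


args = llama_server_args(
    "chat_template.jinja",
    {"tool_call_format": "json", "repeat_nudge_after": 4},
)
# shlex.join adds the shell quoting the JSON argument needs.
print(shlex.join(args))
```

Passing the list directly to subprocess avoids shell quoting altogether; the shlex.join form is only for printing a command to paste into a terminal.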

Files

chat_template.jinja — the template
chat_template_oneline.txt — single-line variant for engines that require it
tests/test_template.py — render-test suite (Python jinja2; run after any edit)
scripts/measure_loop_nudges.py — reproduce the loop-nudge-reduction metric on your own OpenCode sessions

llama.cpp integration notes (verified)

Grammar enforcement is active when llama.cpp's auto-parser (b8227+) can analyze the template — this template is auto-parser-legible (tool_mode: TAG_WITH_TAGGED). Tool calls are constrained by a GBNF grammar compiled from your tool schemas: malformed calls and invalid argument JSON are token-level impossible, verified up to 163 simultaneous tool schemas.
Grammar cannot prevent max_tokens truncation. If a tool call is cut off mid-arguments, the client sees "JSON parsing failed" on a truncated (not mis-escaped) string. Rule: your client's output-token limit must exceed reasoning budget + largest expected tool call.

Note for automated code review. The template strips llama.cpp's media placeholder token (the word media in double-underscores inside angle brackets) from assistant/tool history to prevent a tokenizer 400 on some builds. HTML-interpreting fetch tools often eat that angle-bracket token, making the guard lines look like if '' in content / split('') — an empty-string comparison that would "strip everything." That is a rendering artifact of the fetcher, not the code. Always review the RAW .jinja file; the guards test for the token's presence first and never touch normal content.
Vision/multimodal: llama-server replaces media with <__media__> markers inside content parts; this template passes them through untouched (supports_typed_content: false + supports_string_content: true is the correct, preferred caps state for llama.cpp — the server flattens for you). Verified end-to-end with image input.
supports_preserve_reasoning: true — llama.cpp's native preserve_thinking context variable is honored, so server-side reasoning-history controls work.
XML vs JSON for string-argument harnesses: many harnesses (OpenCode/Vercel AI SDK, etc.) send tool-call arguments as JSON strings, not objects. In XML mode the template can't decompose a JSON string into <parameter> blocks (minja has no JSON parser), so such history renders as raw JSON inside <function=> — a poor in-context example. If your harness sends string arguments, prefer JSON mode (tool_call_format: json), which renders them cleanly.
Testing pitfall: with --reasoning-format none, thinking streams into content — when testing tool_choice: required or grammar behavior, disable thinking (<|think_off|>) or budget generously, and remember the grammar permits content before a tool call; prose output under a small max_tokens is not evidence the grammar is absent.
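
The max_tokens rule from the truncation note above can be checked on the client side. This sketch assumes an OpenAI-compatible response dict in which a generation cut off by the token limit reports finish_reason "length"; the helper names and the safety margin are illustrative assumptions, not part of the template.

```python
def min_output_tokens(reasoning_budget: int, largest_tool_call: int, margin: int = 256) -> int:
    """Smallest output-token limit that leaves room for thinking plus one full tool call.

    The margin is an arbitrary illustrative cushion, not a measured value.
    """
    return reasoning_budget + largest_tool_call + margin


def was_truncated(response: dict) -> bool:
    """True if the first choice stopped because it hit the token limit."""
    choice = response["choices"][0]
    return choice.get("finish_reason") == "length"
```

If was_truncated returns True for a turn that also produced a "JSON parsing failed" error, raising the output-token limit is a more likely fix than changing the tool-call format.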

Changelog

v7 (2026-07-05) — Runtime-hardening pass (from an external infrastructure-grade review; each item verified against real captured traffic and live on ik/minja). (1) repeat_nudge_after=0 now disables the no-progress warning (was a footgun: 0 >= 0 fired it on every tool turn). (2) Tool-output fidelity — the <tool_response> body is now rendered untrimmed, preserving leading/trailing whitespace in diffs, Makefile tabs, heredocs and exact compiler output (detection still runs on the trimmed copy). (3) Long-error detection — a response starting with Traceback/error:/fatal:/exception: counts as a failure regardless of length, catching pytest tracebacks and compiler dumps that the <500-char gate missed, without widening the substring false-positive surface. (4) Head+tail truncation for max_tool_response_chars — keeps both ends so the actionable failure summary (usually at the end of tool output) survives. Added scripts/measure_loop_nudges.py (reproduce the loop-nudge reduction metric on your own opencode.db) and expanded tests/test_template.py with the review's cases.
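
The head+tail truncation in item (4) can be illustrated with a short Python sketch. The even split between head and tail, the marker text, and the treatment of None as "no limit" are assumptions for the example; the template's own split and marker may differ.

```python
from typing import Optional


def truncate_head_tail(
    text: str,
    max_chars: Optional[int],
    marker: str = "\n[... truncated ...]\n",
) -> str:
    """Shorten text to max_chars while keeping both its start and its end."""
    if max_chars is None or len(text) <= max_chars:
        return text
    if max_chars <= len(marker):
        # Too small to fit the marker: fall back to a plain head cut.
        return text[: max(max_chars, 0)]
    keep = max_chars - len(marker)
    head = (keep + 1) // 2  # the head gets the extra character when keep is odd
    tail = keep // 2
    return text[:head] + marker + text[len(text) - tail:]
```

Keeping the tail is what preserves a failure summary printed at the end of long tool output.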

v6 (2026-07-05) — Cross-harness mutating_tools default. v5's default was OpenCode-centric; v6 ships a ~55-name union covering the native write/shell/browser-action tools of 16+ agentic harnesses (OpenCode, Claude Code, Hermes, Cline, Roo Code, OpenHands, Gemini CLI, Codex CLI, Goose, Continue, Cursor, Windsurf, Amazon Q, SWE-agent, pi, standard MCP filesystem), so the no-progress detector's mutating-exclusion works correctly regardless of harness. The match is now case-insensitive (Claude Code sends Write/Edit/Bash capitalized). The list is a template-internal comparison table — never rendered into the prompt, so it costs zero tokens at runtime regardless of length. MCP-prefixed tool variants are intentionally not matched (they rarely return identical results, so they don't false-fire); override per-harness if needed. Verified: all 18 tested harness write-tools go silent on identical-success loops while query/read-tool loops still fire.

v5 (2026-07-05) — Loop-detector rework, grounded in 381 real agentic sessions. (1) Removed the same-tool-streak detector — measurement showed ~90% of its firings were productive sequences (bash 105, webfetch 61, read 41, edit 27 out of 324 total), and its genuine catches were already covered by the no-progress and consecutive-error detectors. (2) No-progress detector now skips mutating tools via the mutating_tools kwarg (union default, overridable): identical "success" results from write/edit/etc. are productive, not a loop; only query/read tools returning identical results signal a stuck loop. Net: total loop-nudge events drop ~90% (427 → 42 across the sample) with every real stuck-loop catch preserved. The streak_nudge_after kwarg is retired.

v4 (2026-07-05) — (1) OpenAI tool-envelope unwrapping: <tools> now renders the inner function object instead of the {"type":"function","function":{...}} wrapper — ~40 tokens saved per tool definition, per request (≈6.5k tokens at 163 tools); callers sending bare function objects are unaffected. (2) Self-healing spilled tool calls: an assistant history message with an unclosed think block and a newline-preceded <tool_call> in its text gets repaired — text before the call becomes reasoning instead of leaking as a bad in-context example (inline-quoted mentions untouched). (3) think_on_tool_failure kwarg (default false = upstream behavior): set true to keep reasoning open during error escalation instead of prefilling an empty think block — models RL-trained to plan in-think may recover better when allowed to reason about failures. (4) Removed the ⚠️ glyph from all system warnings (plain "SYSTEM WARNING:"), avoiding decorative emoji in prompts for models that are unstable in emoji token space.
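
The envelope unwrapping in item (1) amounts to the following, shown here as a Python sketch of the idea rather than the template's Jinja code (the function name and the sample tool are hypothetical):

```python
def unwrap_tool(tool: dict) -> dict:
    """Return the inner function object of an OpenAI-style tool envelope."""
    inner = tool.get("function")
    if tool.get("type") == "function" and isinstance(inner, dict):
        return inner
    return tool  # bare function objects pass through unchanged


wrapped = {
    "type": "function",
    "function": {"name": "read", "parameters": {"type": "object"}},
}
bare = {"name": "read", "parameters": {"type": "object"}}
```

Both inputs produce the same rendered definition, which is why callers that already send bare function objects see no change.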

v3 (2026-07-04) — Media-marker injection guard. The literal string <__media__> inside assistant replies, reasoning, or tool responses poisons tokenization on some llama.cpp-family builds (verified: ik_llama returns 400 Failed to tokenize prompt for every subsequent turn — the conversation is bricked; mainline b9781 tolerates it). Sources in the wild: models hallucinating the marker into a reply, or fetched web/file content that mentions it (e.g. llama.cpp docs). v3 strips the marker from assistant content, reasoning, and tool responses at render time; user-message media parts are untouched, so real vision input is unaffected (render-tested + live-verified on ik_llama with image input).
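
A Python sketch of the guard's behavior, assuming string message content and a reasoning field named reasoning_content (the field name and function name are assumptions for the example; the template itself works on the Jinja message objects):

```python
MEDIA_MARKER = "<__media__>"


def strip_media_marker(message: dict) -> dict:
    """Remove the literal media marker from assistant and tool history."""
    if message.get("role") not in ("assistant", "tool"):
        return message  # user messages keep their real media markers
    cleaned = dict(message)
    for field in ("content", "reasoning_content"):
        value = cleaned.get(field)
        # Check for the marker first so normal content is never touched.
        if isinstance(value, str) and MEDIA_MARKER in value:
            cleaned[field] = value.replace(MEDIA_MARKER, "")
    return cleaned
```

The presence check before the replace mirrors the template's guard: messages without the marker are returned with identical content.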

v2 (2026-07-04) — Date awareness: when the runtime provides strftime_now (llama.cpp does; fed from real server time), the system block ends with "Today is YYYY-MM-DD." — guarded with is defined, so portable runtimes render unchanged. Added the llama.cpp integration notes above.

v1 (2026-07-04) — Initial release: froggeric v21.3 base + identical-response loop detector, same-tool-streak detector, harness-aware error markers, threshold kwargs. Template version string: qwen3.6-llamacpp-agentic-v1.

Credits

The heavy lifting — years of Qwen template fixes, minja compatibility work, the thinking/caching machinery — is froggeric's. Report template-core issues there; report agentic-layer issues here.
