chat template v21.3

#51

by froggeric - opened 2 days ago

Discussion

froggeric

Owner 2 days ago

•

edited 2 days ago

New version: v21.3

A few fixes from recent PRs:

preserve_thinking defaults to true again for better KV cache hits
Narrowed the tool error scanner to the first 80 chars so long successful JSON responses stop triggering false retry loops
Gated <|think_off|> tags to user/system messages to prevent prompt injection from tool outputs
The parser now requires </think> at the start of a line so it stops cutting off messages that just quote the tag
Restored the \n\n spacing for the non-thinking generation prompt to fix the streaming duplication bug
Added support for Anthropic's message.thinking parameter
Reasoning Bypass Hallucination Fix
Optional JSON Tool Format Kwarg

Big thanks to @Moore2877 @choongng and @batsclamp for the PRs and bug reports.

PS: initial v21 included a PR to switch JSON tool format. I reverted it because it broke vLLM's qwen3_coder parser. Qwen is trained on XML <function=name>, which would have mean no more universal compatibility.

veldierin

2 days ago

•

edited 2 days ago

@froggeric - can you please adjust the <IMPORTANT> block to this? there was a discussion earlier with regards to the </think> showing up sometimes as prefix, and one mentioned two lines being adjusted. this is what ended up fixing it for me.

<IMPORTANT>
Reminder:
- You can use the <think></think> block to plan your next tool call OR to synthesize data and formulate your final response to the user.
- ALL explanation and reasoning MUST be placed strictly inside the <think></think> block.
- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags.
- If you choose to call a tool, you MUST output the <tool_call> block IMMEDIATELY after thinking, with NO conversational text before it.
- The <tool_call> and <function> tags MUST be at the very beginning of a new line, with NO spaces or indentation before them.
- To call multiple functions, output a separate, completely closed <tool_call></tool_call> block for EACH function. Do NOT nest <tool_call> blocks.
- If you have all necessary data, provide your final answer directly to the user without any tool call.
</IMPORTANT>

froggeric

Owner 2 days ago

This is a brilliant catch! By removing the explicit </think> token from the imperative rules, we remove the token bias that causes the model to hallucinate it as a prefix when reasoning is disabled.

I have just merged this optimization and released it as v21.2. Thank you for the contribution!

froggeric changed discussion title from chat template v21.1 to chat template v21.2 2 days ago

froggeric changed discussion title from chat template v21.2 to chat template v21.3 2 days ago

laser50

2 days ago

•

edited 2 days ago

Hey, a (mildly stupid) question; Do you actually suggest turning on tool_call_format="json" in the kwargs when using Llama.cpp? I'm currently not seeing any issues without it, but if it would be better to do so, I will.

Moore2877

2 days ago

Hey, a (mildly stupid) question; Do you actually suggest turning on tool_call_format="json" in the kwargs when using Llama.cpp? I'm currently not seeing any issues without it, but if it would be better to do so, I will.

My build of llama.cpp started grammar correcting the xml and then never running the tool calls recently, but I just started using this template as well. It could have been caused by a recent llama build but my solution was to switch it to JSON.

froggeric

Owner 2 days ago

Hey, a (mildly stupid) question; Do you actually suggest turning on tool_call_format="json" in the kwargs when using Llama.cpp? I'm currently not seeing any issues without it, but if it would be better to do so, I will.

If you are not seeing any issue I would not recommend it. I think recent llama.cpp builds have full support for Qwen XML. I certainly never experienced any problems with it. I used various recent builds of llama.cpp , including custom ones at the beginning for the MTP PRs. I have used the OpenAI and Anthropic endpoints, with Claude Code, Qwen Code, iFlow, with multiples long turns sessions, heavy with tool use.

This is for what I believe are only a few specific cases, with maybe outdated or custom servers or harnesses, which do not yet fully support Qwen XML. I have tried those, but this could include Hermes Agent or LMStudio.

fatfacehugger

2 days ago

But the dedicated readme section for the JSON bit says if you are using something like llama-server then JSON will be better? I'm using latest llama.cpp but actually connecting via llama-server. Please clarify! 😃

froggeric

Owner 2 days ago

But the dedicated readme section for the JSON bit says if you are using something like llama-server then JSON will be better? I'm using latest llama.cpp but actually connecting via llama-server. Please clarify! 😃

Good point. I have updated the readme

choongng

1 day ago

But the dedicated readme section for the JSON bit says if you are using something like llama-server then JSON will be better? I'm using latest llama.cpp but actually connecting via llama-server. Please clarify! 😃

I can confirm from a fair amount of usage that llama-server + Qwen works fine without the extra setting.

evetsagg

about 8 hours ago

In opencode, some times it won't respond after thinking has finished. Any ideas?

laser50

about 3 hours ago

In opencode, some times it won't respond after thinking has finished. Any ideas?

Experienced the same on Llama.cpp, my fallback has been a lot more active (no response after thinking)

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment