Instructions to use froggeric/Qwen-Fixed-Chat-Templates with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use froggeric/Qwen-Fixed-Chat-Templates with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir Qwen-Fixed-Chat-Templates froggeric/Qwen-Fixed-Chat-Templates
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
chat template v21.3
New version: v21.3
A few fixes from recent PRs:
preserve_thinkingdefaults to true again for better KV cache hits- Narrowed the tool error scanner to the first 80 chars so long successful JSON responses stop triggering false retry loops
- Gated
<|think_off|>tags to user/system messages to prevent prompt injection from tool outputs - The parser now requires
</think>at the start of a line so it stops cutting off messages that just quote the tag - Restored the
\n\nspacing for the non-thinking generation prompt to fix the streaming duplication bug - Added support for Anthropic's
message.thinkingparameter - Reasoning Bypass Hallucination Fix
- Optional JSON Tool Format Kwarg
Big thanks to @Moore2877 @choongng and @batsclamp for the PRs and bug reports.
PS: initial v21 included a PR to switch JSON tool format. I reverted it because it broke vLLM's qwen3_coder parser. Qwen is trained on XML <function=name>, which would have mean no more universal compatibility.
@froggeric - can you please adjust the <IMPORTANT> block to this? there was a discussion earlier with regards to the </think> showing up sometimes as prefix, and one mentioned two lines being adjusted. this is what ended up fixing it for me.
<IMPORTANT>
Reminder:
- You can use the <think></think> block to plan your next tool call OR to synthesize data and formulate your final response to the user.
- ALL explanation and reasoning MUST be placed strictly inside the <think></think> block.
- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags.
- If you choose to call a tool, you MUST output the <tool_call> block IMMEDIATELY after thinking, with NO conversational text before it.
- The <tool_call> and <function> tags MUST be at the very beginning of a new line, with NO spaces or indentation before them.
- To call multiple functions, output a separate, completely closed <tool_call></tool_call> block for EACH function. Do NOT nest <tool_call> blocks.
- If you have all necessary data, provide your final answer directly to the user without any tool call.
</IMPORTANT>
This is a brilliant catch! By removing the explicit </think> token from the imperative rules, we remove the token bias that causes the model to hallucinate it as a prefix when reasoning is disabled.
I have just merged this optimization and released it as v21.2. Thank you for the contribution!
Hey, a (mildly stupid) question; Do you actually suggest turning on tool_call_format="json" in the kwargs when using Llama.cpp? I'm currently not seeing any issues without it, but if it would be better to do so, I will.
Hey, a (mildly stupid) question; Do you actually suggest turning on tool_call_format="json" in the kwargs when using Llama.cpp? I'm currently not seeing any issues without it, but if it would be better to do so, I will.
My build of llama.cpp started grammar correcting the xml and then never running the tool calls recently, but I just started using this template as well. It could have been caused by a recent llama build but my solution was to switch it to JSON.
Hey, a (mildly stupid) question; Do you actually suggest turning on tool_call_format="json" in the kwargs when using Llama.cpp? I'm currently not seeing any issues without it, but if it would be better to do so, I will.
If you are not seeing any issue I would not recommend it. I think recent llama.cpp builds have full support for Qwen XML. I certainly never experienced any problems with it. I used various recent builds of llama.cpp , including custom ones at the beginning for the MTP PRs. I have used the OpenAI and Anthropic endpoints, with Claude Code, Qwen Code, iFlow, with multiples long turns sessions, heavy with tool use.
This is for what I believe are only a few specific cases, with maybe outdated or custom servers or harnesses, which do not yet fully support Qwen XML. I have tried those, but this could include Hermes Agent or LMStudio.
But the dedicated readme section for the JSON bit says if you are using something like llama-server then JSON will be better? I'm using latest llama.cpp but actually connecting via llama-server. Please clarify! 😃
But the dedicated readme section for the JSON bit says if you are using something like llama-server then JSON will be better? I'm using latest llama.cpp but actually connecting via llama-server. Please clarify! 😃
Good point. I have updated the readme
But the dedicated readme section for the JSON bit says if you are using something like llama-server then JSON will be better? I'm using latest llama.cpp but actually connecting via llama-server. Please clarify! 😃
I can confirm from a fair amount of usage that llama-server + Qwen works fine without the extra setting.
In opencode, some times it won't respond after thinking has finished. Any ideas?
In opencode, some times it won't respond after thinking has finished. Any ideas?
Experienced the same on Llama.cpp, my fallback has been a lot more active (no response after thinking)