Runtime-adaptive tool-call format: XML on vLLM, Hermes JSON on llama.cpp — no config needed

#52

by Moore2877 - opened 3 days ago

base: refs/heads/main

←

from: refs/pr/52

Discussion Files changed

+55

-3

Runtime-adaptive tool-call format: XML on vLLM, Hermes JSON on llama.cpp — no config needed745c2be7

Moore2877

3 days ago

Follow-up to the PR 45 revert — both ecosystems, one template

The revert was the right call for vLLM (qwen3_coder parser expects XML), but it leaves llama.cpp/ik_llama users with silently-dropped tool calls (their Hermes-style parsers cannot parse the XML body — the <tool_call> opener is consumed by the grammar trigger and the call is dumped into content). Neither format works everywhere. This PR makes the template detect its own runtime and emit the format that engine's parser actually understands:

{%- set tool_call_format = tool_call_format if tool_call_format is defined else ('xml' if cycler is defined else 'json') %}

How detection works: Python Jinja2 (vLLM, SGLang, transformers — all use ImmutableSandboxedEnvironment) defines the built-in globals cycler/joiner/lipsum. minja and llama.cpp's C++ Jinja runtime do not (verified at source: common/jinja/value.cpp registers only raise_exception, namespace, strftime_now). cycler is defined is therefore a zero-cost, side-effect-free engine fingerprint:

Python Jinja2 → XML (matches --tool-call-parser qwen3_coder; identical output to current main)
minja/C++ → Hermes JSON (matches llama.cpp-family parsers)
Explicit override for off-convention setups (e.g. vLLM with the hermes parser): chat_template_kwargs: {"tool_call_format": "json"} / llama.cpp --chat-template-kwargs

A nice property on newer llama.cpp: it derives its tool parser by rendering the template with its own runtime, so the detected branch and the learned parser agree by construction.

Both the instruction block and the assistant-history rendering are format-conditional; the XML branch is your current main verbatim (zero behavior change for vLLM users), and max_tool_arg_chars truncation works in both branches.

Verified:

Dual-runtime render matrix (Jinja2 sandboxed env → XML; same env with cycler/joiner/lipsum removed to emulate minja → JSON; kwarg overrides both directions; truncation both formats; v21.x fixes untouched)
Live end-to-end on ik_llama: auto-selected JSON, model produced a structured tool_calls array with correct arguments (finish_reason: tool_calls)

Small unrelated note: main's template_version still reads qwen3.6-froggeric-v20 while the README announces v21.1 — left untouched here to keep the diff focused, but you may want to bump it.

Thanks for the fast turnaround on the previous series!

Moore2877

3 days ago

Rebased onto v21.2 — the updated <IMPORTANT> wording from your reasoning-bypass fix is now carried into both format branches, and the XML branch remains your current main verbatim. Should merge cleanly now.

Re: the sequential-rebase request from the earlier series — understood for next time; since you resolved those manually this is now a single self-contained PR, so there's nothing else in flight. Re-verified after rebase: dual-runtime render matrix (Jinja2 → XML, minja-emulated → JSON, kwarg overrides both directions) plus a live tool call on ik_llama.

Rebase onto v21.2 (carries the reasoning-bypass <IMPORTANT> wording into both format branches)dbb03ad6

Moore2877 changed pull request status to closed 3 days ago

Moore2877

3 days ago

Superseded by a fresh PR based on current main (v21.2) — the oneline file is single-line so it can never auto-merge once main moves; a clean re-base-PR was the only fix. Closing this one.

froggeric

Owner 3 days ago

You raise a very valid point regarding specific C++ engine setups (like llama-server's internal JSON interception), but implementing an automatic engine-detection switch (cycler is defined) is extremely dangerous for the broader ecosystem.

As you correctly noted, Qwen is trained specifically on <function=name> XML. While llama-server's internal tool parser might struggle with XML, the vast majority of llama.cpp and LM Studio users rely on external frameworks (LangChain, SillyTavern, custom scripts) that simply parse the raw text output. The model's native XML tool format works perfectly fine as raw text. Forcing the model into an off-distribution JSON format fundamentally degrades its intelligence and reasoning loops. If we auto-detected C++ and forced JSON, we would silently ruin the model's performance for the entire C++ ecosystem just to fix llama-server intercept compatibility.

However, your idea for supporting both formats is excellent.

I have just released v21.3, which implements your suggestion as a dedicated opt-in override:
You can now pass --chat-template-kwargs '{"tool_call_format": "json"}' to explicitly instruct the template to output Hermes JSON instead of XML.

Note: In v21.3, we explicitly disable the max_tool_arg_chars truncation feature if you opt into the JSON format. Truncating a serialized JSON string creates structurally malformed JSON that poisons the model's in-context history and crashes JSON parsers. It is much safer to pass the full JSON object.

Thank you for bringing this up! Having the opt-in kwarg is the perfect compromise to support internal interceptors without degrading the default model experience.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment