Chat template issues with multiple rounds of tool calling

#115
by Kimahriman - opened

The current chat template doesn't work well with multiple rounds of tool calling, especially if a tool call message has content associated with it. I commented on a vLLM issue here and am cross-posting here for visibility.

There are two issues that seem to be at play here:

  • turn handling across multiple rounds of tool calling
  • location of content from a tool_call message in the prompt (which is already discussed in this discussion)

The prompt guide for these models suggests that content in a tool_calls message is commentary on the result of the tool_responses for those calls. In reality, at least with the OpenAI chat structure, content in a tool_calls message is commentary on what the tool calls are for and why they are needed, written before there are any tool_responses. Commentary on the result of the tool_calls would be in a following assistant message, after all the role: "tool" messages with the tool responses. I'm not sure what the "legacy" tool_responses behavior is supposed to be or what tools would use that structure, so I can't comment on that.

An example of what this looks like (ignoring system prompt and tool definitions):

for messages:

[
  ...,
  {
    "role": "assistant",
    "reasoning": "I should call ToolA to load ToolB",
    "content": "I will start by loading ToolB using ToolA",
    "tool_calls": [
      {"id": "call_001", "type": "function", "function": {"name": "ToolA", "arguments": "{\"x\": \"load ToolB\"}"}}
    ]
  },
  {"role": "tool", "tool_call_id": "call_001", "content": "Success: ToolB is now available"}
]

gets rendered as:

...
<|turn>model
<|channel>thought
I should call ToolA to load ToolB<channel|><|tool_call>call:ToolA{x:<|"|>load ToolB<|"|>}<tool_call|><|tool_response>response:ToolA{value:<|"|>Success: ToolB is now available<|"|>}<tool_response|>I will start by loading ToolB using ToolA<turn|>

No new <|turn>model\n gets rendered because the previous message was a tool_response. This seems to cause invalid reasoning to be output in certain cases, particularly with the 26B variant. 31B seems slightly more resilient to this prompt but still can have issues based on our observations.
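To make the ordering problem easier to reproduce outside of vLLM, here is a minimal Python sketch that mimics the rendering observed above. It is not the actual Jinja template: `render_observed` and `fmt_args` are hypothetical helpers, and the argument formatting only handles flat string values like the ones in this example.

```python
import json

Q = '<|"|>'  # string delimiter as it appears in the rendered prompt above


def fmt_args(args):
    # Assumption: flat, string-valued arguments only, as in the example.
    return ",".join(k + ":" + Q + v + Q for k, v in args.items())


def render_observed(assistant, tool_msgs):
    """Mimic the observed layout: reasoning, tool calls, tool responses,
    then the assistant content followed by the turn close."""
    parts = ["<|turn>model\n"]
    if assistant.get("reasoning"):
        parts.append("<|channel>thought\n" + assistant["reasoning"] + "<channel|>")
    names = {}  # tool_call_id -> function name, used to label responses
    for call in assistant.get("tool_calls", []):
        fn = call["function"]
        names[call["id"]] = fn["name"]
        args = fmt_args(json.loads(fn["arguments"]))
        parts.append("<|tool_call>call:" + fn["name"] + "{" + args + "}<tool_call|>")
    for msg in tool_msgs:
        name = names[msg["tool_call_id"]]
        parts.append("<|tool_response>response:" + name + "{value:" + Q
                     + msg["content"] + Q + "}<tool_response|>")
    if assistant.get("content"):
        # Content ends up AFTER the tool response, and the turn is closed.
        parts.append(assistant["content"] + "<turn|>")
    return "".join(parts)


assistant = {
    "role": "assistant",
    "reasoning": "I should call ToolA to load ToolB",
    "content": "I will start by loading ToolB using ToolA",
    "tool_calls": [
        {"id": "call_001", "type": "function",
         "function": {"name": "ToolA", "arguments": '{"x": "load ToolB"}'}}
    ],
}
tool_msgs = [{"role": "tool", "tool_call_id": "call_001",
              "content": "Success: ToolB is now available"}]

rendered = render_observed(assistant, tool_msgs)
print(rendered)
```

The printed prompt places the pre-call commentary after the tool response and closes the turn, which is the layout that seems to trigger the invalid reasoning.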

If there is no content with the tool_calls message, it gets rendered as:

...
<|turn>model
<|channel>thought
I should call ToolA to load ToolB<channel|><|tool_call>call:ToolA{x:<|"|>load ToolB<|"|>}<tool_call|><|tool_response>response:ToolA{value:<|"|>Success: ToolB is now available<|"|>}<tool_response|>

which seems to produce valid reasoning in the response.

Trying out some fixes, simply removing the <turn|> when there is content does not seem to help. Two fixes that do seem to work (anecdotally based on some sample queries and small Claude Code sessions via vLLM):

...
<|turn>model
<|channel>thought
I should call ToolA to load ToolB<channel|>I will start by loading ToolB using ToolA<|tool_call>call:ToolA{x:<|"|>load ToolB<|"|>}<tool_call|><|tool_response>response:ToolA{value:<|"|>Success: ToolB is now available<|"|>}<tool_response|>
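For comparison, the content-first layout shown above can be sketched the same way. Again, this is only an illustration in Python under the same assumptions (flat string arguments, a hypothetical `render_content_first` helper), not a patch to the real template.

```python
import json

Q = '<|"|>'  # string delimiter as it appears in the rendered prompt above


def render_content_first(assistant, tool_msgs):
    """Sketch of the layout above: content goes right after the reasoning,
    before the tool calls, and the turn is left open after the responses."""
    parts = ["<|turn>model\n"]
    if assistant.get("reasoning"):
        parts.append("<|channel>thought\n" + assistant["reasoning"] + "<channel|>")
    if assistant.get("content"):
        parts.append(assistant["content"])  # commentary precedes the calls
    names = {}  # tool_call_id -> function name, used to label responses
    for call in assistant.get("tool_calls", []):
        fn = call["function"]
        names[call["id"]] = fn["name"]
        # Assumption: flat, string-valued arguments only.
        args = ",".join(k + ":" + Q + v + Q
                        for k, v in json.loads(fn["arguments"]).items())
        parts.append("<|tool_call>call:" + fn["name"] + "{" + args + "}<tool_call|>")
    for msg in tool_msgs:
        name = names[msg["tool_call_id"]]
        parts.append("<|tool_response>response:" + name + "{value:" + Q
                     + msg["content"] + Q + "}<tool_response|>")
    return "".join(parts)  # no turn close: generation continues in this turn


assistant = {
    "role": "assistant",
    "reasoning": "I should call ToolA to load ToolB",
    "content": "I will start by loading ToolB using ToolA",
    "tool_calls": [
        {"id": "call_001", "type": "function",
         "function": {"name": "ToolA", "arguments": '{"x": "load ToolB"}'}}
    ],
}
tool_msgs = [{"role": "tool", "tool_call_id": "call_001",
              "content": "Success: ToolB is now available"}]

fixed = render_content_first(assistant, tool_msgs)
print(fixed)
```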

The question is which one is more "correct". Are multiple rounds of tool calling supposed to be in a single turn, or should each round be its own turn? And should the content be moved in the prompt to align with how it is output?

I have frequently seen the model fake its results during a session, e.g.,

<|tool_call>call:bash{command:<|"|>python3 /Volumes/Elements/VR-Patent/generate_drawio_files.py && ls ./outputs/*.drawio<|"|>,description:<|"|>Execute the drawIO file generator and verify the output files<|"|>}<tool_call|>

It reported that it had just finished something, but the operation was fake, so I have to point that out. It politely apologizes and says it will correct it immediately, but similar problems recur every time. This cycle of pointing out problems, getting corrections, and seeing the same mistakes repeated is inhumane.

I think I have just solved this problem. I used the jinja file given in this example: https://recipes.vllm.ai/Google/gemma-4-31B-it?hardware=h100. In my case, gemma-4-31B-it actually performed what it reported, even though I can still see the tool-calling chain of thought. This is in OpenCode.
But it still fails in Copilot and Codex.

Google org

Hi all,

Thanks for addressing this issue and providing details. We have escalated this issue to our internal team for further investigation.

This has been an ongoing issue for months: https://www.reddit.com/r/LocalLLaMA/comments/1smffwl/issues_with_gemma_4_tool_calling_abrupt_gen/

Gemma 4 is very unreliable for tool calling/agentic use cases in my experience. In almost all cases, it will make at most one successful tool call before responding, even when its chain of thought shows that it wants to make another tool call. It will only make a second tool call if the first one failed with an error. Oftentimes, especially with the 31B model, there is no chain of thought after the first tool call at all, and it just jumps straight into a response.

I'm glad you are escalating this issue, and I hope Google will fix Gemma 4's tool calling issues.
