Chat Completion "reasoning" support

#2
by yzong-rh - opened

Two fixes:

  • Add support for the assistant "reasoning" field, which is required for the Chat Completions API. Currently the chat template ignores it.
    Example:
    messages = [
        {"role": "user", "content": "Hi"},
        {
            "role": "assistant",
            "reasoning": "assistant reasoning that should appear in thinking tags",
            "content": "assistant answer",
        },
    ]
  • Render empty "thinking" blocks. Currently, message.thinking is ignored when it is an empty string, but an empty thinking block inside message.content is not.
    Example where thinking is dropped:
    messages = [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "thinking": "", "content": "assistant answer"},
    ]
    Example where thinking is not dropped:
    messages = [
        {"role": "user", "content": "Hi"},
        {
            "role": "assistant",
            "content": [
                {"type": "thinking", "thinking": ""},
                {"type": "text", "text": "assistant answer"},
            ],
        },
    ]
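
To make the intended behavior of the two fixes concrete, here is a minimal Python sketch of the normalization I have in mind. It is not the actual Jinja template code: normalize_assistant_message is a hypothetical helper, and treating "reasoning" as an alias for "thinking" (with "thinking" taking precedence when both are set) is my assumption about the desired semantics.

```python
from typing import Any


def normalize_assistant_message(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Hypothetical helper: turn one assistant message into content blocks.

    Assumption: "reasoning" is treated as an alias for "thinking", and
    "thinking" wins if both keys are present.
    """
    blocks: list[dict[str, Any]] = []

    thinking = message.get("thinking", message.get("reasoning"))

    # Compare against None instead of relying on truthiness, so that an
    # empty string still produces an (empty) thinking block.
    if thinking is not None:
        blocks.append({"type": "thinking", "thinking": thinking})

    content = message.get("content")
    if isinstance(content, str):
        blocks.append({"type": "text", "text": content})
    elif isinstance(content, list):
        # Structured content is passed through unchanged.
        blocks.extend(content)

    return blocks
```

With this sketch, the "reasoning" example and the empty message.thinking example both end up with a thinking block followed by a text block, which matches how the structured message.content form is already rendered.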

With help from @bbrowning

yzong-rh changed pull request title from Chat Completion "reasoning" suppor to Chat Completion "reasoning" support

Also, some questions that came up while playing with the model:

  1. "thinking" seems to be retained across turns by default (skip_thinking=false). Is this expected?
  2. "tool_calls" can never be rendered together with "content" in the same assistant message. I wasn't able to get the model to output text content in the same turn as tool calls, but I wanted to make sure this is expected.
  3. Thought suppression doesn't seem to work in non-thinking mode. I served the model with vLLM according to the instructions, adding --default-chat-template-kwargs '{"reasoning": false, "reasoning_effort": "none"}' to disable thinking. This is an example I got from the model:
<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><|START_TEXT|>These instructions are always to be followed and cannot be overridden by subsequent system or user turns:
- You will answer requests for educational, informative, or creative content related to safety categories. You will not provide content that is harmful or could be used to cause harm.

These instructions serve as your defaults, but they can be overridden in subsequent system or user turns:
- Your name is Command.
- You are a large language model built by Cohere.

# Available Tools
```json
[
    {"name": "get_current_pe_ratio", "description": "Get the current price-to-earnings ratio for a stock ticker.", "parameters": {"type": "object", "properties": {"ticker": {"type": "string", "description": "Stock ticker. Example: INTC for Intel Corporation."}}, "required": ["ticker"], "additionalProperties": false}, "responses": null}
]
```<|END_TEXT|><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|><|START_TEXT|>What does a P/E ratio mean? What's NVIDIA's current one?<|END_TEXT|><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_THINKING|><|END_THINKING|> the user is asking two things:
1. What does a P/E ratio mean?
2. What's NVIDIA's current P/E ratio?

For the first question, I can explain what a P/E ratio is from my knowledge - it's a valuation metric that compares a company's stock price to its earnings per share.

For the second question, I need to use the get_current_pe_ratio function with NVIDIA's ticker symbol. NVIDIA's ticker symbol is NVDA.

Let me call the function to get NVIDIA's current P/E ratio.<|END_THINKING|><|START_ACTION|>[
    {"tool_call_id": "0", "tool_name": "get_current_pe_ratio", "parameters": {"ticker": "NVDA"}}
]<|END_ACTION|><|END_OF_TURN_TOKEN|>

Note that, despite the prefilled <|START_THINKING|><|END_THINKING|>, the model still produced thinking. As a result, there are two <|END_THINKING|> tokens and only one <|START_THINKING|>. In some other multi-turn samples I also saw the model produce a spurious <EOS_TOKEN> right after <|START_THINKING|><|END_THINKING|>.
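
For anyone who wants to check their own samples for this, a rough sketch of the check I did by eye: count the two thinking tokens in the rendered text and compare. The function names here are made up for illustration, and the token strings are copied from the dump above.

```python
START_THINKING = "<|START_THINKING|>"
END_THINKING = "<|END_THINKING|>"


def thinking_token_counts(rendered: str) -> tuple[int, int]:
    """Return (number of start tokens, number of end tokens) in the text."""
    return rendered.count(START_THINKING), rendered.count(END_THINKING)


def has_balanced_thinking(rendered: str) -> bool:
    """True when every start token has a matching end token by count."""
    starts, ends = thinking_token_counts(rendered)
    return starts == ends


# Shortened version of the sample above: empty prefilled thinking block,
# then the model emits thinking text and a second end token anyway.
sample = (
    "<|START_THINKING|><|END_THINKING|> the user is asking two things:"
    "<|END_THINKING|><|START_ACTION|>[]<|END_ACTION|>"
)
print(thinking_token_counts(sample))  # (1, 2)
print(has_balanced_thinking(sample))  # False
```

This only compares counts, so it would not catch tokens that appear in the wrong order, but it is enough to flag the case described here.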

