Can we add something like Qwen's preserve_thinking to the chat template?

#15
by tarruda - opened

Qwen 3.6 added preserve_thinking template argument which renders reasoning blocks of previous turns when enabled.

I'm curious if this is something we can add to Step 3.x Flash templates or if it requires the model to be trained to use that.

The reason I want this is because it can make much better use of kv cache, which is very important when doing inference on something like Apple Silicon which has slow prompt processing.

For example, I can ask the model to do some investigation in the codebase and it spends a few minutes thinking and doing tool calls. Then when it finishes and I send a follow up message, it will spend quite some time re-processing the context because we suddenly removed all the reasoning content (but kept the tool call/responses, which probably accounts for most of the tokens in the context).

Sign up or log in to comment