Some practical tips for using Step 3.7 GGUFs

#6
by tarruda - opened

I don't know if this is a bug in llama.cpp implementation, but I have been able to reproduce infinite reasoning loops using the pi agent when using with a local 4-bit GGUF, and so far couldn't reproduce in the official API.

Nevertheless, this bug can be worked with the correct llama.cpp parameters, making it quite usable. I don't know if this will affect model performance significantly, but this is what has been working for me:

llama-server --no-mmap --no-warmup -hf stepfun-ai/Step-3.7-Flash-GGUF:iq4_xs --ctx-size 262144 -np 1 \
   --temp 1.0 --top-p 0.95 \
  --reasoning-budget 16384 \
  --reasoning-budget-message ". Actually, let me stop here. I have been thinking about this for long enough, will just reply now." \
  --spec-type ngram-simple

This will limit the reasoning tokens to 16384, which should be enough for most tasks. If the model reaches that threshold, the message will be appended to the thinking block and the model will reply immediately. It also enables ngram speculative decoding which can speed up reasoning loops significantly.

Thinking budget can also be set on the request with the following body parameters:

{
"thinking_budget_tokens": 16384
}

*Update

I have changed the reasoning-budget message to work better in agentic scenarios. Sometimes the model would do something like "Let me search for xyz..." when the budget is spent. Simply closing the think tag in these cases would still cause the model to emit a tool call (grep in this case).

Sign up or log in to comment