arxiv:2606.12674

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Published on Jun 10

Submitted by Leo Y on Jun 12

IBM Research



Abstract

Evoflux enables compact language models to execute tool workflows more reliably by using evolutionary search to repair failed plans during inference, significantly improving execution feasibility compared to traditional fine-tuning methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
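The repair loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the catalog, the workflow encoding, the three edit operators, and the names `errors`, `mutate`, and `evolve` are all assumptions made for illustration. A static violation count stands in for real execution feedback, and meta-guided redesign is omitted.

```python
import difflib
import random

# Hypothetical toy catalog: tool name -> required parameter names.
CATALOG = {"search_docs": {"query"}, "fetch_page": {"url"}, "summarize": {"text"}}

# A workflow is a tuple of steps. A step is (tool_name, params), where params is a
# tuple of (name, value) string pairs and a value "$k" means "output of step k".

def errors(workflow):
    """Count violations: unknown tools, schema mismatches, invalid references."""
    count = 0
    for i, (tool, params) in enumerate(workflow):
        names = {name for name, _ in params}
        if tool not in CATALOG:
            count += 1
        else:
            count += len(CATALOG[tool] ^ names)  # missing or unexpected parameters
        for _, value in params:
            if value.startswith("$") and int(value[1:]) >= i:
                count += 1  # refers to a step that has not run yet
    return count

def resolve_tool(step, i):
    """Structured edit: map an unknown tool name to the closest catalog entry."""
    tool, params = step
    if tool in CATALOG:
        return step
    closest = difflib.get_close_matches(tool, list(CATALOG), n=1, cutoff=0.0)[0]
    return (closest, params)

def fix_schema(step, i):
    """Structured edit: drop unexpected parameters and fill in missing ones."""
    tool, params = step
    if tool not in CATALOG:
        return step
    kept = {name: value for name, value in params if name in CATALOG[tool]}
    default = f"${i - 1}" if i > 0 else "<user input>"
    for name in sorted(CATALOG[tool]):
        kept.setdefault(name, default)
    return (tool, tuple(sorted(kept.items())))

def rewire(step, i):
    """Structured edit: point invalid references at the previous step."""
    tool, params = step
    fixed = []
    for name, value in params:
        if value.startswith("$") and int(value[1:]) >= i:
            value = f"${i - 1}" if i > 0 else "<user input>"
        fixed.append((name, value))
    return (tool, tuple(fixed))

EDITS = [resolve_tool, fix_schema, rewire]

def mutate(workflow, rng, n_edits):
    """Apply n_edits randomly chosen edits to randomly chosen steps."""
    steps = list(workflow)
    for _ in range(n_edits):
        i = rng.randrange(len(steps))
        steps[i] = rng.choice(EDITS)(steps[i], i)
    return tuple(steps)

def evolve(seed_workflow, generations=30, pop_size=6, seed=0):
    rng = random.Random(seed)
    population = [seed_workflow]
    for _ in range(generations):
        children = []
        for parent in population:
            # Crude stand-in for adaptive intensity: more violations, more edits.
            n_edits = min(3, max(1, errors(parent)))
            children += [mutate(parent, rng, n_edits) for _ in range(2)]
        # Deduplicate (a crude stand-in for diversity pruning), keep the parents
        # in the pool, and select the candidates with the fewest violations.
        pool = list(dict.fromkeys(population + children))
        population = sorted(pool, key=errors)[:pop_size]
        if errors(population[0]) == 0:
            break
    return population[0]

broken = (
    ("search_doc", (("query", "evoflux"),)),  # misspelled tool name
    ("fetch_page", (("url", "$2"),)),         # forward reference to step 2
    ("summarize", ()),                        # missing required parameter
)
best = evolve(broken)
print(errors(broken), errors(best))
```

Because parents stay in the selection pool, the returned workflow never has more violations than the seed. A real system would replace `errors` with feedback from actually resolving and executing tools against a live catalog.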


Community

LeoYML

Paper submitter about 3 hours ago

Evoflux tackles a practical bottleneck for small (1.5B–4B) tool-using agents: with only a few hundred teacher traces available, should that scarce supervision go into fine-tuning the model's weights, or into search at inference time? The paper reframes compact MCP-style tool use as executable workflow repair rather than one-shot function calling: the agent has to resolve tools from live catalogs, satisfy schemas, preserve cross-step dependencies, and ground its final answer in real execution. It introduces Evoflux, an inference-time evolutionary loop that evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning, all without updating weights. On held-out MCP-Bench tasks spanning live servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17–24% across small planners. SFT and SFT+DPO trained on the same search-mined data merely match, underperform, or even collapse below the zero-shot baseline, which quantifies an underreported risk in small-corpus fine-tuning of compact agents. ReAct hits higher peaks, but with far higher variance and token cost. The takeaway: under tight teacher-trace budgets, execution-grounded search transfers more reliably than weight updates.

