arxiv:2606.10917

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Published on Jun 9

· Submitted by

wangxucong on Jun 10

Authors:

Xucong Wang ,

Abstract

The Role-Agent framework enables a single LLM to function as both agent and environment through bootstrapped co-evolution, improving performance via environment-aware reasoning and targeted practice.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, a framework that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co-evolution. Role-Agent comprises two synergistic components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment-aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, yielding an average gain of over 4% over strong baselines.
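To make the AIW idea of "reshaping the training data distribution" concrete, here is a minimal sketch, not the paper's implementation. It assumes failure modes have already been reduced to short labels (for example by the LLM), and it upweights tasks that share labels with recent failures. The function name `reshape_sampling_weights`, the label format, and the `boost` knob are all hypothetical.

```python
from collections import Counter

def reshape_sampling_weights(task_tags, failure_tags, boost=1.0):
    """Return normalized sampling weights over a task pool.

    task_tags: dict mapping task_id -> set of failure-pattern labels
               (hypothetical labels, assumed to be produced upstream).
    failure_tags: list of labels extracted from recent failed trajectories.
    boost: assumed knob controlling how strongly matches are upweighted.
    """
    counts = Counter(failure_tags)
    total = sum(counts.values()) or 1  # avoid division by zero with no failures
    raw = {}
    for task_id, tags in task_tags.items():
        # Share of recent failures whose label this task also carries.
        overlap = sum(counts[t] for t in tags) / total
        raw[task_id] = 1.0 + boost * overlap
    norm = sum(raw.values())
    return {task_id: w / norm for task_id, w in raw.items()}

pool = {"a": {"navigation"}, "b": {"counting"}, "c": set()}
weights = reshape_sampling_weights(
    pool, ["navigation", "navigation", "counting"], boost=3.0
)
```

Every task keeps a nonzero base weight, so the pool is tilted toward observed failure patterns rather than collapsed onto them. Whether the paper uses a soft reweighting like this or hard retrieval of similar tasks is not stated in the abstract.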

Community

xuc865

Paper author Paper submitter 4 days ago

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution
GitHub: https://github.com/AMAP-ML/roleagent

xiaochonglinghu

4 days ago

Interesting!

avahal

1 day ago

the world-in-agent predictive reward is the most interesting bit here, turning the model into its own environment oracle and using state prediction error as a learning signal.

my worry is this could seed a feedback loop: if the model consistently mispredicts outcomes in certain failure modes, those same modes could get reinforced as the planner grows more confident about a wrong world model, especially on longer horizons.

an ablation that decouples the prediction signal by using a static env would show how much of the gains come from dynamics versus data reshaping.

btw the arxivlens breakdown helped me parse the method details and i think it nails the dual-role intuition more clearly than many talks on bootstrapped agents: https://arxivlens.com/PaperView/Details/role-agent-bootstrapping-llm-agents-via-dual-role-evolution-4586-fe42c34e

curious how this scales when the environment introduces richer, multi-modal state changes beyond text, and whether a lightweight external cue could keep the loop healthy.

xuc865

Paper author 1 day ago

Thanks for the thoughtful comment!

I agree that, in principle, a self-prediction signal could create a failure loop if the model consistently mispredicts certain outcomes and then becomes more confident in that wrong local world model. In our design, we try to reduce this risk in two ways. First, the predictive reward is not an independent source of positive reward; it only modulates the original task reward. So failed trajectories should not be reinforced just because the model produced a plausible prediction. Second, we use short prediction horizons together with strict structured-output validation and programmatic post-processing, especially to avoid noisy predictions from smaller models contaminating the loop.
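As an illustration of the first point only, here is a minimal sketch of a predictive signal that modulates the task reward instead of adding to it. The multiplicative form, the token-overlap `alignment` metric, the `alpha` coefficient, and the convention that a prediction failing validation is passed as `None` are all assumptions made for this example, not the formulation used in the paper.

```python
def alignment(predicted, actual):
    """Token-level Jaccard overlap; a stand-in alignment metric (assumed)."""
    p = set(predicted.lower().split())
    a = set(actual.lower().split())
    if not p and not a:
        return 1.0
    return len(p & a) / len(p | a)

def modulated_reward(task_reward, predictions, observations, alpha=0.5):
    """Scale the task reward by the mean prediction alignment.

    predictions: list of predicted next-state strings; None marks a
                 prediction that failed structured-output validation
                 (hypothetical convention) and contributes no bonus.
    observations: list of actual next-state strings, same length.
    alpha: assumed strength of the modulation.
    """
    scores = [
        alignment(p, o)
        for p, o in zip(predictions, observations)
        if p is not None
    ]
    mean_align = sum(scores) / len(scores) if scores else 0.0
    # Multiplicative: a zero task reward stays zero for any alignment.
    return task_reward * (1.0 + alpha * mean_align)

failed = modulated_reward(0.0, ["door is open"], ["door is open"])
solved = modulated_reward(1.0, ["door is open"], ["door is open"])
```

Under these assumptions a failed trajectory earns nothing even with a perfect prediction, while a successful trajectory with accurate predictions is scaled up by at most a factor of `1 + alpha`.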

I also agree that a decoupled ablation with a static or external environment oracle would be very useful. We have also observed that a stronger frozen external model can improve the absolute performance. In this paper, we intentionally avoid adding an extra model during training to keep the comparison fair, since otherwise the gains may partly come from extra model capacity or external knowledge rather than the dual-role mechanism itself.

For richer multi-modal environments, I think a lightweight external cue could be very helpful, e.g., a VLM state descriptor, object verifier, or consistency checker. The current setting is mostly text-state based, so the validation is cleaner; scaling to visual or embodied environments probably needs this kind of grounding signal to keep the loop healthy.

And thanks for mentioning the ArxivLens breakdown. I’m glad the dual-role intuition came through clearly there :)