arxiv:2606.24597

Qwen-AgentWorld: Language World Models for General Agents

Published on Jun 23

Submitted by taesiri on Jun 24

#1 Paper of the day

Qwen

Abstract

Language-based world models simulate agentic environments across multiple domains and improve general agents in two ways: as scalable environment simulators for agentic RL, and as a warm-up training stage that raises downstream task performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environments covering 7 domains via long chain-of-thought reasoning. Leveraging more than 10M environment interaction trajectories of 7 domains in real-world environments, we develop Qwen-AgentWorld through a three-stage training pipeline: CPT injects general-purpose world modeling capabilities from the state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity through a tailored framework with hybrid rubric-and-rule rewards. To evaluate language world models, we present AgentWorldBench, a comprehensive benchmark constructed from real-world interactions of 5 frontier models on 9 established benchmarks. Empirical results demonstrate that Qwen-AgentWorld significantly outperforms existing frontier models. (ii) Beyond foundation models, we further investigate two complementary paradigms through which world modeling enhances general agents. First, as a decoupled environment simulator, Qwen-AgentWorld supports scalable and controllable simulation of thousands of real-world environments for agentic RL, yielding gains that surpass real-environment training alone. Second, as a unified agent foundation model, world-model training acts as a highly effective warm-up that improves downstream performance across 7 agentic benchmarks. Code: https://github.com/QwenLM/Qwen-AgentWorld
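To make the "decoupled environment simulator" idea concrete, here is a minimal sketch of how a language world model could stand in for a real environment: it takes the interaction history and the agent's next action and returns a predicted observation. This is not the Qwen-AgentWorld API. `SimulatedEnv`, `WorldModel`, and `toy_terminal_model` are hypothetical names, and the toy model is a rule-based placeholder where a real system would call the language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type: (history of (action, observation) pairs, next action)
# -> predicted next observation, all as plain text.
History = List[Tuple[str, str]]
WorldModel = Callable[[History, str], str]


@dataclass
class SimulatedEnv:
    """Wraps a world model so an agent can step it like a real environment."""
    world_model: WorldModel
    history: History = field(default_factory=list)

    def step(self, action: str) -> str:
        # The world model predicts the observation the real environment
        # would have produced for this action.
        observation = self.world_model(self.history, action)
        self.history.append((action, observation))
        return observation


def toy_terminal_model(history: History, action: str) -> str:
    """Rule-based placeholder for an LLM call (assumption, for illustration).

    It only understands `touch <file>` and `ls`.
    """
    parts = action.split()
    # Reconstruct the file set from earlier `touch` commands in the history.
    files = sorted({a.split()[1] for a, _ in history
                    if a.startswith("touch ") and len(a.split()) > 1})
    if parts == ["ls"]:
        return "\n".join(files)
    if len(parts) == 2 and parts[0] == "touch":
        return ""
    name = parts[0] if parts else ""
    return f"sh: command not found: {name}"


env = SimulatedEnv(toy_terminal_model)
env.step("touch a.txt")
env.step("touch b.txt")
print(env.step("ls"))
```

An agent trained against `SimulatedEnv` never touches a real shell, which is the property that makes simulation scalable and also the source of the fidelity concerns raised in the comments.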


Community

yuxinzuo

about 16 hours ago

https://github.com/QwenLM/Qwen-AgentWorld

Cochon123

about 1 hour ago

Very good paper, but after reading it I had some complaints. I told qwen3.7 about them, and here is its own conclusion:

Bullshitometer: 4.5 / 10 — Real research, optimistic framing.

Problem 1: 50–60% fidelity is too low for clean RL signal

Defense: These are hard out-of-distribution tasks. Beating frontier models at all is progress, and downstream agent gains prove the simulator is useful.

Why it's not enough: RL amplifies distribution shift. At 50% fidelity the agent doesn't learn real environment dynamics — it learns to exploit the simulator's hallucination patterns. They never measure what the agent learned wrong, only the final benchmark score.

Problem 2: They built the benchmark they're evaluated on

Defense: AgentWorldBench is grounded in real interactions from frontier models on established external benchmarks. Ground truth comes from real environments.

Why it's not enough: They still control the rubric design, the scoring dimensions, and the judge prompts. Even with real ground truth, rubric choices silently favor behaviors their model was trained to exhibit. No independent replication is provided.

Problem 3: No ablation linking simulator fidelity to downstream agent gains

Defense: The empirical result speaks for itself — agents trained with LWM outperform those trained only on real environments. The mechanism doesn't need to be fully isolated.

Why it's not enough: Without a fidelity-vs-performance curve, you can't tell whether the gains come from the world model itself, from extra training compute, or from the controlled/adversarial trajectories that any data augmentation strategy could have produced. The core claim is not falsifiable as presented.
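One way to get the fidelity-vs-performance curve this point asks for is to degrade a simulator on purpose and measure the resulting fidelity at each noise level. The sketch below shows only that control knob on a toy integer environment. It is not an experiment from the paper: `real_step`, `degrade`, and `measured_fidelity` are hypothetical names, and the agent training and evaluation that would run at each noise level are left out.

```python
import random
from typing import Callable, Iterable, Tuple

State = int
Action = int
StepFn = Callable[[State, Action], State]


def real_step(state: State, action: Action) -> State:
    """Toy ground-truth dynamics (assumption): the state is a counter."""
    return state + action


def degrade(step_fn: StepFn, error_rate: float, rng: random.Random) -> StepFn:
    """Return a simulator that is wrong with probability `error_rate`."""
    def noisy_step(state: State, action: Action) -> State:
        next_state = step_fn(state, action)
        if rng.random() < error_rate:
            return next_state + 1  # off by one, so always a wrong prediction
        return next_state
    return noisy_step


def measured_fidelity(sim_step: StepFn, reference_step: StepFn,
                      transitions: Iterable[Tuple[State, Action]]) -> float:
    """Fraction of transitions where the simulator matches the reference."""
    pairs = list(transitions)
    matches = sum(sim_step(s, a) == reference_step(s, a) for s, a in pairs)
    return matches / len(pairs)


transitions = [(s, a) for s in range(50) for a in range(4)]
for error_rate in (0.0, 0.25, 0.5):
    sim = degrade(real_step, error_rate, random.Random(0))
    print(error_rate, measured_fidelity(sim, real_step, transitions))
```

Training the same agent against each degraded simulator, with compute held fixed, would separate the effect of simulator accuracy from the effect of extra training or extra data.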

Problem 4: The cost argument is hand-waved

Defense: The goal isn't cost reduction — it's controllability and access to environments where real execution is infeasible.

Why it's not enough: Running 397B parameters with long CoT per environment step is orders of magnitude more expensive than a real Docker container. The "infeasible real environment" use case is real but narrow. For most domains they test — terminal, web, SWE — real environments are accessible and far cheaper. No actual cost comparison is provided.
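A back-of-envelope sketch of the compute side of this comparison, using the common rule of thumb that a transformer forward pass costs roughly 2 FLOPs per active parameter per token. Two assumptions are not confirmed by the text above: the active parameter counts (3B and 17B) are read off the "A3B" and "A17B" suffixes in the model names, and the number of tokens per simulated step is a made-up placeholder. The estimate says nothing about the cost of a real container, so it covers only one half of the comparison the comment asks for.

```python
def forward_flops_per_step(active_params: float, tokens_per_step: int) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params * tokens_per_step


# Hypothetical placeholder: prompt plus chain-of-thought plus predicted
# observation for one simulated environment step.
TOKENS_PER_STEP = 2_000

# Assumed active parameter counts, inferred from the model name suffixes.
ASSUMED_ACTIVE_PARAMS = {
    "Qwen-AgentWorld-35B-A3B": 3e9,
    "Qwen-AgentWorld-397B-A17B": 17e9,
}

for name, active in ASSUMED_ACTIVE_PARAMS.items():
    flops = forward_flops_per_step(active, TOKENS_PER_STEP)
    print(f"{name}: ~{flops:.1e} FLOPs per simulated step")
```

If the naming assumption holds, per-token compute scales with the active parameters rather than the full 397B, although all weights still have to be held in memory for serving.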

Problem 5: GUI domains dodge the hardest part

Defense: Representing GUI state as accessibility trees is a principled choice that keeps the problem tractable within a language model.

Why it's not enough: Real GUI agents deal with pixel-level rendering bugs, dynamic animations, and states that accessibility trees don't capture. Claiming Android and Web "simulation" while operating on sanitized tree representations avoids the messy core of the problem and likely inflates fidelity scores on those domains.

What I'd want to ask the Qwen team

What's the fidelity-vs-downstream-performance curve? At what simulator accuracy do the RL gains disappear?
Have you measured what the agent learned wrong from the simulator — behaviors that score well in simulation but fail in real environments?
Are simulator errors random or structured? A consistently biased simulator can still be useful. An arbitrarily hallucinating one is just noise for RL.
What's the actual cost per simulated step vs a real containerized environment at RL scale?
Can AgentWorldBench be reproduced by an external team using only your published rubrics, without your internal judge prompts?
For the agent warm-up gains — have you compared against a baseline that trains on the same trajectories as plain next-token prediction, without the world-model objective? Is the framing doing real work, or is it just more data?
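The question above about whether simulator errors are random or structured can be probed with a simple breakdown: group simulator predictions by some category, such as the kind of action taken, and compare error rates across groups. The sketch below assumes each prediction has already been judged correct or incorrect against the real environment. The function name and the sample records are hypothetical, not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rate_by_category(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Map each category to the fraction of its predictions judged wrong.

    `records` holds (category, is_correct) pairs.
    """
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        if not is_correct:
            errors[category] += 1
    return {c: errors[c] / totals[c] for c in totals}


# Made-up example records, for illustration only.
records = [
    ("ls", True), ("ls", True),
    ("git", False), ("git", False), ("git", True),
]
print(error_rate_by_category(records))
```

Error rates that differ sharply between categories point to structured errors, which an agent could learn to exploit. Roughly uniform rates are more consistent with unstructured noise.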