WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces Paper • 2606.09426 • Published 7 days ago • 96
Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding? Paper • 2606.08063 • Published 9 days ago • 74
From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning Paper • 2606.07190 • Published 10 days ago • 34
Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs Paper • 2605.30611 • Published 18 days ago • 192