Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle Paper • 2606.09376 • Published 5 days ago • 6
SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces Paper • 2606.01317 • Published 13 days ago • 3
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories Paper • 2606.02060 • Published 12 days ago • 54 • 9
FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search Paper • 2606.00660 • Published 14 days ago • 8 • 2
Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems Paper • 2606.00090 • Published 21 days ago • 6 • 3