arxiv:2606.23654

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Published on Jun 22

Submitted by Kaiyan Zhang on Jun 23

#3 Paper of the day

Frontis AI

Authors:

Kaiyan Zhang

Abstract

EnterpriseClawBench is a benchmark for enterprise agents built from real-world sessions. It comprises 852 reproducible tasks and argues for reporting several evaluation dimensions rather than a single performance score.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration (Codex with GPT-5.5) reaches only 0.663. These results show that enterprise agent evaluation must report harness–model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench
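As a rough illustration of the reporting style the abstract argues for, the sketch below groups run results by harness–model pair and reports several dimensions side by side instead of one pooled score. It is not the authors' schema or code: the class names, field names, units (USD, seconds), and the `report` helper are all assumptions made for this example. Visual quality and skill-transfer behavior are omitted for brevity and could be added as further fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Task:
    """Hypothetical task record mirroring the components listed in the abstract."""
    task_id: str
    prompt: str                      # rewritten prompt
    role_class: str
    skill_subclass: str
    fixtures: tuple[str, ...] = ()   # paths to recovered fixture files
    hard_rules: tuple[str, ...] = ()
    semantic_rubrics: tuple[str, ...] = ()


@dataclass(frozen=True)
class RunResult:
    """Hypothetical result of one harness-model pair on one task."""
    task_id: str
    harness: str
    model: str
    score: float                     # assumed to lie in [0, 1]
    artifact_delivered: bool
    cost_usd: float
    runtime_s: float


def report(results):
    """Aggregate per (harness, model) pair, keeping each dimension separate."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.harness, r.model)].append(r)
    return {
        pair: {
            "n_tasks": len(runs),
            "mean_score": mean(r.score for r in runs),
            "artifact_delivery_rate": mean(
                1.0 if r.artifact_delivered else 0.0 for r in runs
            ),
            "mean_cost_usd": mean(r.cost_usd for r in runs),
            "mean_runtime_s": mean(r.runtime_s for r in runs),
        }
        for pair, runs in groups.items()
    }
```

Keying the report on the (harness, model) pair rather than on the model alone keeps the same model under two different harnesses as two separate rows, which is the distinction the abstract asks evaluations to preserve.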


Community

iseesaw

Paper author Paper submitter about 11 hours ago

Most agent benchmarks use synthetic tasks. EnterpriseClawBench is distilled from a large archive of real, proprietary workplace sessions (agents reading heterogeneous files, calling tools, and shipping actual business artifacts) and turned into 852 reproducible tasks. We deliberately don't release the data; the reusable contribution is the construction and evaluation protocol, which you can run on your own private sessions. Even the best harness–model config (Codex + GPT-5.5) reaches only 0.663, and the paper argues that a single score hides what matters: harness–model pairing, artifact delivery, cost, runtime, and skill transfer.