arxiv:2607.01647

AgenticDataBench: A Comprehensive Benchmark for Data Agents

Published on Jul 2

· Submitted by

Zhaoyan Sun on Jul 3

Tsinghua University

Abstract

A comprehensive benchmark named AgenticDataBench is introduced to evaluate data agents across diverse domains with fine-grained task annotations and skill-based coverage metrics.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing the labor-intensive effort required of data scientists and enabling scalable data-driven applications. Recently, large language model (LLM)-based data agents have emerged as a promising solution to automate data science workflows. However, the field lacks comprehensive benchmarks to rigorously evaluate these agents across diverse scenarios at fine granularity. To address this gap, we propose AgenticDataBench, a comprehensive benchmark featuring realistic tasks spanning diverse domains with fine-grained ground-truth labels. This enables evaluations to capture the diversity and complexity of data science workflows and the detailed performance of agents. First, to cover diverse domains, we collect real datasets and tasks from 15 vertical domains, including 5 real-world B2B use cases from a leading fintech company. Second, to remove redundancy in real-world tasks and generate high-quality tasks for domains lacking real data, we introduce data science skills (recurring data-centric operational patterns) and quantify benchmark coverage by the number of skills included. Representative skills are extracted from large-scale task solutions on Stack Overflow using skill-aligned hierarchical clustering. Third, for real-world business tasks, we select task-solution pairs that maximize diversity in skill composition, ensuring broad coverage of practical scenarios. Fourth, to generate realistic tasks for diverse domains without real tasks, we propose a systematic LLM-based task generation approach to create workflows and tasks based on these skills. Finally, we evaluate state-of-the-art data agents using our annotated benchmark and open-sourced testbed, providing detailed skill-level insights.
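
The abstract describes quantifying coverage by the number of skills included and selecting task-solution pairs that maximize diversity in skill composition, but it does not say how the selection is done. The sketch below is one plausible reading, not the authors' method: it assumes a simple greedy heuristic that repeatedly picks the task contributing the most not-yet-covered skills. The function names, the `budget` parameter, and the example task and skill labels are all hypothetical.

```python
from typing import Dict, List, Set


def skill_coverage(tasks: Dict[str, Set[str]], selected: List[str]) -> int:
    """Coverage as the number of distinct skills across the selected tasks."""
    covered: Set[str] = set()
    for task_id in selected:
        covered |= tasks[task_id]
    return len(covered)


def greedy_select(tasks: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick up to `budget` tasks, each time adding the one with the most new skills.

    This greedy heuristic is an assumption made for illustration; the paper
    may use a different selection procedure.
    """
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(tasks)
    while remaining and len(selected) < budget:
        # sorted() makes tie-breaking deterministic (smallest task id wins).
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # every remaining task is redundant with respect to skills
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Hypothetical task -> skill annotations, for illustration only.
example_tasks = {
    "t1": {"join", "groupby"},
    "t2": {"groupby"},
    "t3": {"pivot", "fillna", "join"},
    "t4": {"regex_extract"},
}
chosen = greedy_select(example_tasks, budget=3)
print(chosen, skill_coverage(example_tasks, chosen))
```

In this toy input, `t2` is never chosen because its only skill is already covered by `t1`, which is the kind of redundancy the skill-based view is meant to expose.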

Community

curtis-sun

Paper submitter about 1 hour ago

AgenticDataBench is a comprehensive benchmark for evaluating LLM-based data agents that automate real-world data science workflows.

It addresses the lack of rigorous evaluation by providing diverse and realistic tasks with fine-grained ground-truth labels.

The benchmark includes:

344 tasks
433 data science skills
97 datasets
27.3GB of data

It spans 15 domains, including real-world B2B fintech applications, and is structured around reusable data science skills—core operational patterns extracted from large-scale task solutions.

By combining curated real-world tasks with systematically generated ones, AgenticDataBench ensures broad coverage and minimal redundancy.

It enables both:

Overall task success evaluation
Fine-grained skill-level performance analysis
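
To make the two evaluation views concrete, here is a minimal sketch of how they could be computed from per-task outcomes. It assumes each task result is a pair of (annotated skill set, success flag) and that a task's outcome counts toward every skill it is annotated with. Both the data layout and that attribution rule are assumptions for illustration, and the benchmark's own testbed may score skills differently.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical result record: (skills annotated on the task, did the agent succeed?)
TaskResult = Tuple[Set[str], bool]


def overall_success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks the agent solved; 0.0 for an empty result list."""
    if not results:
        return 0.0
    return sum(1 for _, ok in results if ok) / len(results)


def per_skill_success_rate(results: List[TaskResult]) -> Dict[str, float]:
    """For each skill, the fraction of tasks involving it that were solved."""
    attempts: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for skills, ok in results:
        for skill in skills:
            attempts[skill] += 1
            if ok:
                successes[skill] += 1
    return {skill: successes[skill] / attempts[skill] for skill in attempts}


# Made-up outcomes, for illustration only.
example_results: List[TaskResult] = [
    ({"join", "groupby"}, True),
    ({"join"}, True),
    ({"pivot"}, False),
    ({"groupby", "pivot"}, False),
]
overall = overall_success_rate(example_results)
by_skill = per_skill_success_rate(example_results)
print(overall, by_skill)
```

On these made-up outcomes the overall rate is 0.5, while the per-skill view separates a skill the agent always handles (`join`) from one it never does (`pivot`), which a single aggregate number would hide.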
