arxiv:2508.19813

T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

Published on Aug 27, 2025 · Submitted by Yang Jian on Sep 2, 2025
#2 Paper of the day

Abstract

A bilingual benchmark named T2R-bench is proposed to evaluate the performance of large language models in generating reports from tables, highlighting the need for improvement in this task.

AI-generated summary

Extensive research has explored the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming tabular information into reports remains a significant challenge for industrial applications. The task suffers from two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks cannot adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models such as Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.

Community

Paper submitter

😎 T2R-bench: A Benchmark for Generating Article-Level Reports from Real-World Industrial Tables
This paper introduces a new benchmark called T2R-bench, designed to evaluate how well large language models (LLMs) can generate detailed reports from complex industrial tables—a common yet challenging task in real-world applications.

🧩 Problem & Motivation:
While LLMs have improved in tasks like table QA and text-to-SQL, they still struggle with generating accurate, coherent, and insightful reports from diverse and complex industrial tables. Existing benchmarks don’t adequately reflect practical industrial needs.

📊 Dataset Overview:
T2R-bench includes 457 real industrial tables from 19 domains and 4 table types, reflecting high diversity and complexity. Each table is paired with a human-written reference report.

📐 Evaluation Criteria:
The authors propose a comprehensive evaluation framework to measure report quality, focusing on information accuracy, coherence, depth of analysis, and conclusion quality.
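
As a rough illustration of how per-criterion scores could roll up into a single overall number, here is a minimal Python sketch. It is not the paper's scoring method: the criterion keys are hypothetical names derived from the list above, and the 0-100 scale and unweighted mean are assumptions made only for this example.

```python
from statistics import mean

# Hypothetical criterion keys based on the summary above. The 0-100 scale and
# the unweighted mean are assumptions for illustration, not the paper's method.
CRITERIA = (
    "information_accuracy",
    "coherence",
    "analysis_depth",
    "conclusion_quality",
)


def overall_score(scores: dict[str, float]) -> float:
    """Average the per-criterion scores of one generated report."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return round(mean(scores[c] for c in CRITERIA), 2)


def benchmark_score(per_report: list[dict[str, float]]) -> float:
    """Average the overall scores across all reports in a benchmark run."""
    return round(mean(overall_score(s) for s in per_report), 2)
```

Calling `benchmark_score` with one score dictionary per generated report gives a single figure that can be compared across models, in the same spirit as the overall score reported for the benchmark.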

🤖 Experimental Insights:
Tests on 25 popular LLMs show that even top models like Deepseek-R1 achieve an overall score of only 62.71, indicating significant room for improvement in real-world table-to-report tasks.

🔮 Conclusion:
T2R-bench fills an important gap in evaluating LLMs for practical industrial report generation. The dataset and code will be released upon publication.


