arXiv:2503.12329

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Published on Mar 16 · Submitted by cckevinn on Mar 19

Abstract

Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at https://caparena.github.io.
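
As a concrete illustration of the arena-style protocol described above, here is a minimal sketch of how pairwise caption battles can be aggregated into a model ranking and how a metric's caption-level agreement with human votes can be measured. The battle records, model names, and tie handling below are illustrative assumptions, not the released CapArena data format or scoring code.

```python
from collections import defaultdict

# Hypothetical battle records: (model_a, model_b, winner), where winner is
# "a", "b", or "tie". These entries are made up for illustration and are not
# drawn from the CapArena data.
battles = [
    ("GPT-4o", "Model-X", "a"),
    ("Model-X", "GPT-4o", "tie"),
    ("Human", "Model-X", "a"),
    ("GPT-4o", "Human", "a"),
]

def win_rates(battles):
    """Aggregate pairwise battles into per-model win rates; a tie counts as half a win."""
    wins, games = defaultdict(float), defaultdict(int)
    for a, b, winner in battles:
        games[a] += 1
        games[b] += 1
        if winner == "a":
            wins[a] += 1.0
        elif winner == "b":
            wins[b] += 1.0
        else:  # tie
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: wins[m] / games[m] for m in games}

def caption_level_agreement(metric_prefs, human_prefs):
    """Fraction of battles on which a metric's preference matches the human vote."""
    assert len(metric_prefs) == len(human_prefs)
    matches = sum(m == h for m, h in zip(metric_prefs, human_prefs))
    return matches / len(human_prefs)

if __name__ == "__main__":
    ranking = sorted(win_rates(battles).items(), key=lambda kv: kv[1], reverse=True)
    print(ranking)
    # Preferences an automated metric produced vs. the human votes for the same battles.
    print(caption_level_agreement(["a", "tie", "a", "b"], ["a", "tie", "a", "a"]))
```

Model-level agreement can then be checked by correlating the ranking induced by a metric's win rates with the ranking induced by human votes, which is the kind of comparison the 94.3% correlation figure for CapArena-Auto refers to.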

Community

Paper author and paper submitter:
  1. The first large-scale, human-centered empirical study benchmarking the performance of advanced VLMs on detailed captioning.
  2. An in-depth analysis of existing captioning metrics and of VLM-as-a-Judge (a minimal sketch of the judging pattern follows this list).
  3. A new, efficient automated evaluation benchmark for detailed captioning.
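
The VLM-as-a-Judge setup mentioned in item 2 follows a simple pattern: a judge model is shown the image and two candidate captions and returns a pairwise verdict. The prompt wording, the `query_judge` callable, and the output parsing below are assumptions for illustration only; they are not the prompt or interface used in the paper.

```python
from typing import Callable

# Prompt wording is an illustrative assumption, not the paper's actual judge prompt.
JUDGE_PROMPT = (
    "You are shown an image and two candidate captions.\n"
    "Caption A: {caption_a}\n"
    "Caption B: {caption_b}\n"
    "Which caption describes the image more accurately and in more detail?\n"
    "Answer with exactly one word: A, B, or Tie."
)

def judge_pair(
    image_path: str,
    caption_a: str,
    caption_b: str,
    query_judge: Callable[[str, str], str],
) -> str:
    """Ask a multimodal judge model (supplied by the caller) for a pairwise verdict.

    `query_judge(image_path, prompt)` is a placeholder for whatever VLM backend
    is available (API client or locally hosted model); it is assumed, not provided here.
    Returns "a", "b", or "tie", matching the battle format in the earlier sketch.
    """
    prompt = JUDGE_PROMPT.format(caption_a=caption_a, caption_b=caption_b)
    raw = query_judge(image_path, prompt).strip().lower()
    if raw.startswith("a"):
        return "a"
    if raw.startswith("b"):
        return "b"
    return "tie"
```

Verdicts produced this way can be fed directly into the win-rate aggregation sketched after the abstract, which is how judge-based rankings can be compared against human ones.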
