arxiv:2606.28322

PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

Published on Jun 26 · Submitted by Yana Wei on Jul 2
#1 Paper of the day

Abstract

PerceptionRubrics presents a rubric-based evaluation framework that identifies gaps between benchmark scores and real-world performance through atomic auditing and gated scoring mechanisms.

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.

Community

🚀 Multimodal large language models are posting increasingly saturated scores on perception benchmarks, which makes it harder to tell models apart by ranking. In real-world use, however, they still make “unacceptable” visual mistakes: miscounting objects, misinterpreting spatial relations, missing key numbers in charts, or misrecognizing buttons and text in user interfaces.

These errors may be "diluted" by conventional average scores, but from a human perspective, a single critical factual error can make the entire response unreliable.

👉 We introduce PerceptionRubrics, a rubric-based evaluation framework for multimodal perception. It automatically decomposes complex image understanding into verifiable atomic visual facts, and designs two types of evaluation criteria together with a corresponding gated scoring metric:

Must-Right: core facts that the model must perceive correctly;
Easy-Wrong: fine-grained details that models are prone to omit, hallucinate, or misinterpret.
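
The contrast between a linear average and a gated metric can be illustrated with a minimal Python sketch. This page does not state the exact formula, so the rule used here is an assumption for illustration only: the score is zero when any Must-Right rubric fails, and otherwise equals the fraction of all rubrics passed. The names `RubricResult`, `linear_score`, and `gated_score` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    kind: str      # "must_right" or "easy_wrong" (labels assumed for this sketch)
    passed: bool   # verdict for one atomic rubric


def linear_score(results):
    """Plain average: every rubric counts equally."""
    if not results:
        raise ValueError("no rubric results to score")
    return sum(r.passed for r in results) / len(results)


def gated_score(results):
    """Assumed gating rule: any failed Must-Right rubric zeroes the score."""
    must_right = [r for r in results if r.kind == "must_right"]
    if not all(r.passed for r in must_right):
        return 0.0
    return linear_score(results)


# One response: 3 Must-Right rubrics (one failed) and 7 Easy-Wrong rubrics (all passed).
example = (
    [RubricResult("must_right", True)] * 2
    + [RubricResult("must_right", False)]
    + [RubricResult("easy_wrong", True)] * 7
)

print(linear_score(example))  # 9 of 10 rubrics passed
print(gated_score(example))   # the gate closes on the failed Must-Right rubric
```

Under this assumed rule, the single missed core fact is diluted to a 10% deduction by the linear average but drives the gated score to zero.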

👉 Our benchmark contains 1,038 information-dense images and over 10,000 instance-specific rubrics, covering seven domains including natural scenes, OCR documents, GUIs, charts, STEM, logic puzzles, and creative/cultural images. We evaluate 20+ mainstream MLLMs, including GPT-5.5.

👉 Our results show that models can often recognize fragmented details correctly, yet fail to consistently satisfy multiple critical visual constraints. In particular, perceptual reliability remains a major bottleneck in information-dense scenarios such as GUIs, documents, and structured charts.
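
To make the distinction between recognizing fragmented details and satisfying all constraints at once concrete, here is a small sketch on made-up data (not results from the paper). It compares per-rubric accuracy with a conjunctive pass rate, where an image counts as passed only if every one of its rubrics is satisfied. The function names and the toy verdicts are hypothetical.

```python
def rubric_accuracy(instances):
    """Fraction of individual rubric checks passed, pooled over all images."""
    flat = [ok for inst in instances for ok in inst]
    return sum(flat) / len(flat)


def conjunctive_pass_rate(instances):
    """Fraction of images for which every rubric check passed."""
    return sum(all(inst) for inst in instances) / len(instances)


# Toy verdicts: one inner list per image, one boolean per critical rubric.
instances = [
    [True, True, True, False],
    [True, True, True, True],
    [True, False, True, True],
    [True, True, True, True],
]

print(rubric_accuracy(instances))        # 14 of 16 checks passed
print(conjunctive_pass_rate(instances))  # 2 of 4 images fully correct
```

In this toy case most individual checks pass, yet only half of the images satisfy every constraint simultaneously.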

PerceptionRubrics provides a stricter, more diagnostic evaluation tool that aligns more closely with human perception, helping the community understand and improve the visual reliability of multimodal models.

Get this paper in your agent:

hf papers read 2606.28322
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
