arxiv:2606.28322

PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

Published on Jun 26

Submitted by Yana Wei on Jul 2

#1 Paper of the day

Johns Hopkins University

Abstract

PerceptionRubrics presents a rubric-based evaluation framework that identifies gaps between benchmark scores and real-world performance through atomic auditing and gated scoring mechanisms.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.

Community

llwswyn

Paper submitter about 14 hours ago

edited about 11 hours ago

🚀 Multimodal large language models are posting increasingly saturated scores on perception benchmarks, which makes it harder to tell models apart by rank. In real-world use, however, they still make "unacceptable" visual mistakes: miscounting objects, misinterpreting spatial relations, missing key numbers in charts, or misrecognizing buttons and text in user interfaces.

These errors may be "diluted" by conventional average scores, but from a human perspective, a single critical factual error can make the entire response unreliable.

👉 We introduce PerceptionRubrics, a rubric-based evaluation framework for multimodal perception. It automatically decomposes complex image understanding into verifiable atomic visual facts, and defines two types of evaluation criteria, paired with a gated scoring metric:

Must-Right: core facts that the model must perceive correctly;
Easy-Wrong: fine-grained details that models are prone to omit, hallucinate, or misinterpret.
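
To make the contrast with a plain average concrete, here is a minimal sketch of one possible gating rule. The post does not give the actual formula, so everything here is an assumption for illustration: the `RubricResult` type, the `"must_right"` / `"easy_wrong"` labels, and the rule itself (score zero if any Must-Right rubric fails, otherwise the Easy-Wrong pass rate).

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    kind: str     # hypothetical labels: "must_right" or "easy_wrong"
    passed: bool  # verdict for this single atomic rubric


def linear_score(results):
    """Plain average over all rubrics; a failed critical fact is diluted."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gated_score(results):
    """Assumed gate: any failed Must-Right rubric zeroes the whole response.

    If the gate is passed, the score is the Easy-Wrong pass rate.
    This is an illustrative rule, not the paper's published metric.
    """
    must = [r for r in results if r.kind == "must_right"]
    easy = [r for r in results if r.kind == "easy_wrong"]
    if not all(r.passed for r in must):
        return 0.0
    if not easy:
        return 1.0
    return sum(r.passed for r in easy) / len(easy)


# Made-up verdicts for one response: one critical fact is wrong.
results = [
    RubricResult("must_right", True),
    RubricResult("must_right", False),
    RubricResult("easy_wrong", True),
    RubricResult("easy_wrong", True),
]

print(linear_score(results))  # 3 of 4 rubrics pass
print(gated_score(results))   # the failed Must-Right rubric closes the gate
```

Under this assumed rule, the response keeps a linear score of 0.75 but a gated score of 0.0, which is the "dilution" effect the post describes.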

👉 Our benchmark contains 1,038 information-dense images and over 10,000 instance-specific rubrics, covering seven domains: natural scenes, OCR documents, GUIs, charts, STEM, logic puzzles, and creative/cultural images. We evaluate 20+ mainstream MLLMs, including GPT-5.5.

👉 Our results show that models can often recognize fragmented details correctly, yet fail to consistently satisfy multiple critical visual constraints. In particular, perceptual reliability remains a major bottleneck in information-dense scenarios such as GUIs, documents, and structured charts.
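
The gap between recognizing individual details and satisfying all constraints at once can be illustrated with a small sketch. The verdicts below are made up for illustration, and the two function names are hypothetical; the point is only that pooled per-rubric accuracy and the strict all-rubrics-pass rate can diverge sharply.

```python
def atomic_accuracy(per_image):
    """Fraction of individual rubric checks passed, pooled over all images."""
    total = sum(len(checks) for checks in per_image)
    passed = sum(sum(checks) for checks in per_image)
    return passed / total


def conjunctive_pass_rate(per_image):
    """Fraction of images for which every rubric check passed."""
    return sum(all(checks) for checks in per_image) / len(per_image)


# Illustrative verdicts: one list of rubric outcomes per image (not real data).
per_image = [
    [True, True, True, True, False],
    [True, True, True, True, True],
    [True, False, True, True, True],
    [True, True, True, False, True],
]

print(atomic_accuracy(per_image))        # 17 of 20 checks pass
print(conjunctive_pass_rate(per_image))  # only 1 of 4 images passes every check
```

In this toy case the model gets 85% of atomic checks right yet fully satisfies only 25% of images, because a single miss anywhere breaks the conjunction.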

PerceptionRubrics provides a stricter, more diagnostic evaluation tool that aligns more closely with human perception, helping the community understand and improve the visual reliability of multimodal models.