PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception
Abstract
PerceptionRubrics presents a rubric-based evaluation framework that identifies gaps between benchmark scores and real-world performance through atomic auditing and gated scoring mechanisms.
We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields three critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics align with human judgment substantially better than conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
Community
🚀 Multimodal large language models are posting increasingly saturated scores on perception benchmarks, making it harder to tell models apart. However, in real-world use, they still make “unacceptable” visual mistakes: miscounting objects, misinterpreting spatial relations, missing key numbers in charts, or misrecognizing buttons and text in user interfaces.
These errors may be "diluted" by conventional average scores, but from a human perspective, a single critical factual error can make the entire response unreliable.
👉 We introduce PerceptionRubrics, a rubric-based evaluation framework for multimodal perception. It automatically decomposes complex image understanding into verifiable atomic visual facts, and defines two types of evaluation criteria, paired with a gated scoring metric:
Must-Right: core facts that the model must perceive correctly;
Easy-Wrong: fine-grained details that models are prone to omit, hallucinate, or misinterpret.
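As a rough illustration of how a gated metric differs from a plain average, the sketch below scores one response against its rubrics in both ways. The exact gating formula used by PerceptionRubrics is not given here, so the rule in `gated_score` (any failed Must-Right rubric zeroes the score, otherwise the pass fraction is returned) is an assumption, and the names `RubricResult`, `linear_score`, and `gated_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Outcome of checking one atomic rubric against a model response."""
    kind: str      # "must_right" or "easy_wrong" (hypothetical labels)
    passed: bool


def linear_score(results: List[RubricResult]) -> float:
    """Conventional average: fraction of rubrics that passed."""
    if not results:
        raise ValueError("at least one rubric result is required")
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: a single failed Must-Right rubric yields 0.0.

    If every Must-Right rubric passes, fall back to the pass fraction
    over all rubrics. This is one plausible reading of "gated scoring",
    not necessarily the paper's definition.
    """
    must_right = [r for r in results if r.kind == "must_right"]
    if not all(r.passed for r in must_right):
        return 0.0
    return linear_score(results)


# Twelve rubrics, one critical fact wrong (for example, a miscounted object).
example = (
    [RubricResult("must_right", True)] * 3
    + [RubricResult("must_right", False)]
    + [RubricResult("easy_wrong", True)] * 8
)

print(round(linear_score(example), 3))  # the single error is diluted: 11 of 12 pass
print(gated_score(example))             # the gate closes: 0.0
```

Under this assumed rule, the averaged score stays above 0.9 even though a mandatory fact is wrong, while the gated score drops to zero, which is the dilution effect described above.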
👉 Our benchmark contains 1,038 information-dense images and over 10,000 instance-specific rubrics, covering seven domains: natural scenes, OCR documents, GUIs, charts, STEM, logic puzzles, and creative/cultural images. We evaluate 20+ mainstream MLLMs, including GPT-5.5.
👉 Our results show that models can often recognize fragmented details correctly, yet fail to consistently satisfy multiple critical visual constraints. In particular, perceptual reliability remains a major bottleneck in information-dense scenarios such as GUIs, documents, and structured charts.
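The gap between recognizing individual details and satisfying all critical constraints at once can be made concrete with a small calculation. The sketch below is illustrative only: the function name `atomic_vs_conjunctive` and the toy data are assumptions, not results or code from the paper. It compares per-check accuracy with the fraction of images on which every check passes.

```python
from typing import List, Tuple


def atomic_vs_conjunctive(instances: List[List[bool]]) -> Tuple[float, float]:
    """Return (atomic accuracy, conjunctive pass rate).

    Each inner list holds the pass/fail outcomes of the critical checks
    for one image. Atomic accuracy pools all checks together; the
    conjunctive rate counts an image only if all of its checks pass.
    """
    if not instances or any(len(checks) == 0 for checks in instances):
        raise ValueError("every instance needs at least one check")
    total_checks = sum(len(checks) for checks in instances)
    atomic = sum(sum(checks) for checks in instances) / total_checks
    conjunctive = sum(all(checks) for checks in instances) / len(instances)
    return atomic, conjunctive


# Toy data: four images, five critical checks each, three single failures.
toy = [
    [True, True, True, True, False],
    [True, True, True, True, True],
    [True, False, True, True, True],
    [True, True, True, False, True],
]

atomic, conjunctive = atomic_vs_conjunctive(toy)
print(atomic)       # 0.85 (17 of 20 checks pass)
print(conjunctive)  # 0.25 (only 1 of 4 images passes every check)
```

In this toy case most individual checks pass, yet only one image in four is fully correct, which mirrors how scattered single errors compound under strict conjunctive constraints.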
PerceptionRubrics provides a stricter, more diagnostic evaluation tool that aligns more closely with human perception, helping the community understand and improve the visual reliability of multimodal models.

