EgoNormia: Benchmarking Physical Social Norm Understanding
Abstract
Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms, but also consider the trade-offs between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia ‖ε‖, consisting of 1,853 ego-centric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human benchmark of 92%). Our analysis of performance along each dimension highlights significant risks in safety and privacy, as well as a lack of collaboration and communication capability, when such models are deployed in real-world agents. We additionally show that, through a retrieval-based generation method, EgoNormia can be used to enhance normative reasoning in VLMs.
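To make the retrieval-based generation idea concrete, below is a minimal sketch (not the authors' implementation) of how EgoNormia examples could be retrieved as in-context demonstrations for a VLM: embed a textual description of the query scenario, fetch the most similar benchmark examples by cosine similarity, and prepend them to the prompt. The `NormExample` fields and the toy hashing embedder are illustrative assumptions standing in for the dataset schema and a real text/video encoder.

```python
# Hedged sketch of retrieval-augmented normative reasoning with EgoNormia-style examples.
from dataclasses import dataclass
import numpy as np

@dataclass
class NormExample:
    context: str          # textual description of the ego-centric scene (assumed field)
    action: str           # ground-truth normative action (assumed field)
    justification: str    # ground-truth justification (assumed field)

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing bag-of-words embedding; a real system would use a text/video encoder."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query: str, bank: list[NormExample], k: int = 3) -> list[NormExample]:
    """Return the k examples whose contexts are most similar to the query (cosine similarity)."""
    q = embed(query)
    sims = [float(q @ embed(ex.context)) for ex in bank]
    top = np.argsort(sims)[::-1][:k]
    return [bank[i] for i in top]

def build_prompt(query: str, retrieved: list[NormExample]) -> str:
    """Format retrieved examples as few-shot demonstrations to prepend to a VLM prompt."""
    demos = "\n\n".join(
        f"Scene: {ex.context}\nNormative action: {ex.action}\nWhy: {ex.justification}"
        for ex in retrieved
    )
    return f"{demos}\n\nScene: {query}\nNormative action:"

if __name__ == "__main__":
    bank = [
        NormExample("A stranger drops their wallet on a crowded train platform.",
                    "Pick it up and hand it back immediately.",
                    "Cooperation and politeness norms favor returning lost property."),
        NormExample("A coworker is on a video call in a shared office.",
                    "Lower your voice and avoid walking through the camera frame.",
                    "Privacy and proxemics norms discourage intruding on the call."),
    ]
    query = "Someone ahead of you drops their keys while boarding a bus."
    print(build_prompt(query, retrieve(query, bank, k=1)))
```

The design choice here is simply few-shot prompting over retrieved neighbors; the paper's actual retrieval and generation details may differ.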
Community
tl;dr
We created a challenging large-scale ego-centric benchmark to test the normative reasoning capabilities of frontier VLMs.
Please check out our leaderboard at https://egonormia.org and our blog post at https://opensocial.world/articles/egonormia for more information, code, and the data viewer.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models (2025)
- Social Genome: Grounded Social Reasoning Abilities of Multimodal Models (2025)
- Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! (2025)
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding (2025)
- All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark (2025)
- Evaluating Social Biases in LLM Reasoning (2025)
- Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey (2025)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`