MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Abstract
Comprehensive evaluation of Multimodal Large Language Models (MLLMs) has recently garnered widespread attention in the research community. However, we observe that existing benchmarks present several common barriers that make it difficult to measure the significant challenges that models face in the real world, including: 1) small data scale leads to a large performance variance; 2) reliance on model-based annotations results in restricted data quality; 3) insufficient task difficulty, especially caused by the limited image resolution. To tackle these issues, we introduce MME-RealWorld. Specifically, we collect more than 300K images from public datasets and the Internet, filtering 13,366 high-quality images for annotation. This involves the efforts of professional 25 annotators and 7 experts in MLLMs, contributing to 29,429 question-answer pairs that cover 43 subtasks across 5 real-world scenarios, extremely challenging even for humans. As far as we know, MME-RealWorld is the largest manually annotated benchmark to date, featuring the highest resolution and a targeted focus on real-world applications. We further conduct a thorough evaluation involving 28 prominent MLLMs, such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Our results show that even the most advanced models struggle with our benchmarks, where none of them reach 60% accuracy. The challenges of perceiving high-resolution images and understanding complex real-world scenarios remain urgent issues to be addressed. The data and evaluation code are released at https://mme-realworld.github.io/ .
Community
We are very proud to launch MME-RealWorld, which contains 13K high-quality images, annotated by more than 30 workers, resulting in 29K question-answer pairs that cover 43 subtasks across five real-world scenarios.
We further conduct a thorough evaluation involving 28 prominent MLLMs, such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Our results show that existing models perform poorly in multiple real-world tasks, and there is a significant gap among different models. We hope that MME-RealWorld will further inspire the research community to improve and evolve this field.
incredible!
Real science!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity (2024)
- MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation (2024)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (2024)
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images (2024)
- MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 3
Spaces citing this paper 0
No Space linking this paper