RealWorldQA, What's New?

Community Article Published April 25, 2024

This is a short blog post introducing the RealWorldQA benchmark.

What is RealWorldQA?

RealWorldQA is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal AI models, released by xAI. It assesses how well these models comprehend physical environments. The benchmark consists of 700+ images, each accompanied by a question and a verifiable answer. These images are drawn from real-world scenarios, including those captured from vehicles. The goal is to advance AI models' understanding of our physical world.

Statistics & Info

| Name | Type | #Questions | Data Quality* (10% of samples manually verified) | Fine-grained Classes |
| --- | --- | --- | --- | --- |
| RealWorldQA | MCQ | 765 | > 97% | No |

TL;DR: **RealWorldQA** is a benchmark that requires VLMs to:

  1. Recognize details in high-resolution images (1080p, etc.).
  2. Reason over the recognition results (which may require commonsense knowledge).

*Data Quality: We manually verified 10% of the samples, checking whether each one is correct and unambiguous. Most samples (>97%) in RealWorldQA are clear and correct.

A few cases I found ambiguous:


  • Question: Where is the dog in relation to the door?
  • Choices: A. The dog is behind the door; B. The dog is next to the door; C. The dog is in front of the door.
  • Answer: A
  • Why ambiguous: The dog is actually between two doors.


  • Question: How far from the camera is the rightmost vehicle?
  • Choices: A. 15 meters; B. 35 meters; C. 55 meters.
  • Answer: C
  • Why ambiguous: Is the rightmost car really that far away?


Questions in RealWorldQA have 2-4 candidate choices (the majority have 3), so the expected Top-1 accuracy of random guessing is 37.7%.
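The random-guess expectation is simply the average of 1/k over all questions, where k is the number of choices per question. A minimal sketch (the choice counts below are illustrative, not the real RealWorldQA distribution):

```python
def random_guess_accuracy(choice_counts):
    """Expected Top-1 accuracy of uniform random guessing.

    A question with k choices is answered correctly with probability 1/k,
    so the expectation over the set is the mean of 1/k.
    """
    return sum(1.0 / k for k in choice_counts) / len(choice_counts)

# Hypothetical toy set: mostly 3-choice questions, one 2-choice, one 4-choice.
counts = [3, 3, 3, 2, 4]
print(random_guess_accuracy(counts))  # 0.35 for this toy set
```

Plugging in the actual per-question choice counts of RealWorldQA yields the reported 37.7%.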

We perform the evaluation using VLMEvalKit and list the performance of representative VLMs (proprietary and open-source) below:

| Proprietary Models | Acc | Proprietary Models | Acc |
| --- | --- | --- | --- |
| GPT-4v (0409, low-res) | 61.4 | GPT-4v (0409, high-res) | 68.0 |
| GeminiPro-V (1.0) | 60.4 | QwenVLMax | 61.3 |
| **OpenSource Models** | **Acc** | **OpenSource Models** | **Acc** |
| InternLM-XComposer2 | 63.8 | InternVL-Chat-V1.5 | 65.6 |
| IDEFICS2-8B | 60.8 | LLaVA-NeXT (Yi-34B) | 66.0 |
| LLaVA-v1.5 (7B) | 54.8 | LLaVA-v1.5 (13B) | 55.3 |

Grok-v1.5 is not included since it's not publicly available.

Among the evaluated VLMs, GPT-4v (0409, high-res) achieves the best performance and significantly outperforms its low-res version (recall that RealWorldQA requires fine-grained recognition in high-res images). Meanwhile, top open-source VLMs also display competitive performance.

Hard Cases

We select the subset of questions that none of the Top-3 VLMs (GPT-4v (0409, high-res), InternVL-Chat-V1.5, LLaVA-NeXT (Yi-34B)) answer correctly. The subset includes 101 samples. We visualize several random samples from the subset below.


  • Question: Is the car closest to us driving in the same direction as us or in the opposite direction from us?
  • Choices: A. Same direction; B. Opposite direction.
  • Answer: B
  • Requirement: 1. Locate the closest car and find its direction; 2. Locate the lane we are in and infer our direction of travel.


  • Question: In which direction is the one-way sign in this scene facing?
  • Choices: A. Left; B. Right
  • Answer: B
  • Requirement: Localize the one-way sign and find its direction


  • Question: Are there some STOP signs?
  • Choices: A. Yes; B. No
  • Answer: A
  • Requirement: Localize the stop sign (which is extremely small)


  • Question: How many arrows are pointing right?
  • Choices: A. 2; B. 3; C. 4
  • Answer: B
  • Requirement: Find all arrows on the road sign and recognize their directions


  • RealWorldQA is a benchmark that requires VLMs to: 1. Recognize details in high-res images (1080p, etc.); 2. Perform **reasoning based on recognition results** (may require commonsense knowledge)
  • Performance Numbers: Random Guess - 37.7%; Best Proprietary VLM evaluated: GPT-4v (0409, high-res), 68%; Best Open-Source VLM evaluated: LLaVA-NeXT (Yi-34B), 66%
  • You can use VLMEvalKit to evaluate your own VLM on RealWorldQA. Full evaluation results are available at Open VLM Leaderboard.
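As a rough illustration of how the MCQ accuracy numbers above are computed, here is a minimal scoring sketch. The function name and toy data are hypothetical; VLMEvalKit's actual pipeline additionally extracts option letters from free-form model outputs before scoring.

```python
def score_mcq(predictions, answers):
    """Top-1 accuracy for parallel lists of MCQ option letters.

    Comparison is case-insensitive and ignores surrounding whitespace.
    """
    assert len(predictions) == len(answers), "lists must be parallel"
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

# Hypothetical model outputs vs. ground truth (2 of 3 correct).
preds = ["A", "b", "C"]
gold = ["A", "B", "B"]
print(score_mcq(preds, gold))
```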