Are We Ready for Multi-Image Reasoning? Launching VHs: The Visual Haystacks Benchmark!

Community Article Published July 23, 2024

What Is Visual Question Answering (VQA)?

Humans excel at processing vast arrays of visual information, a skill that is crucial for achieving artificial general intelligence (AGI). Over the decades, AI researchers have developed Visual Question Answering (VQA) systems to interpret scenes within single images and answer related questions. While recent advancements in foundation models have significantly closed the gap between human and machine visual processing, conventional VQA has been restricted to reasoning about a single image at a time rather than whole collections of visual data.

This limitation poses challenges in more complex scenarios. Take, for example, discerning patterns in collections of medical images, monitoring deforestation through satellite imagery, mapping urban changes using autonomous navigation data, analyzing thematic elements across large art collections, or understanding consumer behavior from retail surveillance footage. Each of these scenarios entails not only visual processing across hundreds or thousands of images but also cross-image reasoning over those findings. To address these complex yet practical needs, in this project we study "Multi-Image Question Answering" (MIQA), a more general task that exceeds the reach of most traditional VQA systems.

How to Benchmark VQA Models on MIQA?

The "Needle-In-A-Haystack" (NIAH) challenge has recently become one of the most popular paradigms for benchmarking LLM’s ability to process inputs containing "long contexts", large sets of input data (such as long documents, videos, or hundreds of images). In this task, essential information (“the needle”), which contains the answer to a specific question, is embedded within a vast amount of data (“the haystack”). The system must then retrieve the relevant information and answer the question correctly.

The first NIAH benchmark for visual reasoning was introduced by Google in the Gemini-v1.5 technical report. In this report, they asked their models to retrieve text overlaid on a single frame in a large video. It turns out that existing models perform quite well on this task - primarily due to their strong OCR retrieval capabilities. But what if we ask more visual questions? Do models still perform as well?


We developed Visual Haystacks (VHs), the first visual-centric benchmark that significantly challenges all existing Large Multimodal Models (LMMs) in processing long-context visual information.

Caption: When the needle information is replaced with realistic visual content, even under a straightforward question, existing models struggle significantly.

What is the Visual Haystacks (VHs) Benchmark?

In pursuit of evaluating "visual-centric" long-context reasoning capabilities, we introduce the "Visual Haystacks (VHs)" benchmark. This new benchmark is designed to assess Large Multimodal Models (LMMs) in visual retrieval and reasoning across large uncorrelated image sets. VHs features approximately 1K binary visual question-answer pairs, with each image set containing anywhere from 1 to 10K images. Unlike previous benchmarks that focused on textual retrieval and reasoning, VHs questions center on identifying the presence of specific visual content, such as objects, utilizing images and annotations from the COCO dataset.

The VHs benchmark is divided into two main challenges, each designed to test the model's ability to accurately locate and analyze relevant images before responding to queries. We have carefully designed the dataset to ensure that guessing or relying on common-sense reasoning without viewing the images yields no advantage (i.e., such strategies result in roughly a 50% accuracy rate on this binary QA task). The two question templates are illustrated in the sketch after the list below.

  • Single-Needle Challenge: Only a single needle image exists in the haystack of images. The question is framed as, "For the image with the anchor object, is there a target object?"
  • Multi-Needle Challenge: Two to five needle images exist in the haystack of images. The question is framed as either, "For all images with the anchor object, do all of them contain the target object?" or "For all images with the anchor object, do any of them contain the target object?"
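
To make these two settings concrete, here is a minimal sketch of how the question templates above could be instantiated. The helper functions and example objects are ours and purely illustrative; they are not the benchmark's actual data-generation code.

```python
def single_needle_question(anchor, target):
    # Single-needle: exactly one image in the haystack contains the anchor object.
    return f"For the image with {anchor}, is there {target}?"

def multi_needle_question(anchor, target, mode="all"):
    # Multi-needle: 2-5 images contain the anchor object; ask about all or any of them.
    quantifier = "do all of them contain" if mode == "all" else "do any of them contain"
    return f"For all images with {anchor}, {quantifier} {target}?"

print(single_needle_question("a dog", "a frisbee"))
print(multi_needle_question("a dog", "a frisbee", mode="any"))
# Ground-truth answers are binary (yes/no), so blind guessing sits near 50% accuracy.
```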

Three Important Findings from VHs

The Visual Haystacks (VHs) benchmark reveals significant challenges faced by current Large Multimodal Models (LMMs) when processing extensive visual inputs. In our experiments across both single and multi-needle modes, we evaluated several open-source and proprietary methods including LLaVA-v1.5, GPT-4o, Claude-3 Opus, and Gemini-v1.5. Additionally, we include a “Captioning” baseline, employing a two-stage approach where images are initially captioned using LLaVA, followed by answering the question using the captions' text content. Below are three pivotal insights:
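
For reference, the "Captioning" baseline is conceptually simple. The sketch below outlines the two-stage recipe, with hypothetical `caption_image` and `ask_llm` helpers standing in for LLaVA and the text-only LLM; it is our paraphrase of the approach, not the exact evaluation code.

```python
def captioning_baseline(images, question, caption_image, ask_llm):
    """Two-stage baseline: caption every image, then answer from text alone.

    caption_image(image) -> str  # e.g., a LLaVA captioning call (hypothetical wrapper)
    ask_llm(prompt) -> str       # e.g., a text-only LLM call (hypothetical wrapper)
    """
    captions = [f"Image {i + 1}: {caption_image(img)}" for i, img in enumerate(images)]
    prompt = (
        "You are given captions of a set of images.\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nAnswer yes or no."
    )
    return ask_llm(prompt)
```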

  1. Struggles with Visual Distractors

    In single-needle settings, a notable decline in performance was observed as the number of images increased, even though oracle accuracy (answering with only the needle image) remained high; this decline was absent in prior text-based, Gemini-style benchmarks. This shows that existing models may mainly struggle with visual retrieval, especially in the presence of challenging visual distractors. Furthermore, it's crucial to highlight the constraints on open-source LMMs like LLaVA, which can handle only up to three images due to a 2K context length limit. On the other hand, proprietary models such as Gemini-v1.5 and GPT-4o, despite their claims of extended context capabilities, often fail to manage requests when the image count exceeds 1,000, due to payload size limits.

    (Caption: Results for several models on VHs for single-needle questions. All models experience significant falloff as the size of the haystack (N) increases, indicating that existing approaches are not robust to complex visual-linguistic processing over long visual contexts. E: Exceeds context length.)

  2. Difficulty Reasoning Across Multiple Images

    Interestingly, all LMM-based methods showed weaker performance in single-needle settings with 5+ images and in all multi-needle settings compared to a basic approach that chains a captioning model (LLaVA) with an LLM aggregator (Llama-3). This discrepancy suggests that while LLMs are capable of integrating long-context captions effectively, existing LMM-based solutions are inadequate for processing and integrating information across multiple images. Notably, performance significantly deteriorates in multi-image scenarios, with Claude-3 Opus showing disappointing results even when given only the oracle images, and Gemini-v1.5/GPT-4o dropping to 50% accuracy with larger sets of 50 images.

    (Caption: Performance on VHs for multi-needle questions. We can see that all visually-aware models perform poorly, indicating that models find it challenging to implicitly integrate visual information. E: Exceeds context length.)

  3. The "Lost-in-the-Middle" Phenomenon in the Visual Domain

    Finally, we found that the accuracy of LMMs is hugely affected by the position of the needle image within the input sequence. For instance, LLaVA shows better performance when the needle image is placed immediately before the question, suffering up to a 26.5% drop otherwise. In contrast, proprietary models generally perform better when the image is positioned at the start, experiencing up to a 28.5% decrease when it is not. This pattern echoes the "lost-in-the-middle" phenomenon seen in Natural Language Processing (NLP), where models handle crucial information far better when it appears at the beginning or end of the context than when it is buried in the middle. This issue was not evident in previous Gemini-style NIAH challenges, which only required text retrieval and reasoning, underscoring the unique challenges posed by the VHs benchmark.
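
To quantify this positional sensitivity, one can sweep the needle over every position in the input sequence and score each placement separately. The loop below is an illustrative sketch of that evaluation, with a hypothetical `answer_question` model wrapper; it is not the paper's actual evaluation script, and in practice the scores would be averaged over many QA pairs.

```python
def position_sweep(needle, distractors, question, expected, answer_question):
    """Measure correctness as a function of where the needle image appears in the context."""
    correct_by_position = {}
    for pos in range(len(distractors) + 1):
        images = distractors[:pos] + [needle] + distractors[pos:]
        prediction = answer_question(images, question)  # hypothetical LMM call
        correct_by_position[pos] = float(prediction.strip().lower() == expected)
    return correct_by_position
```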

(Note: These experiments were conducted in April and May 2024, and we have observed some improvements in proprietary models since then.)

MIRAGE: Our RAG-based Solution to Achieve Improved Performance on VHs

Based on the experimental results above, it is clear that the core challenges for existing solutions in MIQA lie in the ability to (1) accurately retrieve relevant images from a vast pool of potentially unrelated images without positional biases and (2) integrate relevant visual information from these images to correctly answer the question. To address these issues, we introduce "MIRAGE" (Multi-Image Retrieval Augmented Generation), a simple, open-source, single-stage training paradigm that extends the LLaVA model to handle MIQA tasks. The image below shows our model architecture.

(Figure: The MIRAGE model architecture.)

Our proposed paradigm consists of several components, each designed to alleviate key issues in the MIQA task:

  1. Compress existing encodings: The MIRAGE paradigm leverages a query-aware compression model to reduce the visual encoder tokens to a smaller subset (10x smaller), allowing for more images in the same context length.
  2. Employ a retriever to filter out irrelevant images: MIRAGE uses a retriever, trained in-line with the LLM fine-tuning, to predict whether an image will be relevant and to dynamically drop irrelevant images.
  3. Multi-Image Training Data: MIRAGE augments existing single-image training data with multi-image reasoning data and synthetic multi-image reasoning data, yielding an open-source multi-image QA dataset with more than 1M examples. A conceptual sketch of how these components fit together at inference time follows.
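
The sketch below shows the data flow implied by the three components above: compress each image's tokens, keep only images the retriever predicts are relevant, then generate an answer. The four callables are hypothetical stand-ins for MIRAGE's actual modules, intended only to convey the pipeline, not its real API.

```python
def mirage_style_answer(images, question, encode, compress, relevance, generate,
                        keep_threshold=0.5):
    """Sketch of a retrieval-augmented multi-image QA forward pass.

    encode(image) -> visual tokens              # e.g., a CLIP-style visual encoder
    compress(tokens, question) -> fewer tokens  # query-aware, roughly 10x reduction
    relevance(tokens, question) -> float        # predicted probability the image is relevant
    generate(kept_tokens, question) -> str      # the LLM produces the final answer
    All four callables are hypothetical placeholders, not MIRAGE's real interfaces.
    """
    kept = []
    for image in images:
        tokens = compress(encode(image), question)          # shrink the per-image token budget
        if relevance(tokens, question) >= keep_threshold:   # drop likely-irrelevant images
            kept.append(tokens)
    return generate(kept, question)
```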

Results

We revisit the VHs benchmark with MIRAGE. In addition to being capable of handling 1K or 10K images, MIRAGE achieves state-of-the-art performance on most tasks, despite having a weaker single-image QA backbone with only 32 tokens per image!

(Figure: VHs benchmark results with MIRAGE included.)

We also benchmark MIRAGE and other LMM-based models on a variety of VQA tasks. On multi-image tasks, MIRAGE demonstrates strong recall and precision capabilities, significantly outperforming strong competitors like GPT-4, Gemini-v1.5, and the Large World Model (LWM). Additionally, it shows competitive single-image QA performance.

(Figure: Results on multi-image and single-image VQA benchmarks.)

Finally, we compare MIRAGE's co-trained retriever with CLIP. Our retriever performs significantly better than CLIP without losing efficiency. This shows that while CLIP models can be good retrievers for open-vocabulary image retrieval, they may not work well when dealing with question-like texts!
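
For context, a CLIP-based retrieval baseline scores each haystack image directly against the question text. The minimal sketch below uses the Hugging Face `transformers` CLIP classes; the checkpoint choice and top-k cutoff are our assumptions, and this is not the exact baseline code from the paper. One plausible reason CLIP lags here is that its text encoder was trained on short, object-style prompts rather than full question sentences.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant would illustrate the same idea.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_retrieve(image_paths, question, top_k=5):
    """Rank haystack images by CLIP similarity to the question text."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[question], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = outputs.logits_per_image.squeeze(-1)  # one similarity score per image
    ranked = scores.argsort(descending=True)[:top_k]
    return [image_paths[i] for i in ranked.tolist()]
```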

(Figure: Comparison of MIRAGE's co-trained retriever with CLIP.)

Ready to get started?

Get VHs now on Hugging Face, check out our project page and arXiv paper, and star our GitHub repo!
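
As a starting point, the benchmark files can be pulled down with the `huggingface_hub` client. The repository ID below is our guess for illustration only; please confirm the official dataset name on the project page or GitHub README before running.

```python
from huggingface_hub import snapshot_download

# NOTE: the repo ID below is a hypothetical placeholder; check the VHs project page
# or GitHub README for the official dataset repository name.
local_dir = snapshot_download(
    repo_id="tsunghanwu/visual_haystacks",  # assumed identifier
    repo_type="dataset",
)
print("Benchmark files downloaded to:", local_dir)
```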

Did you find this work useful? Cite us!

@article{wu2024visualhaystacks,
  title={Visual Haystacks: Answering Harder Questions About Sets of Images},
  author={Wu, Tsung-Han and Biamby, Giscard and Quenum, Jerome and Gupta, Ritwik and Gonzalez, Joseph E and Darrell, Trevor and Chan, David M},
  journal={arXiv preprint arXiv:2407.13766},
  year={2024}
}