arxiv:2312.12436

A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

Published on Dec 19, 2023 · Featured in Daily Papers on Dec 20, 2023

Abstract

The surge of interest in Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM built from the ground up for multi-modality. In light of its superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, which comprehensively covers four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini showcase different answering styles and preferences, they can exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them in domain generalizability. Specifically, GPT-4V tends to give detailed explanations and intermediate steps, while Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates the potential of Gemini to be a strong challenger to GPT-4V. Our early investigation of Gemini also reveals some common issues of MLLMs, indicating that there still remains a considerable distance towards artificial general intelligence. Our project for tracking the progress of MLLMs is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
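At its core, the paper's qualitative comparison amounts to sending the same image and question to each model and contrasting their answers. The snippet below is a minimal sketch of that setup, not code from the paper: it assumes the public google-generativeai and openai Python SDKs, the model names gemini-pro-vision and gpt-4-vision-preview, API keys in environment variables, and a hypothetical local test image.

```python
# Minimal sketch (not from the paper): query Gemini Pro Vision and GPT-4V(ision)
# with the same image/question pair to compare answering styles side by side.
# Model names, SDK calls, and environment variables are assumptions based on the
# public google-generativeai and openai Python SDKs.
import base64
import os

import google.generativeai as genai
from openai import OpenAI
from PIL import Image

QUESTION = "How many people are in this image? Answer briefly."
IMAGE_PATH = "sample.jpg"  # hypothetical local test image

# Query Gemini Pro Vision via the google-generativeai SDK.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini = genai.GenerativeModel("gemini-pro-vision")
gemini_answer = gemini.generate_content([QUESTION, Image.open(IMAGE_PATH)]).text

# Query GPT-4V via the openai SDK (reads OPENAI_API_KEY from the environment).
client = OpenAI()
with open(IMAGE_PATH, "rb") as f:
    b64_image = base64.b64encode(f.read()).decode()
gpt4v_answer = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": QUESTION},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }],
).choices[0].message.content

# Contrast answering styles: the paper finds GPT-4V tends to explain its steps,
# while Gemini tends to answer tersely.
print("Gemini Pro Vision:", gemini_answer)
print("GPT-4V:", gpt4v_answer)
```

The paper probes many such prompts across its four domains and additionally scores the models quantitatively on MME; the sketch only illustrates the side-by-side querying pattern.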

Community

It has been less than two weeks.

God, y'all are fast.

Thank you for sharing. These are insightful test cases for benchmarking and evaluating models as multimodal AI evolves so rapidly. I believe that in this environment, where new models and breakthroughs come out every day, we risk becoming too obsessed with comparing models and technologies, compulsively making binary choices ('this response is good' vs. 'that response is bad'; 'this text conveys truth' vs. 'that text lies'; 'this image is "authentic"' vs. 'that image is "fake"'...). These dynamics in the design and deployment of AI tools seem to push us to a point where we might no longer know whether it's large language models making choices through probabilistic tokenization, or us choosing models based on simplistic quantitative assessments of whether one model is 'better' than another. The kind of impact generative AI is having on the way we interact with technology raises some deep philosophical questions.

However, we tech people love competitions and statistics, so we enjoy benchmarks and leaderboards. I just wanted to share a couple of links that might interest those who liked this paper:

WildVision Arena, a Space where you can compare responses from GPT-4V, Gemini Pro Vision, and other vision-language chatbots (LLaVA, CogVLM, Qwen-VL...) side by side and vote: https://huggingface.co/spaces/WildVision/vision-arena

A review of WildVision Arena and a few model comparison use cases I recently discussed on my blog, plus a meme based on my (naturally biased) findings :) https://talkingtochatbots.com/vision

