Abstract
Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot use such a model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of proprietary and open-source large multimodal models (LMMs), researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed; beyond general tasks, LMMs struggle with more specialized downstream tasks, one of which is geolocalization. In this work, we propose to solve this problem by introducing GAEA, a conversational model that can provide information about the location of an image, as required by a user. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA, a comprehensive dataset with 800K images and around 1.6M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark comprising 4K image-text pairs with diverse question types to evaluate conversational capabilities. We evaluate 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model, GPT-4o, by 8.28%. Our dataset, model, and code are publicly available.
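As a rough illustration of how OSM attributes can seed question-answer pairs of this kind, the minimal sketch below queries the public Overpass API for amenities near an image's GPS coordinates and turns the result into one QA pair. This is an assumption-laden sketch, not the paper's pipeline: the actual tag selection, question templates, and context-clue handling used to build the GAEA dataset are not specified here, and `make_qa_pair` is a hypothetical helper.

```python
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public Overpass endpoint

def nearby_amenities(lat, lon, radius_m=500):
    """Fetch OSM nodes carrying an 'amenity' tag within radius_m of a coordinate."""
    query = f"""
    [out:json][timeout:25];
    node(around:{radius_m},{lat},{lon})["amenity"];
    out body;
    """
    response = requests.get(OVERPASS_URL, params={"data": query})
    response.raise_for_status()
    return response.json()["elements"]

def make_qa_pair(lat, lon):
    """Hypothetical helper: turn nearby OSM attributes into one QA pair."""
    amenities = nearby_amenities(lat, lon)
    named = [n["tags"].get("name") for n in amenities if "name" in n.get("tags", {})]
    question = "What amenities can be found near the place shown in this image?"
    answer = ", ".join(named[:5]) if named else "No named amenities are mapped nearby."
    return {"question": question, "answer": answer}

if __name__ == "__main__":
    # Example coordinate near the Eiffel Tower
    print(make_qa_pair(48.8584, 2.2945))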
Community
Specialized geolocalization models lack geographical understanding of the predicted locations beyond their GPS coordinates. In contrast, despite their conversational capability, visually and textually prompted large language models (LLMs) and their multimodal variants, popularly referred to as large multimodal models (LMMs), fail to capture fine-grained nuances from an image in specialized downstream tasks such as geolocalization, making their predictions vastly imprecise and, in many cases, worse than random guesses. Motivated by these aspects, we propose GAEA, an open-source conversational model with global-scale geolocalization capability. To the best of our knowledge, this is the first work in the ground-view geolocalization domain to introduce an open-source conversational chatbot with which the user can obtain the geolocalization of an image, a relevant description of it, and engage in a meaningful conversation about the surrounding landmarks, natural attractions, restaurants or coffee shops, medical or emergency facilities, and recreational areas.
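How "vastly imprecise" a GPS prediction is can be made concrete with the haversine great-circle distance between the predicted and ground-truth coordinates, a standard error measure in geolocalization benchmarks. Whether GAEA's evaluation uses exactly this metric or these distance thresholds is an assumption here; the sketch below only shows the common convention.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# A prediction ~1,145 km off (roughly NYC -> Chicago) would typically count as
# correct only at the coarsest "continent" scale; threshold conventions vary.
print(round(haversine_km(40.7128, -74.0060, 41.8781, -87.6298)))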
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework (2025)
- EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering (2025)
- Evaluating Precise Geolocation Inference Capabilities of Vision Language Models (2025)
- Aligning Multimodal LLM with Human Preference: A Survey (2025)
- CommGPT: A Graph and Retrieval-Augmented Multimodal Communication Foundation Model (2025)
- VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search (2025)
- ChatBEV: A Visual Language Model that Understands BEV Maps (2025)