🌍 GroundSet Baseline: LLaVA-1.6 for Spatial Understanding in Earth Observation
This repository hosts the official baseline model for GroundSet, a large-scale Earth Observation dataset grounded in verifiable cadastral vector data.
This baseline is fine-tuned on 1.8 million instructions from GroundSet's fine-tuning dataset.
🎯 Supported Spatial Tasks
The model has been explicitly trained to handle highly granular semantic categories (135 classes, including specific crop types, heritage sites and civil infrastructure) across the following tasks:
- Captioning: Generating coherent scene descriptions.
- Localized Classification: Classifying given regions (bounding boxes or polygons).
- Object Detection: Localizing specific classes using Horizontal Bounding Boxes (HBB).
- Segmentation: Localizing target classes using polygonal masks.
- Referring Expression Comprehension (REC): Localizing objects based on textual descriptions.
- Visual Question Answering (VQA): Binary verification of object presence.
🏗️ Model Architecture & Training
The model is built upon the LLaVA-1.6 architecture (using the Vicuna 7B language model). This architecture relies on dynamic resolution (AnyRes) for processing high-resolution aerial imagery.
Training Details
- Hardware: Fine-tuned on 8x A100 (80GB) GPUs for 1 epoch (approx. 72 hours).
- Method: Parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation).
- LoRA Config: Rank $r=32$, alpha $\alpha=64$ and dropout $p=0.1$ applied to all linear layers of the language model.
- Projector & Vision Tower: The multi-modal projector was fully fine-tuned, while the vision tower remained frozen.
- Optimization: AdamW optimizer (batch size 128), cosine learning rate scheduler (peak LR $2 \times 10^{-4}$, warmup ratio 0.03), BFloat16 precision, DeepSpeed ZeRO-2 and FlashAttention-2.
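To make the optimization settings concrete, the sketch below derives the approximate number of optimizer steps and the learning rate at a given step. It rests on several assumptions that the card does not state: that 128 is the global batch size, that the 1.8 million figure is an exact sample count, that warmup is linear, and that the cosine decays to zero (as in the usual Hugging Face cosine schedule). The function name `lr_at` is illustrative only.

```python
import math

NUM_SAMPLES = 1_800_000  # "1.8 million instructions" (approximate figure)
BATCH_SIZE = 128         # assumed to be the global batch size
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03

# One epoch: number of optimizer steps and linear-warmup steps.
total_steps = math.ceil(NUM_SAMPLES / BATCH_SIZE)
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay to 0)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_steps, warmup_steps)  # about 14k steps, a few hundred of them warmup
print(lr_at(warmup_steps))        # peak learning rate
```

Under the same assumption, a global batch of 128 on 8 GPUs corresponds to 16 sequences per GPU per optimizer step, however that is split between per-device batch size and gradient accumulation.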
💡 Note: Pixel coordinates are discretized into [0-1000] bins.
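As an illustration of the note above, here is a minimal sketch of converting between pixel coordinates and the 0-1000 scale. The details are assumptions, not taken from the training code: coordinates are normalized by the image width and height, rounded to the nearest integer, and boxes are ordered as `[x_min, y_min, x_max, y_max]`. All function names are hypothetical; the official repository remains the reference for the exact convention.

```python
NUM_BINS = 1000  # coordinates are expressed on a 0-1000 scale


def pixel_to_bin(value: float, size: int) -> int:
    """Map a pixel coordinate in [0, size] to an integer in [0, NUM_BINS]."""
    scaled = round(value / size * NUM_BINS)
    return max(0, min(NUM_BINS, scaled))  # clamp out-of-range inputs


def bin_to_pixel(bin_value: int, size: int) -> float:
    """Map a discretized coordinate back to pixel space."""
    return bin_value / NUM_BINS * size


def box_to_bins(box, width: int, height: int):
    """Discretize an [x_min, y_min, x_max, y_max] pixel box (assumed order)."""
    x_min, y_min, x_max, y_max = box
    return [
        pixel_to_bin(x_min, width),
        pixel_to_bin(y_min, height),
        pixel_to_bin(x_max, width),
        pixel_to_bin(y_max, height),
    ]


def bins_to_box(bins, width: int, height: int):
    """Convert a discretized box back to pixel coordinates."""
    x_min, y_min, x_max, y_max = bins
    return [
        bin_to_pixel(x_min, width),
        bin_to_pixel(y_min, height),
        bin_to_pixel(x_max, width),
        bin_to_pixel(y_max, height),
    ]


# Example on a hypothetical 512x512 patch.
print(box_to_bins([128, 64, 384, 256], 512, 512))
```

The same conversion would apply per vertex to polygon outputs, under the same assumptions.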
📊 Evaluation & Results
The GroundSet baseline provides a reference point for spatial understanding in Earth Observation.
Key Performance Highlights
The model is evaluated zero-shot on GroundSet's test set, achieving the following results:
- Classification: 94.18% (Acc@0.8), outperforming Gemini-2.5 Flash (49.84%) and LLaVA-1.6 base (29.20%).
- Segmentation: 39.45 (F1@0.5), outperforming PaliGemma-2 (17.47) and Ferret (13.87).
- Object Detection: 49.47 (F1@0.5), outperforming Remote Sensing specialists like GeoChat (5.52) and SkySenseGPT (3.13).
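For readers unfamiliar with the F1@0.5 notation, the sketch below shows one common way to compute an F1 score at an IoU threshold of 0.5 for horizontal bounding boxes. It assumes greedy one-to-one matching in prediction order and boxes given as `[x_min, y_min, x_max, y_max]`; the official evaluation scripts may match differently, aggregate differently across images and classes, and report the score on a 0-100 scale. The function names are illustrative.

```python
def iou(a, b) -> float:
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds, gts, threshold: float = 0.5) -> float:
    """F1 with greedy one-to-one matching at the given IoU threshold."""
    unmatched = list(gts)
    true_pos = 0
    for pred in preds:
        # Find the best still-unmatched ground-truth box for this prediction.
        best_idx, best_iou = -1, 0.0
        for idx, gt in enumerate(unmatched):
            value = iou(pred, gt)
            if value > best_iou:
                best_idx, best_iou = idx, value
        if best_idx >= 0 and best_iou >= threshold:
            unmatched.pop(best_idx)  # each ground truth is matched at most once
            true_pos += 1
    false_pos = len(preds) - true_pos
    false_neg = len(gts) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0  # 0.0 by convention when empty


preds = [[0, 0, 10, 10], [50, 50, 60, 60]]
gts = [[0, 0, 10, 8], [100, 100, 110, 110]]
print(f1_at_iou(preds, gts))  # one match, one false positive, one false negative
```

The segmentation score uses the same thresholding idea, with IoU computed on polygons rather than boxes.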
Cross-Dataset Generalization
To assess generalization, the model was also evaluated zero-shot on the VRSBench dataset. Although this setting is strictly out-of-distribution (VRSBench focuses heavily on vehicles and planes, which are absent from GroundSet's cadastral data), the model still outperformed leading RS specialists on core spatial grounding tasks such as Detection and REC.
💻 Usage Example
Because this model uses the LLaVA-1.6 architecture, it can be easily loaded using the standard Hugging Face transformers library via the LlavaNext implementation.
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

# 1. Load the processor and model
model_id = "RogerFerrod/GroundSet-LLaVA-1.6-7B"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

# 2. Load an image and define the question
image = Image.open("path_to_aerial_patch.png")
question = "Detect all instances of Building in this image and provide their bounding boxes."

# 3. Format the prompt using the official chat template
msgs = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]
prompt = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

# 4. Process the inputs
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

# 5. Generate and decode the output
output = model.generate(**inputs, max_new_tokens=100)
generated_text = processor.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
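Since `generate` returns the prompt tokens followed by the newly generated ones for decoder-only models, the decoded string in the example above also contains the prompt. A minimal sketch for keeping only the model's answer is shown below; the helper name `new_tokens_only` is hypothetical, and it works on any sliceable sequence of token ids, including a 1-D tensor.

```python
def new_tokens_only(output_ids, prompt_length: int):
    """Return the generated continuation, dropping the echoed prompt tokens."""
    return output_ids[prompt_length:]


# Usage with the objects from the example above (not executed here):
# prompt_length = inputs["input_ids"].shape[1]
# answer_ids = new_tokens_only(output[0], prompt_length)
# answer = processor.decode(answer_ids, skip_special_tokens=True)
```

Dense scenes can also produce long outputs, so `max_new_tokens` may need to be raised if the response appears cut off.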
💡 Note: For a complete, reproducible inference pipeline and the exact scripts used to compute the benchmark metrics reported in the paper, please refer to the official GitHub repository.
📝 Citation
If you utilize this model or the associated dataset in your research, please consider citing the original work:
```bibtex
@article{groundset,
  title={GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data},
  author={Ferrod, Roger and Lecene, Ma{\"e}l and Sapkota, Krishna and Leifman, George and Silverman, Vered and Beryozkin, Genady and Lobry, Sylvain},
  journal={arXiv preprint},
  year={2026}
}
```
🙌 Acknowledgements
This work was supported by Google under a research collaboration agreement with Université Paris Cité. The underlying GroundSet dataset leverages official data from IGN (French National Institute of Geographic and Forest Information), specifically BD ORTHO® and BD TOPO®, released under Open Licence 2.0.