
Tutorial: EarthEmbeddingExplorer

Background

What is this project about?

EarthEmbeddingExplorer is a tool that lets you search satellite imagery using natural language, images, or geographic locations. In simple terms, you can enter prompts like “a satellite image of a glacier” or “a satellite image of a city with a coastline”, and the system will find places on Earth that match your description and visualize them on a map.

EarthEmbeddingExplorer enables users to explore the Earth in multiple ways without leaving their desk, and it can be useful for many geoscience tasks. For example, geologists can quickly locate glacier regions; biologists can rapidly map forest cover; and architects can study urban patterns across different parts of the world.

How does it work? (Core ideas)

Satellite imagery dataset

We use MajorTOM (Major TOM: Expandable Datasets for Earth Observation) released by the European Space Agency (ESA) [1]. Specifically, we use the Core-S2L2A subset.

| Dataset | Imagery source | Number of samples | Sensor type |
| --- | --- | --- | --- |
| MajorTOM-Core-S2L2A | Sentinel-2 Level 2A | 2,245,886 | Multispectral |

MajorTOM Core-S2L2A provides global Sentinel-2 multispectral imagery (10 m resolution). We convert the RGB bands into embeddings using CLIP-like models (e.g., SigLIP), which saves substantial time because we do not need to preprocess raw imagery ourselves. In addition, embeddings (vectors) are much smaller than raw imagery, and they are significantly faster to search.
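As a rough, back-of-the-envelope illustration of the size difference (the 768-dimensional float32 embedding is an assumption; the exact dimensionality depends on the checkpoint):

```python
# Rough size comparison: one 384x384 8-bit RGB crop vs. one embedding vector.
# The 768-dim float32 embedding is an assumption; actual size depends on the model.
crop_bytes = 384 * 384 * 3          # uncompressed RGB crop
embedding_bytes = 768 * 4           # float32 embedding

print(f"raw crop:  {crop_bytes / 1024:.0f} KiB")              # ~432 KiB
print(f"embedding: {embedding_bytes / 1024:.0f} KiB")         # ~3 KiB
print(f"embedding is ~{crop_bytes // embedding_bytes}x smaller")
```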

To keep EarthEmbeddingExplorer responsive, we build a smaller but representative version of the dataset.

The original tiles in Core-S2L2A are large (1068×1068 pixels), but most AI models expect smaller inputs (384×384 or 224×224 pixels).

  1. Cropping: for simplicity, from each original tile we only take the center 384×384 (or 224×224) crop to generate an embedding.
  2. Uniform sampling: using MajorTOM’s grid coding system, we sample 1% of the data (about 22,000 images). This preserves global coverage while keeping search fast.
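A minimal sketch of these two steps using plain NumPy (the helper below is illustrative, not the project's actual preprocessing code; the real sampling uses MajorTOM's grid coding rather than random indices):

```python
import numpy as np

def center_crop(tile: np.ndarray, size: int = 384) -> np.ndarray:
    """Return the central size x size window of an H x W x C tile."""
    h, w = tile.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return tile[top:top + size, left:left + size]

# 1. Cropping: a 1068x1068 Core-S2L2A tile becomes a 384x384 model input.
tile = np.zeros((1068, 1068, 3), dtype=np.uint8)   # placeholder tile
print(center_crop(tile).shape)                     # (384, 384, 3)

# 2. Uniform sampling: keep ~1% of the 2,245,886 tiles (about 22k images).
#    (Illustrative random sample; the pipeline samples on MajorTOM's grid.)
rng = np.random.default_rng(0)
sample_idx = rng.choice(2_245_886, size=int(2_245_886 * 0.01), replace=False)
print(len(sample_idx))                             # 22458
```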

Figure 1: Geographic distribution of our sampled satellite image embeddings.

Retrieval models

The core of image retrieval is a family of models known as CLIP (Contrastive Language-Image Pre-training) [2]. We use its improved variants such as SigLIP (Sigmoid Language-Image Pre-training) [3], FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining) [4], and SatCLIP (Satellite Location-Image Pretraining) [5].

An analogy: when teaching a child, you show a picture of a glacier and say “glacier”. After seeing many examples, the child learns to associate the visual concept with the word.

CLIP-like models learn in a similar way, but at much larger scale.

  • An image encoder turns an image into an embedding (a vector of numbers).
  • A text (or location) encoder turns text (or latitude/longitude) into an embedding.

The key property is: if an image matches a text description (or location), their embeddings will be close; otherwise they will be far apart.


Figure 2: How CLIP-like models connect images and text.
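To make this close-vs-far property concrete, here is a minimal sketch using the SigLIP checkpoint google/siglip-base-patch16-224 from Hugging Face transformers (the checkpoint and the image path are assumptions; the app may use a different SigLIP variant):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

CKPT = "google/siglip-base-patch16-224"   # assumed checkpoint
model = SiglipModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

image = Image.open("sample_tile.png")     # placeholder path to an RGB crop
texts = ["a satellite image of a glacier", "a satellite image of a desert"]

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=texts, padding="max_length", return_tensors="pt")
    )

# Normalize to unit length so the dot product is cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)  # the matching description should score higher
```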

The three models we use differ in their encoders and training data:

| Model | Encoder type | Training data |
| --- | --- | --- |
| SigLIP | image encoder + text encoder | natural image–text pairs from the web |
| FarSLIP | image encoder + text encoder | satellite image–text pairs |
| SatCLIP | image encoder + location encoder | satellite image–location pairs |

Figure 3: Converting satellite images into embedding vectors.

In EarthEmbeddingExplorer:

  1. We precompute embeddings for ~22k globally distributed satellite images using SigLIP, FarSLIP, and SatCLIP.
  2. When you provide a query (text like “a satellite image of a glacier”, an image, or a (latitude, longitude) location such as (-89, 120)), we encode the query into an embedding using the corresponding encoder.
  3. We compare the query embedding with all image embeddings, visualize similarities on a map, and show the top-5 most similar images.
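A minimal sketch of step 3, assuming an L2-normalized matrix of precomputed image embeddings (the shapes and random placeholder data are illustrative):

```python
import numpy as np

# Placeholder stand-ins for the precomputed embedding dataset (~22k x D).
rng = np.random.default_rng(0)
image_embeddings = rng.standard_normal((22_000, 768)).astype(np.float32)
image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)

# Placeholder query embedding (would come from the text/image/location encoder).
query = rng.standard_normal(768).astype(np.float32)
query /= np.linalg.norm(query)

scores = image_embeddings @ query           # cosine similarity with every image
top5 = np.argsort(scores)[::-1][:5]         # indices of the five most similar images
print(top5, scores[top5])
```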

System architecture


Figure 4: EarthEmbeddingExplorer system architecture on ModelScope.

We deploy EarthEmbeddingExplorer on ModelScope: the models, embedding datasets, and raw imagery datasets are all hosted on the platform. The app runs on xGPU, allowing flexible access to GPU resources and faster retrieval.

How is the raw imagery stored?

MajorTOM Core-S2L2A is large (about 23 TB), so we do not download the full dataset. Instead, the raw imagery is stored as Parquet shards:

  • Shard storage: the dataset is split into many remote Parquet files (shards), each containing a subset of the samples.
  • Columnar storage: different fields/bands (e.g., B04/B03/B02, thumbnail) are stored as separate columns; we only read what we need.
  • Metadata index: we maintain a small index table mapping product_id → (parquet_url, parquet_row) so the system can locate “which shard and which position” contains a given image.

With this design, when a user only needs a few images from the retrieval results, the system can use HTTP Range requests to download just a small byte range from a Parquet file (the target row group and the requested columns) rather than the full 23 TB dataset, which enables near real-time retrieval of raw images.
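A minimal sketch of this on-demand read with pyarrow and fsspec (the URL, row number, and column names are placeholders; the actual shard schema may differ):

```python
import fsspec
import pyarrow.parquet as pq

# Hypothetical metadata-index lookup result for one retrieved product_id.
parquet_url = "https://example.com/shards/part-00042.parquet"  # placeholder
parquet_row = 1234                                             # row within the shard

# fsspec's HTTP file issues Range requests, so only the needed bytes are fetched.
with fsspec.open(parquet_url, "rb") as f:
    pf = pq.ParquetFile(f)
    offset = 0
    for rg in range(pf.num_row_groups):
        n_rows = pf.metadata.row_group(rg).num_rows
        if parquet_row < offset + n_rows:
            # Read only the row group containing the target row, and only the RGB bands.
            table = pf.read_row_group(rg, columns=["B04", "B03", "B02"])
            record = table.slice(parquet_row - offset, 1)   # the single requested sample
            break
        offset += n_rows
```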

What happens when you use the app?

  1. Enter a query: you can enter text, upload an image, or input a latitude/longitude. You can also click on the map to use the clicked location as a query.
  2. Compute similarity: the app encodes your query into an embedding vector and computes similarity scores against all satellite image embeddings.
  3. Show results: the system filters out low-similarity results and shows the highest-scoring locations (and scores) on the map. You can adjust the threshold using a slider.
  4. Download raw images on demand: for the top-5 most similar images, the system looks up their parquet_url and row position via the metadata index, then uses HTTP Range to fetch only the required data (RGB bands) and displays the images quickly in the UI.
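To tie the four steps together, here is a minimal Gradio sketch of the query-to-results wiring, with stubbed data and a stub encoder in place of the real models and embedding dataset (every name below is illustrative, not the app's actual code):

```python
import gradio as gr
import numpy as np

# Stubs standing in for the precomputed, L2-normalized embeddings and their coordinates.
rng = np.random.default_rng(0)
EMBEDDINGS = rng.standard_normal((22_000, 768)).astype(np.float32)
EMBEDDINGS /= np.linalg.norm(EMBEDDINGS, axis=1, keepdims=True)
COORDS = np.stack([rng.uniform(-90, 90, 22_000), rng.uniform(-180, 180, 22_000)], axis=1)

def encode_text(query: str) -> np.ndarray:
    """Stub text encoder; the app would call SigLIP/FarSLIP here."""
    vec = rng.standard_normal(768).astype(np.float32)
    return vec / np.linalg.norm(vec)

def search(query: str, threshold: float) -> dict:
    scores = EMBEDDINGS @ encode_text(query)      # step 2: similarity against all embeddings
    kept = int((scores >= threshold).sum())       # step 3: slider-controlled threshold
    top5 = np.argsort(scores)[::-1][:5]           # step 4: candidates for on-demand download
    return {"kept_above_threshold": kept, "top5_lat_lon": COORDS[top5].round(2).tolist()}

demo = gr.Interface(
    fn=search,
    inputs=[gr.Textbox(label="Text query"),
            gr.Slider(0.0, 1.0, value=0.2, label="Similarity threshold")],
    outputs=gr.JSON(label="Results"),
)

if __name__ == "__main__":
    demo.launch()
```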

Examples


Figure 5: Search by text.


Figure 6: Search by image.


Figure 7: Search by location.

Limitations

While EarthEmbeddingExplorer has strong potential, it also has limitations. SigLIP is primarily trained on “natural images” from the internet (people, pets, cars, everyday objects) rather than satellite imagery. This domain gap can make it harder for the model to understand certain scientific terms or distinctive geographic patterns that are uncommon in typical web photos.

FarSLIP, on the other hand, may perform poorly on non-remote-sensing concepts, such as the text query “an image of a face”.

Acknowledgements

We thank the following open-source projects and datasets that made EarthEmbeddingExplorer possible:

Models:

  • SigLIP - Vision Transformer model for image-text alignment
  • FarSLIP - Fine-grained satellite image-text pretraining model
  • SatCLIP - Satellite location-image pretraining model

Datasets:

  • MajorTOM - Expandable datasets for Earth observation by ESA

We are grateful to the research communities and organizations that developed and shared these resources.

Contributors

Roadmap

  • Increase the geographical coverage (sampling rate) to 1.2% of the Earth's land surface.
  • Support the DINOv2 embedding model and embedding datasets.
  • Support FAISS for faster similarity search.
  • What features do you want? Leave an issue here!

We warmly welcome new contributors!

References

[1] Francis, A., & Czerkawski, M. (2024). Major TOM: Expandable Datasets for Earth Observation. IGARSS 2024.

[2] Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.

[3] Zhai, X., et al. (2023). Sigmoid Loss for Language-Image Pre-Training. ICCV 2023.

[4] Li, Z., et al. (2025). FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding. arXiv 2025.

[5] Klemmer, K., et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.

[6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.