Model Card for SatCLIP
Here, we provide accompanying information about our model SatCLIP. This repository is for the ViT16-L10 version of the model.
Model Details
Model Description
SatCLIP is a model for contrastive pretraining on satellite image-location pairs. Training is analogous to that of the popular CLIP model.
- Developed by: Konstantin Klemmer, Marc Russwurm, Esther Rolf, Caleb Robinson, Lester Mackey
- Model type: Location and image encoder model pretrained using contrastive image-location matching.
- License: MIT
Model Sources
- Repository: github.com/microsoft/satclip
- Paper: https://arxiv.org/abs/2311.17179
Uses
SatCLIP includes an image and a location encoder. The image encoder processes multi-spectral satellite images of size [height, width, 13] into [d]-dimensional latent vectors. The location encoder processes location coordinates [longitude, latitude] into the same [d]-dimensional space.
SatCLIP is a model trained and tested for use in research projects. It is not intended for use in production environments.
Downstream Use
The SatCLIP location encoder learns location characteristics, as captured by the satellite images, and can be deployed for downstream geospatial prediction tasks. Practically, this involves querying the location encoder for the [d]-dimensional vector embedding of each downstream location and then using that embedding as a predictor during downstream learning. In our paper, we show the usability of the learned location embeddings for predicting, e.g., population density or biomes.
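The downstream workflow above can be sketched as follows. This is a minimal illustration, not code from the SatCLIP repository: random vectors stand in for the embeddings you would obtain from the location encoder, the regression target is synthetic, and the embedding dimension `d = 256` is an assumption.

```python
import numpy as np

# Hypothetical stand-in: random vectors in place of real SatCLIP location
# embeddings (in practice, query the location encoder as shown above).
rng = np.random.default_rng(0)
d = 256                                   # embedding dimension (illustrative)
emb = rng.normal(size=(1000, d))          # one embedding per downstream location
y = emb @ rng.normal(size=d)              # synthetic downstream target

# Split into train/test and fit a closed-form ridge regression that uses
# the location embeddings as predictors.
X_tr, X_te, y_tr, y_te = emb[:800], emb[800:], y[:800], y[800:]
lam = 1.0
w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
mse = np.mean((X_te @ w - y_te) ** 2)
```

In practice any downstream model (linear probe, MLP, gradient boosting) can consume the embeddings in the same way.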
Use the encoder
```python
from huggingface_hub import hf_hub_download
from load import get_satclip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

c = torch.randn(32, 2)  # represents a batch of 32 locations (lon/lat)

model = get_satclip(
    hf_hub_download("microsoft/SatCLIP-ViT16-L10", "satclip-vit16-l10.ckpt"),
    device=device,
)  # only loads the location encoder by default
model.eval()

with torch.no_grad():
    emb = model(c.double().to(device)).detach().cpu()
```
Out-of-Scope Use
Potential use cases of SatCLIP which we did not build the model for and did not test include:
- The SatCLIP image encoder can in theory be used for helping with satellite image localization. If this application interests you, we encourage you to check work focusing on this, e.g. Cepeda et al. (2023).
- Fine-grained geographic problems (i.e. problems constrained to small geographic areas or including many close locations) are out of scope for SatCLIP. SatCLIP location encoders are pretrained for global-scale use.
- Any use outside of research projects is currently out of scope as we don't evaluate SatCLIP in production environments.
Bias, Risks, and Limitations
The following aspects should be considered before using SatCLIP:
- SatCLIP is trained with freely available Sentinel-2 satellite imagery at a resolution of 10 m per pixel. This allows the model to learn larger structures like cities or mountain ranges, but not small-scale structures like individual vehicles or people. SatCLIP models are therefore not applicable to fine-grained geospatial problems.
- Location embeddings from SatCLIP only capture location characteristics that are visually represented in satellite imagery (at our given resolution). Applications to problems that cannot be captured through satellite images are out of scope for SatCLIP.
- Use cases in the defense or surveillance domain are always out of scope, regardless of SatCLIP's performance. The use of artificial intelligence for such tasks is currently premature given the lack of testing norms and checks to ensure its fair use.
How to Get Started with the Model
Information about how to get started with SatCLIP training and deployment in downstream modelling can be found in our GitHub repository at github.com/microsoft/satclip.
Training Details
Training Data
SatCLIP is trained using the S2-100K dataset which samples 100,000 multi-spectral satellite image scenes from Sentinel-2 via the Microsoft Planetary Computer. Scenes are sampled approximately uniformly over landmass and are only chosen for the dataset if they don't exhibit cloud coverage. More details can be found in our paper.
Training Procedure
SatCLIP is trained via contrastive learning, by matching the correct image-location pairs in a batch of images and locations. Each image and each location is processed by its encoder and transformed into a [d]-dimensional embedding. The training objective is to maximize the cosine similarity of matching image-location embeddings while minimizing it for non-matching pairs.
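This symmetric contrastive objective can be sketched as a CLIP-style InfoNCE loss. The sketch below is an illustration of the technique, not the repository's implementation; the temperature value is an assumption.

```python
import numpy as np

def clip_loss(img_emb, loc_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of image-location
    pairs, in the style of CLIP. Matching pairs sit on the diagonal of the
    similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    loc = loc_emb / np.linalg.norm(loc_emb, axis=1, keepdims=True)
    logits = img @ loc.T / temperature      # scaled cosine similarities
    n = logits.shape[0]

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the matching pair as the target.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Symmetrize over image-to-location and location-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pushes matching image and location embeddings together and non-matching pairs apart in the shared [d]-dimensional space.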
Training Hyperparameters
The key hyperparameters of SatCLIP are the batch size, learning rate, and weight decay. In addition, the specific location and vision encoders come with their own hyperparameters. Key hyperparameters for the location encoder include resolution-specific hyperparameters in the positional encoding (e.g., the number of Legendre polynomials used for the spherical harmonics calculation) and the type, number of layers, and capacity of the neural network deployed. For the vision encoder, key hyperparameters depend on the type of vision backbone deployed (e.g., ResNet, Vision Transformer). More details can be found in our paper.
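As an illustration only, such a hyperparameter set could be collected in a configuration like the one below. All names and values here are hypothetical examples, not the settings used in the SatCLIP repository; see the paper and repository for the actual values.

```python
# Hypothetical hyperparameter configuration (illustrative values only).
config = {
    # shared training hyperparameters
    "batch_size": 512,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    # location encoder: positional encoding + neural network
    "num_legendre_polynomials": 10,  # resolution of the spherical harmonics
    "net_type": "mlp",
    "net_layers": 2,
    "net_hidden_dim": 512,
    # vision encoder backbone
    "vision_backbone": "vit16",
}
```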
Training Speed
Training SatCLIP for 500 epochs using pretrained vision encoders takes roughly 2 days on a single A100 GPU.
Evaluation
SatCLIP can be evaluated throughout training and during downstream deployment. During training, we log model loss on a held-out, unseen validation set to monitor the training process for potential overfitting. When SatCLIP embeddings are used in downstream applications, any predictive score can be used for evaluation, e.g. mean squared error (MSE) for regression or accuracy for classification problems.
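The downstream metrics mentioned above can be sketched as follows; this is a generic illustration, not an evaluation script from the repository.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, e.g. for downstream regression tasks."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, e.g. for downstream classification."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```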
Citation
BibTeX:
```bibtex
@article{klemmer2023satclip,
  title={SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery},
  author={Klemmer, Konstantin and Rolf, Esther and Robinson, Caleb and Mackey, Lester and Russwurm, Marc},
  journal={TBA},
  year={2023}
}
```
Model Card Contact
For feedback and comments, contact kklemmer@microsoft.com.