---
tags:
- geospatial
license: mit
---

# Model Card for SatCLIP

Here, we provide accompanying information about our model SatCLIP. This repository is for the ResNet50-L10 version of the model.

## Model Details

### Model Description

SatCLIP is a model for contrastive pretraining of satellite image-location pairs. Training is analogous to that of the popular [CLIP](https://github.com/openai/CLIP) model.

- **Developed by:** Konstantin Klemmer, Marc Russwurm, Esther Rolf, Caleb Robinson, Lester Mackey
- **Model type:** Location and image encoder model pretrained using contrastive image-location matching.
- **License:** MIT

### Model Sources

- **Repository:** [github.com/microsoft/satclip](https://github.com/microsoft/satclip)
- **Paper:** https://arxiv.org/abs/2311.17179

## Uses

SatCLIP includes an *image* and a *location* encoder. The image encoder processes multi-spectral satellite images of size `[height, width, 13]` into `[d]`-dimensional latent vectors. The location encoder processes location coordinates `[longitude, latitude]` into the same `[d]`-dimensional space.
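
To make the interface concrete, here is a minimal shape sketch using plain tensors. The sizes and the channels-last image layout simply follow the description above; none of the names below are part of the SatCLIP API.

```python
import torch

batch, height, width, d = 32, 256, 256, 256  # illustrative sizes only

# A batch of multi-spectral satellite images with 13 spectral bands per pixel.
images = torch.randn(batch, height, width, 13)

# A batch of [longitude, latitude] coordinate pairs.
coords = torch.rand(batch, 2)

# Conceptually, both encoders map into the same d-dimensional space:
#   image encoder:    [batch, height, width, 13] -> [batch, d]
#   location encoder: [batch, 2]                 -> [batch, d]
```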

SatCLIP is a model trained and tested for use in research projects. It is not intended for use in production environments.

### Downstream Use

The SatCLIP location encoder learns location characteristics, as captured by the satellite images, and can be deployed for downstream geospatial prediction tasks. Practically, this involves *querying* the location encoder for the `[d]`-dimensional vector embedding of each downstream location and then using that embedding as a predictor during downstream learning. In our paper, we demonstrate the usefulness of the learned location embeddings for predicting e.g. population density or biomes. A sketch of this downstream workflow follows the encoder example below.

#### Use the encoder

```python
from huggingface_hub import hf_hub_download
from load import get_satclip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

c = torch.randn(32, 2)  # Represents a batch of 32 locations (lon/lat)

model = get_satclip(
    hf_hub_download("microsoft/SatCLIP-ResNet50-L10", "satclip-resnet50-l10.ckpt"),
    device=device,
)  # Only loads location encoder by default
model.eval()

with torch.no_grad():
    emb = model(c.double().to(device)).detach().cpu()
```
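
Building on the snippet above (reusing its `model` and `device`), here is a minimal sketch of the downstream workflow: embed a set of locations and fit a simple regressor on the embeddings. The synthetic coordinates, the random `labels`, and the choice of scikit-learn's `Ridge` are illustrative assumptions, not part of SatCLIP.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical downstream data: one target value per location (e.g. population density).
coords = torch.rand(500, 2) * torch.tensor([360.0, 180.0]) - torch.tensor([180.0, 90.0])
labels = np.random.rand(500)

# Query the location encoder for the [d]-dimensional embeddings of all locations.
with torch.no_grad():
    embeddings = model(coords.double().to(device)).cpu().numpy()

# Use the embeddings as predictors in a standard supervised learning setup.
X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, random_state=0)
regressor = Ridge().fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, regressor.predict(X_test)))
```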

### Out-of-Scope Use

Potential use cases of SatCLIP which we did not build the model for and did not test for include:
* The SatCLIP image encoder can in theory be used to help with satellite image localization. If this application interests you, we encourage you to check work focusing on this, e.g. [Cepeda et al. (2023)](https://arxiv.org/abs/2309.16020).
* Fine-grained geographic problems (i.e. problems constrained to small geographic areas or involving many close locations) are out of scope for SatCLIP. SatCLIP location encoders are pretrained for global-scale use.
* Any use outside of research projects is currently out of scope as we don't evaluate SatCLIP in production environments.

## Bias, Risks, and Limitations

The following aspects should be considered before using SatCLIP:
* SatCLIP is trained with freely available Sentinel-2 satellite imagery with a resolution of 10m per pixel. This allows the model to learn larger structures like cities or mountain ranges, but not small-scale structures like individual vehicles or people. SatCLIP models are not applicable to fine-grained geospatial problems.
* Location embeddings from SatCLIP only capture location characteristics that are visually represented in satellite imagery (at our given resolution). Applications to problems that cannot be captured through satellite images are out-of-scope for SatCLIP.
* Use cases in the defense or surveillance domain are always out-of-scope, regardless of SatCLIP's performance. The use of artificial intelligence for such tasks is currently premature given the lack of testing norms and checks to ensure its fair use.

## How to Get Started with the Model

Information about how to get started with SatCLIP training and deployment in downstream modelling can be found in our GitHub repository at [github.com/microsoft/satclip](https://github.com/microsoft/satclip).

## Training Details

### Training Data

SatCLIP is trained using the *S2-100K* dataset, which samples 100,000 multi-spectral satellite image scenes from Sentinel-2 via the [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/). Scenes are sampled approximately uniformly over landmass and are only chosen for the dataset if they don't exhibit cloud coverage. More details can be found in our paper.
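
For intuition, here is a hedged sketch of how low-cloud Sentinel-2 scenes can be queried from the Planetary Computer STAC API using the `pystac-client` and `planetary-computer` packages. This is not the sampling code used to build S2-100K; the collection name, query point, date range, and cloud-cover threshold are assumptions for illustration.

```python
import planetary_computer
import pystac_client

# Open the Planetary Computer STAC catalog; signing enables access to the assets.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

# Search Sentinel-2 L2A scenes around a point, keeping only nearly cloud-free scenes.
search = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects={"type": "Point", "coordinates": [13.4, 52.5]},  # [longitude, latitude]
    datetime="2021-01-01/2021-12-31",
    query={"eo:cloud_cover": {"lt": 10}},
)
items = list(search.items())
print(f"Found {len(items)} candidate scenes")
```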

### Training Procedure

SatCLIP is trained via contrastive learning, by matching the correct image-location pairs within a batch of images and locations. Each image and each location is processed by its encoder and transformed into a `[d]`-dimensional embedding. The training objective maximizes the cosine similarity of matching image-location embeddings while minimizing the similarity of non-matching pairs in the batch.
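
The following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of image and location embeddings. It assumes both embeddings already have shape `[batch, d]` and illustrates the general recipe rather than the exact implementation in our repository.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, loc_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: matching image-location pairs sit on the diagonal."""
    # Normalize so that dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    loc_emb = F.normalize(loc_emb, dim=-1)

    # [batch, batch] similarity matrix; entry (i, j) compares image i with location j.
    logits = image_emb @ loc_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image -> location and location -> image).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Example with random embeddings of dimension d = 256:
loss = clip_style_loss(torch.randn(32, 256), torch.randn(32, 256))
```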

#### Training Hyperparameters

The key hyperparameters of SatCLIP are batch size, learning rate, and weight decay. On top of this, the specific location and vision encoders come with their own hyperparameters. Key hyperparameters for the location encoder include resolution-specific hyperparameters in the positional encoding (e.g. the number of Legendre polynomials used for the spherical harmonics calculation) and the type, number of layers, and capacity of the neural network deployed. For the vision encoder, key hyperparameters depend on the type of vision backbone deployed (e.g. ResNet, Vision Transformer). More details can be found in our paper.
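
As an illustration of how these knobs group together, here is a hypothetical configuration sketch. The field names and all values are placeholders for exposition, not the settings used to train this checkpoint; the authoritative configuration lives in the GitHub repository.

```python
from dataclasses import dataclass

@dataclass
class SatCLIPTrainConfig:
    """Illustrative grouping of the hyperparameters described above (placeholder values)."""
    # Optimization
    batch_size: int = 8192
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    # Location encoder: positional-encoding resolution and network capacity
    legendre_polys: int = 10        # resolution; presumably the "L10" in this checkpoint's name
    num_hidden_layers: int = 2
    hidden_dim: int = 256
    # Vision encoder backbone
    vision_backbone: str = "resnet50"

config = SatCLIPTrainConfig()
```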

#### Training Speed

Training SatCLIP for 500 epochs using pretrained vision encoders takes roughly 2 days on a single A100 GPU.

## Evaluation

SatCLIP can be evaluated throughout training and during downstream deployment. During training, we log model loss on a held-out, unseen validation set to monitor the training process for potential overfitting. When SatCLIP embeddings are used in downstream applications, any predictive score can be used for evaluation, e.g. mean squared error (MSE) for regression or accuracy for classification problems.

## Citation

**BibTeX:**
```bibtex
@article{klemmer2023satclip,
  title={SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery},
  author={Klemmer, Konstantin and Rolf, Esther and Robinson, Caleb and Mackey, Lester and Russwurm, Marc},
  journal={TBA},
  year={2023}
}
```

## Model Card Contact

For feedback and comments, contact [kklemmer@microsoft.com](mailto:kklemmer@microsoft.com).