geolocal committed on
Commit 9db4045 (1 parent: ae41d0e)

Update README.md

Files changed (1)
  1. README.md +47 -27
README.md CHANGED
@@ -27,52 +27,62 @@ tags:
  StreetCLIP is a robust foundation model for open-domain image geolocalization and other
  geographic and climate-related tasks.

- Trained on a dataset of 1.1 million geo-tagged images, it achieves state-of-the-art performance
- on multiple open-domain image geolocalization benchmarks in zero-shot, outperforming supervised models
- trained on millions of images.

- # Model Details

- ## Model Description

- <!-- Provide a longer summary of what this model is. -->
-
-
- - **Developed by:** Authors not disclosed
  - **Model type:** [CLIP](https://openai.com/blog/clip/)
  - **Language:** English
  - **License:** Creative Commons Attribution-NonCommercial 4.0
- - **Finetuned from model:** [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)

  ## Model Sources

  - **Paper:** Pre-print available soon ...
- - **Demo:** Currently in development ...

  # Uses

- To be added soon ...

  ## Direct Use

- To be added soon ...

  ## Downstream Use

- To be added soon ...

  ## Out-of-Scope Use

- To be added soon ...

  # Bias, Risks, and Limitations

- To be added soon ...

  ## Recommendations
-
- To be added soon ...

  ## How to Get Started with the Model

@@ -102,14 +112,23 @@ probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the lab

  ## Training Data

- StreetCLIP was trained on an undisclosed street-level dataset of 1.1 million real-world,
- urban and rural images. The data used to train the model comes from 101 countries.

  ## Training Procedure

- ### Preprocessing

- Same preprocessing as [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336).

  # Evaluation

@@ -121,12 +140,17 @@ identify the correct country and then city of geographical image origin.

  ### Testing Data

  * [IM2GPS](http://graphics.cs.cmu.edu/projects/im2gps/).
  * [IM2GPS3K](https://github.com/lugiavn/revisiting-im2gps)

  ### Metrics

- To be added soon ...

  ## Results

@@ -143,10 +167,6 @@ achieving SOTA performance on a selection of benchmark metrics.
  - **Hardware Type:** 4 NVIDIA A100 GPUs
  - **Hours used:** 12

- # Example Image Attribution
-
- To be added soon ...
-
  # Citation

  Preprint available soon ...
 
  StreetCLIP is a robust foundation model for open-domain image geolocalization and other
  geographic and climate-related tasks.

+ Trained on an original dataset of 1.1 million street-level urban and rural geo-tagged images, it achieves
+ state-of-the-art performance on multiple open-domain image geolocalization benchmarks in a zero-shot setting,
+ outperforming supervised models trained on millions of images.

+ # Model Description

+ StreetCLIP is pretrained by deriving image captions synthetically from image class labels using
+ a domain-specific caption template. This allows StreetCLIP to transfer its generalized zero-shot learning
+ capabilities to a specific domain (i.e. the domain of image geolocalization).
+ StreetCLIP builds on OpenAI's pretrained large version of CLIP ViT, using 14x14 pixel
+ patches and images with a 336 pixel side length.
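
As a rough illustration of this idea, the sketch below derives a synthetic caption from hierarchical location labels. The helper name and template string are hypothetical; the exact caption template used for StreetCLIP is described in the forthcoming paper.

```python
# Hypothetical synthetic-caption construction from class labels (illustrative only).
def make_caption(city: str, region: str, country: str) -> str:
    """Turn hierarchical location labels into a CLIP-style caption."""
    return f"A street-level photo taken in {city}, {region}, {country}."

print(make_caption("Lyon", "Auvergne-Rhone-Alpes", "France"))
# -> A street-level photo taken in Lyon, Auvergne-Rhone-Alpes, France.
```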

+ ## Model Details

  - **Model type:** [CLIP](https://openai.com/blog/clip/)
  - **Language:** English
  - **License:** Creative Commons Attribution-NonCommercial 4.0
+ - **Trained from model:** [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)

  ## Model Sources

  - **Paper:** Pre-print available soon ...

  # Uses

+ StreetCLIP has a deep understanding of the visual features found in street-level urban and rural scenes
+ and knows how to relate these concepts to specific countries, regions, and cities. Given its training setup,
+ the following use cases are recommended for StreetCLIP.

  ## Direct Use

+ StreetCLIP can be used out of the box with zero-shot learning to infer the geolocation of images at the country, region,
+ or city level. Given that StreetCLIP was pretrained on a dataset of street-level urban and rural images,
+ the best performance can be expected on images from a similar distribution; a minimal zero-shot sketch is shown below.
+
+ Broader direct use cases with significant social impact are outlined in the Recommendations section below.
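
To make the zero-shot use described above concrete, here is a minimal sketch using the Hugging Face `transformers` CLIP API. The checkpoint id `geolocal/StreetCLIP`, the image filename, the candidate country list, and the prompt template are illustrative assumptions, not values confirmed by this model card.

```python
# Minimal zero-shot country prediction sketch (not the official example).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("geolocal/StreetCLIP")          # assumed checkpoint id
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")  # assumed checkpoint id

countries = ["France", "Japan", "Brazil", "Kenya", "United States"]
prompts = [f"A street-level photo taken in {c}." for c in countries]  # illustrative template

image = Image.open("street_scene.jpg")  # replace with your own street-level image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # one probability per candidate country
print(dict(zip(countries, probs[0].tolist())))
```

The same idea extends hierarchically: once a country is predicted, a second pass over candidate regions or cities within that country can refine the estimate, mirroring the country-then-city evaluation protocol mentioned under Evaluation.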

  ## Downstream Use

+ StreetCLIP can be finetuned for any downstream application that requires geographic or street-level urban or rural
+ scene understanding.

  ## Out-of-Scope Use

+ Any use cases attempting to geolocate users' private images are out-of-scope and discouraged.

  # Bias, Risks, and Limitations

+ StreetCLIP was deliberately not trained on social media images or images of identifiable people. As such, any use case
+ attempting to geolocalize users' private images is a misuse of the model and is strongly discouraged.

  ## Recommendations
+ We encourage the community to apply StreetCLIP to applications with significant social impact, of which there are many.
+ Examples include analyzing the built environment (e.g. building quality, type, or energy efficiency classification),
+ infrastructure (e.g. road quality, utility pole maintenance, identifying damage from natural disasters), and the natural
+ environment (e.g. image segmentation, vegetation mapping and classification, tracking deforestation).

  ## How to Get Started with the Model


  ## Training Data

+ StreetCLIP was trained on an original, unreleased street-level dataset of 1.1 million real-world,
+ urban and rural images. The data used to train the model comes from 101 countries and is biased towards
+ Western countries; it does not include images from India or China.
+
+ ## Preprocessing
+
+ Same preprocessing as [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336).

  ## Training Procedure

+ StreetCLIP is initialized with OpenAI's pretrained large version of CLIP ViT and then pretrained using the synthetic
+ caption domain-specific pretraining method described in the paper corresponding to this work. StreetCLIP was trained
+ for 3 epochs using an AdamW optimizer with a learning rate of 1e-6 on 3 NVIDIA A100 80GB GPUs, a batch size of 32,
+ and gradient accumulation of 12 steps.

+ StreetCLIP was trained with the goal of matching each image in the batch
+ with the caption corresponding to the correct city, region, and country of the image's origin.
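
The sketch below shows, under stated assumptions, how such an image-caption matching (contrastive) objective could be wired up with the `transformers` CLIP API and the hyperparameters quoted above. It is not the authors' training code; the data loader and caption template are placeholders.

```python
# Illustrative pretraining loop (not the official StreetCLIP training code).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # learning rate from this card

# Placeholder: replace with a DataLoader over the geo-tagged dataset, batch size 32.
loader = []  # e.g. [(pil_images, cities, regions, countries), ...]

accumulation_steps = 12  # gradient accumulation, as stated above
model.train()
for epoch in range(3):  # 3 epochs, as stated above
    for step, (images, cities, regions, countries) in enumerate(loader):
        # Synthetic captions naming the correct city, region, and country of each image.
        captions = [f"A street-level photo taken in {c}, {r}, {k}."  # illustrative template
                    for c, r, k in zip(cities, regions, countries)]
        inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
        # return_loss=True computes CLIP's symmetric image-text contrastive loss,
        # i.e. matching each image in the batch with its own caption.
        loss = model(**inputs, return_loss=True).loss / accumulation_steps
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```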

  # Evaluation

  ### Testing Data

+ StreetCLIP was evaluated on the following two open-domain image geolocalization benchmarks.
+
  * [IM2GPS](http://graphics.cs.cmu.edu/projects/im2gps/).
  * [IM2GPS3K](https://github.com/lugiavn/revisiting-im2gps)

  ### Metrics

+ The objective of the listed benchmark datasets is to predict the images' coordinates of origin with as
+ little deviation as possible. A common metric established in prior literature is Percentage at Kilometer (% @ KM).
+ The Percentage at Kilometer metric first calculates the distance in kilometers between the predicted
+ and ground-truth coordinates and then reports the percentage of these error distances that fall below a given kilometer threshold.
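
As a concrete illustration, the sketch below computes % @ KM using the haversine distance. The threshold values shown are commonly used in the image geolocalization literature and are assumptions here, not values quoted from this model card.

```python
# Illustrative implementation of the Percentage at Kilometer (% @ KM) metric.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius of ~6371 km

def percentage_at_km(preds, truths, thresholds=(1, 25, 200, 750, 2500)):
    """preds, truths: lists of (lat, lon) pairs; returns {threshold_km: percentage}."""
    errors = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, truths)]
    return {km: 100.0 * sum(e <= km for e in errors) / len(errors) for km in thresholds}

# Toy example: one prediction about 3 km off, one about 600 km off.
print(percentage_at_km([(48.86, 2.35), (40.0, -3.7)], [(48.85, 2.39), (45.4, -3.7)]))
```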

  ## Results

  - **Hardware Type:** 4 NVIDIA A100 GPUs
  - **Hours used:** 12

  # Citation

  Preprint available soon ...