xplato

Commit af9bb0b (0 parents) · committed by xplato and geolocal

Duplicate from geolocal/StreetCLIP

Co-authored-by: Lukas Haas <geolocal@users.noreply.huggingface.co>
.gitattributes ADDED
@@ -0,0 +1,34 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,240 @@
---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: zero-shot-image-classification
widget:
- src: https://huggingface.co/geolocal/StreetCLIP/resolve/main/nagasaki.jpg
  candidate_labels: China, South Korea, Japan, Philippines, Taiwan, Vietnam, Cambodia
  example_title: Countries
- src: https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg
  candidate_labels: San Jose, San Diego, Los Angeles, Las Vegas, San Francisco, Seattle
  example_title: Cities
library_name: transformers
tags:
- geolocalization
- geolocation
- geographic
- street
- climate
- clip
- urban
- rural
- multi-modal
- geoguessr
---
# Model Card for StreetCLIP

StreetCLIP is a robust foundation model for open-domain image geolocalization and other
geographic and climate-related tasks.

Trained on an original dataset of 1.1 million street-level urban and rural geo-tagged images, it achieves
state-of-the-art zero-shot performance on multiple open-domain image geolocalization benchmarks,
outperforming supervised models trained on millions of images.

# Model Description

StreetCLIP is pretrained by deriving image captions synthetically from image class labels using
a domain-specific caption template. This allows StreetCLIP to transfer its generalized zero-shot learning
capabilities to a specific domain (i.e., image geolocalization).
StreetCLIP builds on OpenAI's pretrained large version of CLIP ViT, using 14x14 pixel
patches and images with a 336 pixel side length.

## Model Details

- **Model type:** [CLIP](https://openai.com/blog/clip/)
- **Language:** English
- **License:** Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
- **Trained from model:** [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)

## Model Sources

- **Paper:** [Preprint](https://arxiv.org/abs/2302.00275)
- **Cite preprint as:**
```bibtex
@misc{haas2023learning,
  title={Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization},
  author={Lukas Haas and Silas Alberti and Michal Skreta},
  year={2023},
  eprint={2302.00275},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

# Uses

StreetCLIP has a deep understanding of the visual features found in street-level urban and rural scenes
and knows how to relate these concepts to specific countries, regions, and cities. Given its training setup,
the following use cases are recommended for StreetCLIP.

## Direct Use

StreetCLIP can be used out of the box with zero-shot learning to infer the geolocation of images at the country, region,
or city level. Because StreetCLIP was pretrained on a dataset of street-level urban and rural images,
the best performance can be expected on images from a similar distribution.

Broader direct use cases are any zero-shot image classification tasks that rely on street-level urban and rural
understanding or on geographic knowledge relating visual clues to their region of origin.
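
For example, country-level geolocalization can be framed as zero-shot classification over a list of candidate countries. The snippet below is a minimal sketch: the candidate countries and the prompt wording are illustrative choices for this example, not the exact prompts used in the paper.

```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

url = "https://huggingface.co/geolocal/StreetCLIP/resolve/main/nagasaki.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Example candidate set; any list of countries works. The prompt wording below
# is illustrative and not necessarily the template used during training.
countries = ["China", "South Korea", "Japan", "Philippines", "Taiwan", "Vietnam", "Cambodia"]
prompts = [f"A street-level photo taken in {c}." for c in countries]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(countries[probs.argmax(dim=1).item()])
```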

## Downstream Use

StreetCLIP can be finetuned for any downstream application that requires geographic or street-level urban or rural
scene understanding. Examples of use cases include the following:

**Understanding the Built Environment**

- Analyzing building quality
- Building type classification
- Building energy efficiency classification

**Analyzing Infrastructure**

- Analyzing road quality
- Utility pole maintenance
- Identifying damage from natural disasters or armed conflicts

**Understanding the Natural Environment**

- Mapping vegetation
- Vegetation classification
- Soil type classification
- Tracking deforestation

**General Use Cases**

- Street-level image segmentation
- Urban and rural scene classification
- Object detection in urban or rural environments
- Improving navigation and self-driving car technology

## Out-of-Scope Use

Any use cases attempting to geolocate users' private images are out of scope and discouraged.

# Bias, Risks, and Limitations

StreetCLIP was deliberately not trained on social media images or images of identifiable people. As such, any use case
attempting to geolocalize users' private images is out of scope and discouraged.

## Recommendations
We encourage the community to apply StreetCLIP to applications with significant social impact, of which there are many.
The first three categories under Downstream Use list potential use cases with social impact
to explore.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

url = "https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

choices = ["San Jose", "San Diego", "Los Angeles", "Las Vegas", "San Francisco"]
inputs = processor(text=choices, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over labels gives per-label probabilities
```
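
To turn the label probabilities into a single prediction, take the highest-scoring candidate (a small follow-up to the snippet above, not part of the original example):

```python
predicted_city = choices[probs.argmax(dim=1).item()]
print(f"Predicted city: {predicted_city}")
```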

# Training Details

## Training Data

StreetCLIP was trained on an original, unreleased street-level dataset of 1.1 million real-world
urban and rural images. The training data comes from 101 countries, is biased towards
Western countries, and does not include India or China.

## Preprocessing

Same preprocessing as [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336).

## Training Procedure

StreetCLIP is initialized with OpenAI's pretrained large version of CLIP ViT and then pretrained using the synthetic
caption domain-specific pretraining method described in the paper corresponding to this work. StreetCLIP was trained
for 3 epochs using an AdamW optimizer with a learning rate of 1e-6 on 3 NVIDIA A100 80GB GPUs, a batch size of 32,
and gradient accumulation over 12 steps.

StreetCLIP was trained with the objective of matching the images in a batch
with the caption corresponding to the correct city, region, and country of each image's origin.
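
As an illustration only, such a synthetic caption might be assembled from the location labels roughly as follows; the exact domain-specific template is specified in the paper, and the wording below is a placeholder:

```python
def synthetic_caption(city: str, region: str, country: str) -> str:
    # Placeholder wording; the actual caption template used for pretraining
    # is defined in the StreetCLIP paper.
    return f"A street-level photo from {city}, {region}, {country}."

print(synthetic_caption("Nagasaki", "Nagasaki Prefecture", "Japan"))
```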

# Evaluation

StreetCLIP was evaluated zero-shot on two open-domain image geolocalization benchmarks using a
technique called hierarchical linear probing. Hierarchical linear probing sequentially attempts to
identify the correct country and then the correct city of geographical image origin.
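
A minimal sketch of this two-stage procedure is shown below. It reuses the `model`, `processor`, and `image` objects from the Getting Started snippet; the candidate country and city lists are hypothetical stand-ins for the much larger label sets used in the actual evaluation.

```python
import torch

def rank_labels(model, processor, image, labels):
    # Score each candidate label against the image and return the labels best-first.
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]
    return [labels[int(i)] for i in logits.argsort(descending=True)]

# Hypothetical candidate sets, for illustration only.
countries = ["Japan", "United States", "Brazil"]
cities = {
    "Japan": ["Tokyo", "Nagasaki", "Osaka"],
    "United States": ["San Francisco", "New York", "Chicago"],
    "Brazil": ["Rio de Janeiro", "São Paulo", "Salvador"],
}

country = rank_labels(model, processor, image, countries)[0]     # stage 1: country
city = rank_labels(model, processor, image, cities[country])[0]  # stage 2: city within that country
```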

## Testing Data and Metrics

### Testing Data

StreetCLIP was evaluated on the following two open-domain image geolocalization benchmarks.

* [IM2GPS](http://graphics.cs.cmu.edu/projects/im2gps/)
* [IM2GPS3K](https://github.com/lugiavn/revisiting-im2gps)

### Metrics

The objective on these benchmarks is to predict the images' coordinates of origin with as
little deviation as possible. A common metric set forth in prior literature is Percentage at Kilometer (% @ KM).
It first computes the distance in kilometers between the predicted coordinates
and the ground-truth coordinates and then reports the percentage of error distances below a given kilometer threshold.
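
A sketch of how % @ KM can be computed is shown below, assuming great-circle (haversine) distance between coordinate pairs; the function names are illustrative, not taken from the benchmark code:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def percentage_at_km(predictions, ground_truths, threshold_km):
    # Percentage of predictions whose error distance is below the kilometer threshold.
    hits = sum(
        haversine_km(p_lat, p_lon, t_lat, t_lon) <= threshold_km
        for (p_lat, p_lon), (t_lat, t_lon) in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(predictions)

# Example with two predictions, using one of the thresholds from the tables below (25 km).
preds = [(37.77, -122.42), (35.68, 139.69)]
truths = [(37.80, -122.27), (34.69, 135.50)]
print(percentage_at_km(preds, truths, 25))
```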

## Results

**IM2GPS**
| Model | 25km | 200km | 750km | 2,500km |
|----------|:-------------:|:------:|:------:|:------:|
| PlaNet (2016) | 24.5 | 37.6 | 53.6 | 71.3 |
| ISNs (2018) | 43.0 | 51.9 | 66.7 | 80.2 |
| TransLocator (2022) | **48.1** | **64.6** | **75.6** | 86.7 |
| **Zero-Shot CLIP (ours)** | 27.0 | 42.2 | 71.7 | 86.9 |
| **Zero-Shot StreetCLIP (ours)** | 28.3 | 45.1 | 74.7 | **88.2** |
Metric: Percentage at Kilometer (% @ KM)

**IM2GPS3K**
| Model | 25km | 200km | 750km | 2,500km |
|----------|:-------------:|:------:|:------:|:------:|
| PlaNet (2016) | 24.8 | 34.3 | 48.4 | 64.6 |
| ISNs (2018) | 28.0 | 36.6 | 49.7 | 66.0 |
| TransLocator (2022) | **31.1** | **46.7** | 58.9 | 80.1 |
| **Zero-Shot CLIP (ours)** | 19.5 | 34.0 | 60.0 | 78.1 |
| **Zero-Shot StreetCLIP (ours)** | 22.4 | 37.4 | **61.3** | **80.4** |
Metric: Percentage at Kilometer (% @ KM)


### Summary

Our experiments demonstrate that our synthetic caption pretraining method significantly
improves CLIP's generalized zero-shot capabilities for open-domain image geolocalization while
achieving state-of-the-art performance on a selection of benchmark metrics.

# Environmental Impact

- **Hardware Type:** 4 NVIDIA A100 GPUs
- **Hours used:** 12

# Citation

Cite preprint as:

```bibtex
@misc{haas2023learning,
  title={Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization},
  author={Lukas Haas and Silas Alberti and Michal Skreta},
  year={2023},
  eprint={2302.00275},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
config.json ADDED
@@ -0,0 +1,184 @@
{
  "_commit_hash": "ce19dc912ca5cd21c8a653c79e251e808ccabcd1",
  "_name_or_path": "openai/clip-vit-large-patch14-336",
  "architectures": [
    "CLIPModel"
  ],
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 768,
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 77,
    "min_length": 0,
    "model_type": "clip_text_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 1,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 768,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.23.1",
    "typical_p": 1.0,
    "use_bfloat16": false,
    "vocab_size": 49408
  },
  "text_config_dict": {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "projection_dim": 768
  },
  "torch_dtype": "float32",
  "transformers_version": null,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 336,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 768,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.23.1",
    "typical_p": 1.0,
    "use_bfloat16": false
  },
  "vision_config_dict": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768
  }
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
nagasaki.jpg ADDED
preprocessor_config.json ADDED
@@ -0,0 +1,21 @@
{
  "crop_size": 336,
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "CLIPFeatureExtractor",
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "processor_class": "CLIPProcessor",
  "resample": 3,
  "size": 336
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cf6dc3802a8bf9301560b1aa0cd1fa983b4139f96a1befc43802e387401fe6c0
size 1711981793
sanfrancisco.jpeg ADDED
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
{
  "bos_token": {
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|endoftext|>",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,35 @@
{
  "add_prefix_space": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "do_lower_case": true,
  "eos_token": {
    "__type": "AddedToken",
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "errors": "replace",
  "model_max_length": 77,
  "name_or_path": "openai/clip-vit-large-patch14-336",
  "pad_token": "<|endoftext|>",
  "processor_class": "CLIPProcessor",
  "special_tokens_map_file": "/home/suraj/.cache/huggingface/transformers/18a566598f286c9139f88160c99f84eec492a26bd22738fa9cb44d5b7e0a5c76.cce1206abbad28826f000510f22f354e53e66a97f7c23745a7dfe27609cc07f5",
  "tokenizer_class": "CLIPTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff