xplato

Commit af9bb0b (0 parents) · committed by xplato and geolocal

Duplicate from geolocal/StreetCLIP

Co-authored-by: Lukas Haas <geolocal@users.noreply.huggingface.co>
.gitattributes ADDED
@@ -0,0 +1,34 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,240 @@
---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: zero-shot-image-classification
widget:
- src: https://huggingface.co/geolocal/StreetCLIP/resolve/main/nagasaki.jpg
  candidate_labels: China, South Korea, Japan, Philippines, Taiwan, Vietnam, Cambodia
  example_title: Countries
- src: https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg
  candidate_labels: San Jose, San Diego, Los Angeles, Las Vegas, San Francisco, Seattle
  example_title: Cities
library_name: transformers
tags:
- geolocalization
- geolocation
- geographic
- street
- climate
- clip
- urban
- rural
- multi-modal
- geoguessr
---
# Model Card for StreetCLIP

StreetCLIP is a robust foundation model for open-domain image geolocalization and other
geographic and climate-related tasks.

Trained on an original dataset of 1.1 million street-level urban and rural geo-tagged images, it achieves
state-of-the-art zero-shot performance on multiple open-domain image geolocalization benchmarks,
outperforming supervised models trained on millions of images.

# Model Description

StreetCLIP is pretrained by deriving image captions synthetically from image class labels using
a domain-specific caption template. This allows StreetCLIP to transfer its generalized zero-shot learning
capabilities to a specific domain (i.e., image geolocalization).
StreetCLIP builds on OpenAI's pretrained large version of CLIP ViT, using 14x14 pixel
patches and images with a 336 pixel side length.

## Model Details

- **Model type:** [CLIP](https://openai.com/blog/clip/)
- **Language:** English
- **License:** Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
- **Trained from model:** [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)

## Model Sources

- **Paper:** [Preprint](https://arxiv.org/abs/2302.00275)
- **Cite preprint as:**
```bibtex
@misc{haas2023learning,
  title={Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization},
  author={Lukas Haas and Silas Alberti and Michal Skreta},
  year={2023},
  eprint={2302.00275},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

# Uses

StreetCLIP has a deep understanding of the visual features found in street-level urban and rural scenes
and knows how to relate these concepts to specific countries, regions, and cities. Given its training setup,
the following use cases are recommended for StreetCLIP.

## Direct Use

StreetCLIP can be used out of the box with zero-shot learning to infer the geolocation of images at the country, region,
or city level. Because StreetCLIP was pretrained on a dataset of street-level urban and rural images,
the best performance can be expected on images from a similar distribution.

Broader direct use cases are any zero-shot image classification tasks that rely on street-level urban and rural
understanding or on geographic knowledge relating visual clues to their region of origin.
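
For example, country-level geolocalization can be framed as zero-shot classification over a list of candidate countries. The snippet below is a minimal sketch: the candidate countries and the prompt wording are illustrative choices for this example, not the exact prompts used in the paper.

```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

url = "https://huggingface.co/geolocal/StreetCLIP/resolve/main/nagasaki.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Example candidate set; any list of countries works. The prompt wording below
# is illustrative and not necessarily the template used during training.
countries = ["China", "South Korea", "Japan", "Philippines", "Taiwan", "Vietnam", "Cambodia"]
prompts = [f"A street-level photo taken in {c}." for c in countries]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(countries[probs.argmax(dim=1).item()])
```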

## Downstream Use

StreetCLIP can be finetuned for any downstream application that requires geographic or street-level urban or rural
scene understanding. Examples of use cases include the following:

**Understanding the Built Environment**

- Analyzing building quality
- Building type classification
- Building energy efficiency classification

**Analyzing Infrastructure**

- Analyzing road quality
- Utility pole maintenance
- Identifying damage from natural disasters or armed conflicts

**Understanding the Natural Environment**

- Mapping vegetation
- Vegetation classification
- Soil type classification
- Tracking deforestation

**General Use Cases**

- Street-level image segmentation
- Urban and rural scene classification
- Object detection in urban or rural environments
- Improving navigation and self-driving car technology

## Out-of-Scope Use

Any use cases attempting to geolocate users' private images are out of scope and discouraged.

# Bias, Risks, and Limitations

StreetCLIP was deliberately not trained on social media images or images of identifiable people. As such, any use case
attempting to geolocalize users' private images is out of scope and discouraged.

## Recommendations
We encourage the community to apply StreetCLIP to applications with significant social impact, of which there are many.
The first three categories under Downstream Use list potential use cases with social impact
to explore.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

url = "https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

choices = ["San Jose", "San Diego", "Los Angeles", "Las Vegas", "San Francisco"]
inputs = processor(text=choices, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over labels gives per-label probabilities
```
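
To turn the label probabilities into a single prediction, take the highest-scoring candidate (a small follow-up to the snippet above, not part of the original example):

```python
predicted_city = choices[probs.argmax(dim=1).item()]
print(f"Predicted city: {predicted_city}")
```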

# Training Details

## Training Data

StreetCLIP was trained on an original, unreleased street-level dataset of 1.1 million real-world
urban and rural images. The training data comes from 101 countries, is biased towards
Western countries, and does not include India or China.

## Preprocessing

Same preprocessing as [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336).

## Training Procedure

StreetCLIP is initialized with OpenAI's pretrained large version of CLIP ViT and then pretrained using the synthetic
caption domain-specific pretraining method described in the paper corresponding to this work. StreetCLIP was trained
for 3 epochs using an AdamW optimizer with a learning rate of 1e-6 on 3 NVIDIA A100 80GB GPUs, a batch size of 32,
and gradient accumulation over 12 steps.

StreetCLIP was trained with the objective of matching the images in a batch
with the caption corresponding to the correct city, region, and country of each image's origin.
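
As an illustration only, such a synthetic caption might be assembled from the location labels roughly as follows; the exact domain-specific template is specified in the paper, and the wording below is a placeholder:

```python
def synthetic_caption(city: str, region: str, country: str) -> str:
    # Placeholder wording; the actual caption template used for pretraining
    # is defined in the StreetCLIP paper.
    return f"A street-level photo from {city}, {region}, {country}."

print(synthetic_caption("Nagasaki", "Nagasaki Prefecture", "Japan"))
```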

# Evaluation

StreetCLIP was evaluated zero-shot on two open-domain image geolocalization benchmarks using a
technique called hierarchical linear probing. Hierarchical linear probing sequentially attempts to
identify the correct country and then the correct city of geographical image origin.
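
A minimal sketch of this two-stage procedure is shown below. It reuses the `model`, `processor`, and `image` objects from the Getting Started snippet; the candidate country and city lists are hypothetical stand-ins for the much larger label sets used in the actual evaluation.

```python
import torch

def rank_labels(model, processor, image, labels):
    # Score each candidate label against the image and return the labels best-first.
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]
    return [labels[int(i)] for i in logits.argsort(descending=True)]

# Hypothetical candidate sets, for illustration only.
countries = ["Japan", "United States", "Brazil"]
cities = {
    "Japan": ["Tokyo", "Nagasaki", "Osaka"],
    "United States": ["San Francisco", "New York", "Chicago"],
    "Brazil": ["Rio de Janeiro", "São Paulo", "Salvador"],
}

country = rank_labels(model, processor, image, countries)[0]     # stage 1: country
city = rank_labels(model, processor, image, cities[country])[0]  # stage 2: city within that country
```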

## Testing Data and Metrics

### Testing Data

StreetCLIP was evaluated on the following two open-domain image geolocalization benchmarks.

* [IM2GPS](http://graphics.cs.cmu.edu/projects/im2gps/)
* [IM2GPS3K](https://github.com/lugiavn/revisiting-im2gps)

### Metrics

The objective on these benchmarks is to predict the images' coordinates of origin with as
little deviation as possible. A common metric set forth in prior literature is Percentage at Kilometer (% @ KM).
It first computes the distance in kilometers between the predicted coordinates
and the ground-truth coordinates and then reports the percentage of error distances below a given kilometer threshold.
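
A sketch of how % @ KM can be computed is shown below, assuming great-circle (haversine) distance between coordinate pairs; the function names are illustrative, not taken from the benchmark code:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def percentage_at_km(predictions, ground_truths, threshold_km):
    # Percentage of predictions whose error distance is below the kilometer threshold.
    hits = sum(
        haversine_km(p_lat, p_lon, t_lat, t_lon) <= threshold_km
        for (p_lat, p_lon), (t_lat, t_lon) in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(predictions)

# Example with two predictions, using one of the thresholds from the tables below (25 km).
preds = [(37.77, -122.42), (35.68, 139.69)]
truths = [(37.80, -122.27), (34.69, 135.50)]
print(percentage_at_km(preds, truths, 25))
```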

## Results

**IM2GPS**
| Model | 25km | 200km | 750km | 2,500km |
|----------|:-------------:|:------:|:------:|:------:|
| PlaNet (2016) | 24.5 | 37.6 | 53.6 | 71.3 |
| ISNs (2018) | 43.0 | 51.9 | 66.7 | 80.2 |
| TransLocator (2022) | **48.1** | **64.6** | **75.6** | 86.7 |
| **Zero-Shot CLIP (ours)** | 27.0 | 42.2 | 71.7 | 86.9 |
| **Zero-Shot StreetCLIP (ours)** | 28.3 | 45.1 | 74.7 | **88.2** |
Metric: Percentage at Kilometer (% @ KM)

**IM2GPS3K**
| Model | 25km | 200km | 750km | 2,500km |
|----------|:-------------:|:------:|:------:|:------:|
| PlaNet (2016) | 24.8 | 34.3 | 48.4 | 64.6 |
| ISNs (2018) | 28.0 | 36.6 | 49.7 | 66.0 |
| TransLocator (2022) | **31.1** | **46.7** | 58.9 | 80.1 |
| **Zero-Shot CLIP (ours)** | 19.5 | 34.0 | 60.0 | 78.1 |
| **Zero-Shot StreetCLIP (ours)** | 22.4 | 37.4 | **61.3** | **80.4** |
Metric: Percentage at Kilometer (% @ KM)


### Summary

Our experiments demonstrate that our synthetic caption pretraining method significantly
improves CLIP's generalized zero-shot capabilities for open-domain image geolocalization while
achieving state-of-the-art performance on a selection of benchmark metrics.

# Environmental Impact

- **Hardware Type:** 4 NVIDIA A100 GPUs
- **Hours used:** 12

# Citation

Cite preprint as:

```bibtex
@misc{haas2023learning,
  title={Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization},
  author={Lukas Haas and Silas Alberti and Michal Skreta},
  year={2023},
  eprint={2302.00275},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
config.json ADDED
@@ -0,0 +1,184 @@
{
  "_commit_hash": "ce19dc912ca5cd21c8a653c79e251e808ccabcd1",
  "_name_or_path": "openai/clip-vit-large-patch14-336",
  "architectures": [
    "CLIPModel"
  ],
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 768,
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 77,
    "min_length": 0,
    "model_type": "clip_text_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 1,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 768,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.23.1",
    "typical_p": 1.0,
    "use_bfloat16": false,
    "vocab_size": 49408
  },
  "text_config_dict": {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "projection_dim": 768
  },
  "torch_dtype": "float32",
  "transformers_version": null,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 336,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 768,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.23.1",
    "typical_p": 1.0,
    "use_bfloat16": false
  },
  "vision_config_dict": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768
  }
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
nagasaki.jpg ADDED
preprocessor_config.json ADDED
@@ -0,0 +1,21 @@
{
  "crop_size": 336,
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "CLIPFeatureExtractor",
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "processor_class": "CLIPProcessor",
  "resample": 3,
  "size": 336
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cf6dc3802a8bf9301560b1aa0cd1fa983b4139f96a1befc43802e387401fe6c0
size 1711981793
sanfrancisco.jpeg ADDED
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
{
  "bos_token": {
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|endoftext|>",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,35 @@
{
  "add_prefix_space": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "do_lower_case": true,
  "eos_token": {
    "__type": "AddedToken",
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "errors": "replace",
  "model_max_length": 77,
  "name_or_path": "openai/clip-vit-large-patch14-336",
  "pad_token": "<|endoftext|>",
  "processor_class": "CLIPProcessor",
  "special_tokens_map_file": "/home/suraj/.cache/huggingface/transformers/18a566598f286c9139f88160c99f84eec492a26bd22738fa9cb44d5b7e0a5c76.cce1206abbad28826f000510f22f354e53e66a97f7c23745a7dfe27609cc07f5",
  "tokenizer_class": "CLIPTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff