turn-the-cam-anonymous committed on
Commit
1ed7deb
1 Parent(s): 51c5412

adding CLIP taming

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. CLIP +0 -1
  2. CLIP/CLIP.png +0 -0
  3. CLIP/LICENSE +22 -0
  4. CLIP/MANIFEST.in +1 -0
  5. CLIP/README.md +199 -0
  6. CLIP/clip.egg-info/PKG-INFO +6 -0
  7. CLIP/clip.egg-info/SOURCES.txt +14 -0
  8. CLIP/clip.egg-info/dependency_links.txt +1 -0
  9. CLIP/clip.egg-info/requires.txt +8 -0
  10. CLIP/clip.egg-info/top_level.txt +1 -0
  11. CLIP/clip/__init__.py +1 -0
  12. CLIP/clip/__pycache__/__init__.cpython-39.pyc +0 -0
  13. CLIP/clip/__pycache__/clip.cpython-39.pyc +0 -0
  14. CLIP/clip/__pycache__/model.cpython-39.pyc +0 -0
  15. CLIP/clip/__pycache__/simple_tokenizer.cpython-39.pyc +0 -0
  16. CLIP/clip/bpe_simple_vocab_16e6.txt.gz +3 -0
  17. CLIP/clip/clip.py +237 -0
  18. CLIP/clip/model.py +436 -0
  19. CLIP/clip/simple_tokenizer.py +132 -0
  20. CLIP/data/country211.md +12 -0
  21. CLIP/data/prompts.md +3401 -0
  22. CLIP/data/rendered-sst2.md +11 -0
  23. CLIP/data/yfcc100m.md +14 -0
  24. CLIP/hubconf.py +42 -0
  25. CLIP/model-card.md +120 -0
  26. CLIP/requirements.txt +5 -0
  27. CLIP/setup.py +21 -0
  28. CLIP/tests/test_consistency.py +25 -0
  29. taming-transformers +0 -1
  30. taming-transformers/.gitignore +1 -0
  31. taming-transformers/License.txt +19 -0
  32. taming-transformers/configs/coco_cond_stage.yaml +49 -0
  33. taming-transformers/configs/coco_scene_images_transformer.yaml +80 -0
  34. taming-transformers/configs/custom_vqgan.yaml +43 -0
  35. taming-transformers/configs/drin_transformer.yaml +77 -0
  36. taming-transformers/configs/faceshq_transformer.yaml +61 -0
  37. taming-transformers/configs/faceshq_vqgan.yaml +42 -0
  38. taming-transformers/configs/imagenet_vqgan.yaml +42 -0
  39. taming-transformers/configs/imagenetdepth_vqgan.yaml +41 -0
  40. taming-transformers/configs/open_images_scene_images_transformer.yaml +86 -0
  41. taming-transformers/configs/sflckr_cond_stage.yaml +43 -0
  42. taming-transformers/data/ade20k_examples.txt +30 -0
  43. taming-transformers/data/ade20k_images/ADE_val_00000123.jpg +0 -0
  44. taming-transformers/data/ade20k_images/ADE_val_00000125.jpg +0 -0
  45. taming-transformers/data/ade20k_images/ADE_val_00000126.jpg +0 -0
  46. taming-transformers/data/ade20k_images/ADE_val_00000203.jpg +0 -0
  47. taming-transformers/data/ade20k_images/ADE_val_00000262.jpg +0 -0
  48. taming-transformers/data/ade20k_images/ADE_val_00000287.jpg +0 -0
  49. taming-transformers/data/ade20k_images/ADE_val_00000289.jpg +0 -0
  50. taming-transformers/data/ade20k_images/ADE_val_00000303.jpg +0 -0
CLIP DELETED
@@ -1 +0,0 @@
- Subproject commit a9b1bf5920416aaeaec965c25dd9e8f98c864f16
CLIP/CLIP.png ADDED
CLIP/LICENSE ADDED
@@ -0,0 +1,22 @@
+ MIT License
+
+ Copyright (c) 2021 OpenAI
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
CLIP/MANIFEST.in ADDED
@@ -0,0 +1 @@
+ include clip/bpe_simple_vocab_16e6.txt.gz
CLIP/README.md ADDED
@@ -0,0 +1,199 @@
+ # CLIP
+
+ [[Blog]](https://openai.com/blog/clip/) [[Paper]](https://arxiv.org/abs/2103.00020) [[Model Card]](model-card.md) [[Colab]](https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb)
+
+ CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.
+
+
+
+ ## Approach
+
+ ![CLIP](CLIP.png)
+
+
+
+ ## Usage
+
+ First, [install PyTorch 1.7.1](https://pytorch.org/get-started/locally/) (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:
+
+ ```bash
+ $ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
+ $ pip install ftfy regex tqdm
+ $ pip install git+https://github.com/openai/CLIP.git
+ ```
+
+ Replace `cudatoolkit=11.0` above with the appropriate CUDA version on your machine or `cpuonly` when installing on a machine without a GPU.
+
+ ```python
+ import torch
+ import clip
+ from PIL import Image
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model, preprocess = clip.load("ViT-B/32", device=device)
+
+ image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
+ text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
+
+ with torch.no_grad():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+
+     logits_per_image, logits_per_text = model(image, text)
+     probs = logits_per_image.softmax(dim=-1).cpu().numpy()
+
+ print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]
+ ```
+
+
+ ## API
+
+ The CLIP module `clip` provides the following methods:
+
+ #### `clip.available_models()`
+
+ Returns the names of the available CLIP models.
+
+ #### `clip.load(name, device=..., jit=False)`
+
+ Returns the model and the TorchVision transform needed by the model, specified by the model name returned by `clip.available_models()`. It will download the model as necessary. The `name` argument can also be a path to a local checkpoint.
+
+ The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, otherwise the CPU. When `jit` is `False`, a non-JIT version of the model will be loaded.
+
+ #### `clip.tokenize(text: Union[str, List[str]], context_length=77)`
+
+ Returns a LongTensor containing tokenized sequences of given text input(s). This can be used as the input to the model
+
+ ---
+
+ The model returned by `clip.load()` supports the following methods:
+
+ #### `model.encode_image(image: Tensor)`
+
+ Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.
+
+ #### `model.encode_text(text: Tensor)`
+
+ Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.
+
+ #### `model(image: Tensor, text: Tensor)`
+
+ Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100.
+
+
+
+ ## More Examples
+
+ ### Zero-Shot Prediction
+
+ The code below performs zero-shot prediction using CLIP, as shown in Appendix B in the paper. This example takes an image from the [CIFAR-100 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), and predicts the most likely labels among the 100 textual labels from the dataset.
+
+ ```python
+ import os
+ import clip
+ import torch
+ from torchvision.datasets import CIFAR100
+
+ # Load the model
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model, preprocess = clip.load('ViT-B/32', device)
+
+ # Download the dataset
+ cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
+
+ # Prepare the inputs
+ image, class_id = cifar100[3637]
+ image_input = preprocess(image).unsqueeze(0).to(device)
+ text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)
+
+ # Calculate features
+ with torch.no_grad():
+     image_features = model.encode_image(image_input)
+     text_features = model.encode_text(text_inputs)
+
+ # Pick the top 5 most similar labels for the image
+ image_features /= image_features.norm(dim=-1, keepdim=True)
+ text_features /= text_features.norm(dim=-1, keepdim=True)
+ similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+ values, indices = similarity[0].topk(5)
+
+ # Print the result
+ print("\nTop predictions:\n")
+ for value, index in zip(values, indices):
+     print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
+ ```
+
+ The output will look like the following (the exact numbers may be slightly different depending on the compute device):
+
+ ```
+ Top predictions:
+
+            snake: 65.31%
+           turtle: 12.29%
+     sweet_pepper: 3.83%
+           lizard: 1.88%
+        crocodile: 1.75%
+ ```
+
+ Note that this example uses the `encode_image()` and `encode_text()` methods that return the encoded features of given inputs.
+
+
+ ### Linear-probe evaluation
+
+ The example below uses [scikit-learn](https://scikit-learn.org/) to perform logistic regression on image features.
+
+ ```python
+ import os
+ import clip
+ import torch
+
+ import numpy as np
+ from sklearn.linear_model import LogisticRegression
+ from torch.utils.data import DataLoader
+ from torchvision.datasets import CIFAR100
+ from tqdm import tqdm
+
+ # Load the model
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model, preprocess = clip.load('ViT-B/32', device)
+
+ # Load the dataset
+ root = os.path.expanduser("~/.cache")
+ train = CIFAR100(root, download=True, train=True, transform=preprocess)
+ test = CIFAR100(root, download=True, train=False, transform=preprocess)
+
+
+ def get_features(dataset):
+     all_features = []
+     all_labels = []
+
+     with torch.no_grad():
+         for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
+             features = model.encode_image(images.to(device))
+
+             all_features.append(features)
+             all_labels.append(labels)
+
+     return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()
+
+ # Calculate the image features
+ train_features, train_labels = get_features(train)
+ test_features, test_labels = get_features(test)
+
+ # Perform logistic regression
+ classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
+ classifier.fit(train_features, train_labels)
+
+ # Evaluate using the logistic regression classifier
+ predictions = classifier.predict(test_features)
+ accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
+ print(f"Accuracy = {accuracy:.3f}")
+ ```
+
+ Note that the `C` value should be determined via a hyperparameter sweep using a validation split.
+
+
+ ## See Also
+
+ * [OpenCLIP](https://github.com/mlfoundations/open_clip): includes larger and independently trained CLIP models up to ViT-G/14
+ * [Hugging Face implementation of CLIP](https://huggingface.co/docs/transformers/model_doc/clip): for easier integration with the HF ecosystem
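
The README added above states that the logits returned by `model(image, text)` are cosine similarities times 100, and its examples turn them into label probabilities with a softmax. The sketch below walks through that arithmetic in plain Python so it runs without PyTorch. The three-dimensional feature vectors and the helper names (`l2_normalize`, `cosine_logits`, `softmax`) are made up for illustration and are not part of the `clip` package.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product becomes a cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_logits(image_vec, text_vecs, scale=100.0):
    """Cosine similarity between one image vector and each text vector, times `scale`."""
    img = l2_normalize(image_vec)
    return [scale * sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_vecs]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy features (assumption): one image vector and three candidate text vectors.
image_feature = [0.9, 0.1, 0.0]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

logits = cosine_logits(image_feature, text_features)
probs = softmax(logits)
print("logits:", logits)
print("probs:", probs)
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity produces a sharply peaked distribution, which is consistent with the near-one-hot probabilities shown in the README's usage example.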
CLIP/clip.egg-info/PKG-INFO ADDED
@@ -0,0 +1,6 @@
+ Metadata-Version: 2.1
+ Name: clip
+ Version: 1.0
+ Author: OpenAI
+ Provides-Extra: dev
+ License-File: LICENSE
CLIP/clip.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,14 @@
+ LICENSE
+ MANIFEST.in
+ README.md
+ setup.py
+ clip/__init__.py
+ clip/bpe_simple_vocab_16e6.txt.gz
+ clip/clip.py
+ clip/model.py
+ clip/simple_tokenizer.py
+ clip.egg-info/PKG-INFO
+ clip.egg-info/SOURCES.txt
+ clip.egg-info/dependency_links.txt
+ clip.egg-info/requires.txt
+ clip.egg-info/top_level.txt
CLIP/clip.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
CLIP/clip.egg-info/requires.txt ADDED
@@ -0,0 +1,8 @@
+ ftfy
+ regex
+ tqdm
+ torch
+ torchvision
+
+ [dev]
+ pytest
CLIP/clip.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ clip
CLIP/clip/__init__.py ADDED
@@ -0,0 +1 @@
+ from .clip import *
CLIP/clip/__pycache__/__init__.cpython-39.pyc ADDED
Binary file (201 Bytes).
CLIP/clip/__pycache__/clip.cpython-39.pyc ADDED
Binary file (8.81 kB).
CLIP/clip/__pycache__/model.cpython-39.pyc ADDED
Binary file (15 kB).
CLIP/clip/__pycache__/simple_tokenizer.cpython-39.pyc ADDED
Binary file (5.79 kB).
CLIP/clip/bpe_simple_vocab_16e6.txt.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
+ size 1356917
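
The three lines added for `bpe_simple_vocab_16e6.txt.gz` are a Git LFS pointer: the repository stores the spec version, a SHA-256 object ID, and the byte size, while the vocabulary file itself lives in LFS storage. A minimal sketch of how such a pointer could be parsed and checked against downloaded bytes is shown below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the `b"hello"` payload is only a stand-in, since the real vocabulary file is not part of this view.

```python
import hashlib


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True when the bytes have the size and SHA-256 digest the pointer records."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# The pointer committed for CLIP/clip/bpe_simple_vocab_16e6.txt.gz.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
size 1356917
"""
pointer = parse_lfs_pointer(pointer_text)

# Stand-in payload (assumption) with a pointer built to match it, to exercise the check.
payload = b"hello"
payload_pointer = {
    "version": pointer["version"],
    "algo": "sha256",
    "digest": hashlib.sha256(payload).hexdigest(),
    "size": len(payload),
}
print(pointer["size"], matches_pointer(payload, payload_pointer))
```

Checking the size before hashing is a cheap first filter; the digest comparison is what actually establishes that the content matches.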
CLIP/clip/clip.py ADDED
@@ -0,0 +1,237 @@
+ import hashlib
+ import os
+ import urllib
+ import warnings
+ from typing import Any, Union, List
+ from pkg_resources import packaging
+
+ import torch
+ from PIL import Image
+ from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
+ from tqdm import tqdm
+
+ from .model import build_model
+ from .simple_tokenizer import SimpleTokenizer as _Tokenizer
+
+ try:
+     from torchvision.transforms import InterpolationMode
+     BICUBIC = InterpolationMode.BICUBIC
+ except ImportError:
+     BICUBIC = Image.BICUBIC
+
+
+ if packaging.version.parse(torch.__version__) < packaging.version.parse("1.7.1"):
+     warnings.warn("PyTorch version 1.7.1 or higher is recommended")
+
+
+ __all__ = ["available_models", "load", "tokenize"]
+ _tokenizer = _Tokenizer()
+
+ _MODELS = {
+     "RN50": "https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt",
+     "RN101": "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt",
+     "RN50x4": "https://openaipublic.azureedge.net/clip/models/7e526bd135e493cef0776de27d5f42653e6b4c8bf9e0f653bb11773263205fdd/RN50x4.pt",
+     "RN50x16": "https://openaipublic.azureedge.net/clip/models/52378b407f34354e150460fe41077663dd5b39c54cd0bfd2b27167a4a06ec9aa/RN50x16.pt",
+     "RN50x64": "https://openaipublic.azureedge.net/clip/models/be1cfb55d75a9666199fb2206c106743da0f6468c9d327f3e0d0a543a9919d9c/RN50x64.pt",
+     "ViT-B/32": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
+     "ViT-B/16": "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt",
+     "ViT-L/14": "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt",
+     "ViT-L/14@336px": "https://openaipublic.azureedge.net/clip/models/3035c92b350959924f9f00213499208652fc7ea050643e8b385c2dac08641f02/ViT-L-14-336px.pt",
+ }
+
+
+ def _download(url: str, root: str):
+     os.makedirs(root, exist_ok=True)
+     filename = os.path.basename(url)
+
+     expected_sha256 = url.split("/")[-2]
+     download_target = os.path.join(root, filename)
+
+     if os.path.exists(download_target) and not os.path.isfile(download_target):
+         raise RuntimeError(f"{download_target} exists and is not a regular file")
+
+     if os.path.isfile(download_target):
+         if hashlib.sha256(open(download_target, "rb").read()).hexdigest() == expected_sha256:
+             return download_target
+         else:
+             warnings.warn(f"{download_target} exists, but the SHA256 checksum does not match; re-downloading the file")
+
+     with urllib.request.urlopen(url) as source, open(download_target, "wb") as output:
+         with tqdm(total=int(source.info().get("Content-Length")), ncols=80, unit='iB', unit_scale=True, unit_divisor=1024) as loop:
+             while True:
+                 buffer = source.read(8192)
+                 if not buffer:
+                     break
+
+                 output.write(buffer)
+                 loop.update(len(buffer))
+
+     if hashlib.sha256(open(download_target, "rb").read()).hexdigest() != expected_sha256:
+         raise RuntimeError("Model has been downloaded but the SHA256 checksum does not not match")
+
+     return download_target
+
+
+ def _convert_image_to_rgb(image):
+     return image.convert("RGB")
+
+
+ def _transform(n_px):
+     return Compose([
+         Resize(n_px, interpolation=BICUBIC),
+         CenterCrop(n_px),
+         _convert_image_to_rgb,
+         ToTensor(),
+         Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
+     ])
+
+
+ def available_models() -> List[str]:
+     """Returns the names of available CLIP models"""
+     return list(_MODELS.keys())
+
+
+ def load(name: str, device: Union[str, torch.device] = "cuda" if torch.cuda.is_available() else "cpu", jit: bool = False, download_root: str = None):
+     """Load a CLIP model
+
+     Parameters
+     ----------
+     name : str
+         A model name listed by `clip.available_models()`, or the path to a model checkpoint containing the state_dict
+
+     device : Union[str, torch.device]
+         The device to put the loaded model
+
+     jit : bool
+         Whether to load the optimized JIT model or more hackable non-JIT model (default).
+
+     download_root: str
+         path to download the model files; by default, it uses "~/.cache/clip"
+
+     Returns
+     -------
+     model : torch.nn.Module
+         The CLIP model
+
+     preprocess : Callable[[PIL.Image], torch.Tensor]
+         A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
+     """
+     if name in _MODELS:
+         model_path = _download(_MODELS[name], download_root or os.path.expanduser("~/.cache/clip"))
+     elif os.path.isfile(name):
+         model_path = name
+     else:
+         raise RuntimeError(f"Model {name} not found; available models = {available_models()}")
+
+     with open(model_path, 'rb') as opened_file:
+         try:
+             # loading JIT archive
+             model = torch.jit.load(opened_file, map_location=device if jit else "cpu").eval()
+             state_dict = None
+         except RuntimeError:
+             # loading saved state dict
+             if jit:
+                 warnings.warn(f"File {model_path} is not a JIT archive. Loading as a state dict instead")
+                 jit = False
+             state_dict = torch.load(opened_file, map_location="cpu")
+
+     if not jit:
+         model = build_model(state_dict or model.state_dict()).to(device)
+         if str(device) == "cpu":
+             model.float()
+         return model, _transform(model.visual.input_resolution)
+
+     # patch the device names
+     device_holder = torch.jit.trace(lambda: torch.ones([]).to(torch.device(device)), example_inputs=[])
+     device_node = [n for n in device_holder.graph.findAllNodes("prim::Constant") if "Device" in repr(n)][-1]
+
+     def patch_device(module):
+         try:
+             graphs = [module.graph] if hasattr(module, "graph") else []
+         except RuntimeError:
+             graphs = []
+
+         if hasattr(module, "forward1"):
+             graphs.append(module.forward1.graph)
+
+         for graph in graphs:
+             for node in graph.findAllNodes("prim::Constant"):
+                 if "value" in node.attributeNames() and str(node["value"]).startswith("cuda"):
+                     node.copyAttributes(device_node)
+
+     model.apply(patch_device)
+     patch_device(model.encode_image)
+     patch_device(model.encode_text)
+
+     # patch dtype to float32 on CPU
+     if str(device) == "cpu":
+         float_holder = torch.jit.trace(lambda: torch.ones([]).float(), example_inputs=[])
+         float_input = list(float_holder.graph.findNode("aten::to").inputs())[1]
+         float_node = float_input.node()
+
+         def patch_float(module):
+             try:
+                 graphs = [module.graph] if hasattr(module, "graph") else []
+             except RuntimeError:
+                 graphs = []
+
+             if hasattr(module, "forward1"):
+                 graphs.append(module.forward1.graph)
+
+             for graph in graphs:
+                 for node in graph.findAllNodes("aten::to"):
+                     inputs = list(node.inputs())
+                     for i in [1, 2]:  # dtype can be the second or third argument to aten::to()
+                         if inputs[i].node()["value"] == 5:
+                             inputs[i].node().copyAttributes(float_node)
+
+         model.apply(patch_float)
+         patch_float(model.encode_image)
+         patch_float(model.encode_text)
+
+         model.float()
+
+     return model, _transform(model.input_resolution.item())
+
+
+ def tokenize(texts: Union[str, List[str]], context_length: int = 77, truncate: bool = False) -> Union[torch.IntTensor, torch.LongTensor]:
+     """
+     Returns the tokenized representation of given input string(s)
+
+     Parameters
+     ----------
+     texts : Union[str, List[str]]
+         An input string or a list of input strings to tokenize
+
+     context_length : int
+         The context length to use; all CLIP models use 77 as the context length
+
+     truncate: bool
+         Whether to truncate the text in case its encoding is longer than the context length
+
+     Returns
+     -------
+     A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length].
+     We return LongTensor when torch version is <1.8.0, since older index_select requires indices to be long.
+     """
+     if isinstance(texts, str):
+         texts = [texts]
+
+     sot_token = _tokenizer.encoder["<|startoftext|>"]
+     eot_token = _tokenizer.encoder["<|endoftext|>"]
+     all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]
+     if packaging.version.parse(torch.__version__) < packaging.version.parse("1.8.0"):
+         result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)
+     else:
+         result = torch.zeros(len(all_tokens), context_length, dtype=torch.int)
+
+     for i, tokens in enumerate(all_tokens):
+         if len(tokens) > context_length:
+             if truncate:
+                 tokens = tokens[:context_length]
+                 tokens[-1] = eot_token
+             else:
+                 raise RuntimeError(f"Input {texts[i]} is too long for context length {context_length}")
+         result[i, :len(tokens)] = torch.tensor(tokens)
+
+     return result
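
The final loop of `tokenize` above either truncates an over-long sequence, overwriting its last slot with the end-of-text token, or raises an error, and then writes the tokens into a zero-filled row of length `context_length`. The sketch below reproduces only that pad-or-truncate rule on plain Python lists, so it runs without PyTorch or the BPE vocabulary. The function name `pad_or_truncate` and the small token IDs are placeholders chosen for illustration, not values from the CLIP tokenizer.

```python
def pad_or_truncate(tokens, context_length, eot_token, truncate=False):
    """Fit one token sequence into a fixed-length row, padding with zeros on the right."""
    if len(tokens) > context_length:
        if not truncate:
            raise RuntimeError(f"Input is too long for context length {context_length}")
        # Slicing copies the list, so the caller's sequence is left untouched.
        tokens = tokens[:context_length]
        # Keep an end-of-text marker in the last position after cutting.
        tokens[-1] = eot_token
    return tokens + [0] * (context_length - len(tokens))


# Placeholder IDs (assumption): 1 marks start-of-text, 2 marks end-of-text.
SOT, EOT = 1, 2

short = [SOT, 10, 11, EOT]
long = [SOT, 10, 11, 12, 13, 14, EOT]

padded = pad_or_truncate(short, context_length=6, eot_token=EOT)
clipped = pad_or_truncate(long, context_length=5, eot_token=EOT, truncate=True)
print(padded)
print(clipped)
```

Overwriting the last slot rather than appending keeps the row at exactly `context_length` entries while still ending the sequence with an end-of-text token.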
CLIP/clip/model.py ADDED
@@ -0,0 +1,436 @@
+ from collections import OrderedDict
+ from typing import Tuple, Union
+
+ import numpy as np
+ import torch
+ import torch.nn.functional as F
+ from torch import nn
+
+
+ class Bottleneck(nn.Module):
+     expansion = 4
+
+     def __init__(self, inplanes, planes, stride=1):
+         super().__init__()
+
+         # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
+         self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
+         self.bn1 = nn.BatchNorm2d(planes)
+         self.relu1 = nn.ReLU(inplace=True)
+
+         self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
+         self.bn2 = nn.BatchNorm2d(planes)
+         self.relu2 = nn.ReLU(inplace=True)
+
+         self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
+
+         self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
+         self.bn3 = nn.BatchNorm2d(planes * self.expansion)
+         self.relu3 = nn.ReLU(inplace=True)
+
+         self.downsample = None
+         self.stride = stride
+
+         if stride > 1 or inplanes != planes * Bottleneck.expansion:
+             # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
+             self.downsample = nn.Sequential(OrderedDict([
+                 ("-1", nn.AvgPool2d(stride)),
+                 ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
+                 ("1", nn.BatchNorm2d(planes * self.expansion))
+             ]))
+
+     def forward(self, x: torch.Tensor):
+         identity = x
+
+         out = self.relu1(self.bn1(self.conv1(x)))
+         out = self.relu2(self.bn2(self.conv2(out)))
+         out = self.avgpool(out)
+         out = self.bn3(self.conv3(out))
+
+         if self.downsample is not None:
+             identity = self.downsample(x)
+
+         out += identity
+         out = self.relu3(out)
+         return out
+
+
+ class AttentionPool2d(nn.Module):
+     def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
+         super().__init__()
+         self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
+         self.k_proj = nn.Linear(embed_dim, embed_dim)
+         self.q_proj = nn.Linear(embed_dim, embed_dim)
+         self.v_proj = nn.Linear(embed_dim, embed_dim)
+         self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
+         self.num_heads = num_heads
+
+     def forward(self, x):
+         x = x.flatten(start_dim=2).permute(2, 0, 1)  # NCHW -> (HW)NC
+         x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
+         x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
+         x, _ = F.multi_head_attention_forward(
+             query=x[:1], key=x, value=x,
+             embed_dim_to_check=x.shape[-1],
+             num_heads=self.num_heads,
+             q_proj_weight=self.q_proj.weight,
+             k_proj_weight=self.k_proj.weight,
+             v_proj_weight=self.v_proj.weight,
+             in_proj_weight=None,
+             in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
+             bias_k=None,
+             bias_v=None,
+             add_zero_attn=False,
+             dropout_p=0,
+             out_proj_weight=self.c_proj.weight,
+             out_proj_bias=self.c_proj.bias,
+             use_separate_proj_weight=True,
+             training=self.training,
+             need_weights=False
+         )
+         return x.squeeze(0)
+
+
+ class ModifiedResNet(nn.Module):
+     """
+     A ResNet class that is similar to torchvision's but contains the following changes:
+     - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
+     - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
+     - The final pooling layer is a QKV attention instead of an average pool
+     """
+
+     def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
+         super().__init__()
+         self.output_dim = output_dim
+         self.input_resolution = input_resolution
+
+         # the 3-layer stem
+         self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
+         self.bn1 = nn.BatchNorm2d(width // 2)
+         self.relu1 = nn.ReLU(inplace=True)
+         self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
+         self.bn2 = nn.BatchNorm2d(width // 2)
+         self.relu2 = nn.ReLU(inplace=True)
+         self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
+         self.bn3 = nn.BatchNorm2d(width)
+         self.relu3 = nn.ReLU(inplace=True)
+         self.avgpool = nn.AvgPool2d(2)
+
+         # residual layers
+         self._inplanes = width  # this is a *mutable* variable used during construction
+         self.layer1 = self._make_layer(width, layers[0])
+         self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
+         self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
+         self.layer4 = self._make_layer(width * 8, layers[3], stride=2)
+
+         embed_dim = width * 32  # the ResNet feature dimension
+         self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)
+
+     def _make_layer(self, planes, blocks, stride=1):
+         layers = [Bottleneck(self._inplanes, planes, stride)]
+
+         self._inplanes = planes * Bottleneck.expansion
+         for _ in range(1, blocks):
+             layers.append(Bottleneck(self._inplanes, planes))
+
+         return nn.Sequential(*layers)
+
+     def forward(self, x):
+         def stem(x):
+             x = self.relu1(self.bn1(self.conv1(x)))
+             x = self.relu2(self.bn2(self.conv2(x)))
+             x = self.relu3(self.bn3(self.conv3(x)))
+             x = self.avgpool(x)
+             return x
+
+         x = x.type(self.conv1.weight.dtype)
+         x = stem(x)
+         x = self.layer1(x)
+         x = self.layer2(x)
+         x = self.layer3(x)
+         x = self.layer4(x)
+         x = self.attnpool(x)
+
+         return x
+
+
157
+ class LayerNorm(nn.LayerNorm):
158
+ """Subclass torch's LayerNorm to handle fp16."""
159
+
160
+ def forward(self, x: torch.Tensor):
161
+ orig_type = x.dtype
162
+ ret = super().forward(x.type(torch.float32))
163
+ return ret.type(orig_type)
164
+
165
+
166
+ class QuickGELU(nn.Module):
167
+ def forward(self, x: torch.Tensor):
168
+ return x * torch.sigmoid(1.702 * x)
169
+
170
+
+ class ResidualAttentionBlock(nn.Module):
+     def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
+         super().__init__()
+
+         self.attn = nn.MultiheadAttention(d_model, n_head)
+         self.ln_1 = LayerNorm(d_model)
+         self.mlp = nn.Sequential(OrderedDict([
+             ("c_fc", nn.Linear(d_model, d_model * 4)),
+             ("gelu", QuickGELU()),
+             ("c_proj", nn.Linear(d_model * 4, d_model))
+         ]))
+         self.ln_2 = LayerNorm(d_model)
+         self.attn_mask = attn_mask
+
+     def attention(self, x: torch.Tensor):
+         self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
+         return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
+
+     def forward(self, x: torch.Tensor):
+         x = x + self.attention(self.ln_1(x))
+         x = x + self.mlp(self.ln_2(x))
+         return x
+
+
+ class Transformer(nn.Module):
+     def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
+         super().__init__()
+         self.width = width
+         self.layers = layers
+         self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])
+
+     def forward(self, x: torch.Tensor):
+         return self.resblocks(x)
+
+
+ class VisionTransformer(nn.Module):
+     def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
+         super().__init__()
+         self.input_resolution = input_resolution
+         self.output_dim = output_dim
+         self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)
+
+         scale = width ** -0.5
+         self.class_embedding = nn.Parameter(scale * torch.randn(width))
+         self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
+         self.ln_pre = LayerNorm(width)
+
+         self.transformer = Transformer(width, layers, heads)
+
+         self.ln_post = LayerNorm(width)
+         self.proj = nn.Parameter(scale * torch.randn(width, output_dim))
+
+     def forward(self, x: torch.Tensor):
+         x = self.conv1(x)  # shape = [*, width, grid, grid]
+         x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
+         x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
+         x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]
+         x = x + self.positional_embedding.to(x.dtype)
+         x = self.ln_pre(x)
+
+         x = x.permute(1, 0, 2)  # NLD -> LND
+         x = self.transformer(x)
+         x = x.permute(1, 0, 2)  # LND -> NLD
+
+         x = self.ln_post(x[:, 0, :])
+
+         if self.proj is not None:
+             x = x @ self.proj
+
+         return x
+
+
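The positional embedding in `VisionTransformer` has `(input_resolution // patch_size) ** 2 + 1` rows: one per image patch plus one for the class token. A minimal sketch of that arithmetic follows; `vit_sequence_length` is a hypothetical helper, and the resolutions and patch sizes are example values rather than settings taken from this commit.

```python
def vit_sequence_length(input_resolution: int, patch_size: int) -> int:
    """Number of tokens the transformer sees: grid * grid patches + 1 class token."""
    grid = input_resolution // patch_size
    return grid * grid + 1


# Example values only (assumed for illustration).
for resolution, patch in ((224, 32), (224, 16)):
    print(resolution, patch, vit_sequence_length(resolution, patch))
```

Halving the patch size roughly quadruples the sequence length, which is the main cost driver for the attention layers.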
+ class CLIP(nn.Module):
+     def __init__(self,
+                  embed_dim: int,
+                  # vision
+                  image_resolution: int,
+                  vision_layers: Union[Tuple[int, int, int, int], int],
+                  vision_width: int,
+                  vision_patch_size: int,
+                  # text
+                  context_length: int,
+                  vocab_size: int,
+                  transformer_width: int,
+                  transformer_heads: int,
+                  transformer_layers: int
+                  ):
+         super().__init__()
+
+         self.context_length = context_length
+
+         if isinstance(vision_layers, (tuple, list)):
+             vision_heads = vision_width * 32 // 64
+             self.visual = ModifiedResNet(
+                 layers=vision_layers,
+                 output_dim=embed_dim,
+                 heads=vision_heads,
+                 input_resolution=image_resolution,
+                 width=vision_width
+             )
+         else:
+             vision_heads = vision_width // 64
+             self.visual = VisionTransformer(
+                 input_resolution=image_resolution,
+                 patch_size=vision_patch_size,
+                 width=vision_width,
+                 layers=vision_layers,
+                 heads=vision_heads,
+                 output_dim=embed_dim
+             )
+
+         self.transformer = Transformer(
+             width=transformer_width,
+             layers=transformer_layers,
+             heads=transformer_heads,
+             attn_mask=self.build_attention_mask()
+         )
+
+         self.vocab_size = vocab_size
+         self.token_embedding = nn.Embedding(vocab_size, transformer_width)
+         self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
+         self.ln_final = LayerNorm(transformer_width)
+
+         self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
+         self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
+
+         self.initialize_parameters()
+
+     def initialize_parameters(self):
+         nn.init.normal_(self.token_embedding.weight, std=0.02)
+         nn.init.normal_(self.positional_embedding, std=0.01)
+
+         if isinstance(self.visual, ModifiedResNet):
+             if self.visual.attnpool is not None:
+                 std = self.visual.attnpool.c_proj.in_features ** -0.5
+                 nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
+                 nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
+                 nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
+                 nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)
+
+             for resnet_block in [self.visual.layer1, self.visual.layer2, self.visual.layer3, self.visual.layer4]:
+                 for name, param in resnet_block.named_parameters():
+                     if name.endswith("bn3.weight"):
+                         nn.init.zeros_(param)
+
+         proj_std = (self.transformer.width ** -0.5) * ((2 * self.transformer.layers) ** -0.5)
+         attn_std = self.transformer.width ** -0.5
+         fc_std = (2 * self.transformer.width) ** -0.5
+         for block in self.transformer.resblocks:
+             nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
+             nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
+             nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
+             nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
+
+         if self.text_projection is not None:
+             nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)
+
+     def build_attention_mask(self):
+         # lazily create causal attention mask, with full attention between the vision tokens
+         # pytorch uses additive attention mask; fill with -inf
+         mask = torch.empty(self.context_length, self.context_length)
+         mask.fill_(float("-inf"))
+         mask.triu_(1)  # keep -inf strictly above the diagonal; zero the diagonal and everything below it
+         return mask
+
+     @property
+     def dtype(self):
+         return self.visual.conv1.weight.dtype
+
+     def encode_image(self, image):
+         return self.visual(image.type(self.dtype))
+
+     def encode_text(self, text):
+         x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]
+
+         x = x + self.positional_embedding.type(self.dtype)
+         x = x.permute(1, 0, 2)  # NLD -> LND
+         x = self.transformer(x)
+         x = x.permute(1, 0, 2)  # LND -> NLD
+         x = self.ln_final(x).type(self.dtype)
+
+         # x.shape = [batch_size, n_ctx, transformer.width]
+         # take features from the eot embedding (eot_token is the highest number in each sequence)
+         x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
+
+         return x
+
+     def forward(self, image, text):
+         image_features = self.encode_image(image)
+         text_features = self.encode_text(text)
+
+         # normalized features
+         image_features = image_features / image_features.norm(dim=1, keepdim=True)
+         text_features = text_features / text_features.norm(dim=1, keepdim=True)
+
+         # cosine similarity as logits
+         logit_scale = self.logit_scale.exp()
+         logits_per_image = logit_scale * image_features @ text_features.t()
+         logits_per_text = logits_per_image.t()
+
+         # shape = [global_batch_size, global_batch_size]
+         return logits_per_image, logits_per_text
+
+
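To make the additive causal mask built by `build_attention_mask` concrete, here is a plain-Python sketch that reproduces the same pattern as nested lists for a small context length. It avoids torch so it can run anywhere; `build_causal_mask` is a name introduced for this illustration only.

```python
from typing import List


def build_causal_mask(context_length: int) -> List[List[float]]:
    """Additive mask: -inf where a position would attend to a later token, 0 elsewhere."""
    neg_inf = float("-inf")
    return [
        [neg_inf if col > row else 0.0 for col in range(context_length)]
        for row in range(context_length)
    ]


mask = build_causal_mask(4)
for row in mask:
    print(row)
```

Because the mask is added to the attention scores before the softmax, the `-inf` entries receive zero weight, so each text token only attends to itself and earlier tokens.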
+ def convert_weights(model: nn.Module):
+     """Convert applicable model parameters to fp16"""
+
+     def _convert_weights_to_fp16(l):
+         if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
+             l.weight.data = l.weight.data.half()
+             if l.bias is not None:
+                 l.bias.data = l.bias.data.half()
+
+         if isinstance(l, nn.MultiheadAttention):
+             for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
+                 tensor = getattr(l, attr)
+                 if tensor is not None:
+                     tensor.data = tensor.data.half()
+
+         for name in ["text_projection", "proj"]:
+             if hasattr(l, name):
+                 attr = getattr(l, name)
+                 if attr is not None:
+                     attr.data = attr.data.half()
+
+     model.apply(_convert_weights_to_fp16)
+
+
+ def build_model(state_dict: dict):
+     vit = "visual.proj" in state_dict
+
+     if vit:
+         vision_width = state_dict["visual.conv1.weight"].shape[0]
+         vision_layers = len([k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
+         vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
+         grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
+         image_resolution = vision_patch_size * grid_size
+     else:
+         counts: list = [len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
+         vision_layers = tuple(counts)
+         vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
+         output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
+         vision_patch_size = None
+         assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
+         image_resolution = output_width * 32
+
+     embed_dim = state_dict["text_projection"].shape[1]
+     context_length = state_dict["positional_embedding"].shape[0]
+     vocab_size = state_dict["token_embedding.weight"].shape[0]
+     transformer_width = state_dict["ln_final.weight"].shape[0]
+     transformer_heads = transformer_width // 64
+     transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith("transformer.resblocks")))
+
+     model = CLIP(
+         embed_dim,
+         image_resolution, vision_layers, vision_width, vision_patch_size,
+         context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
+     )
+
+     for key in ["input_resolution", "context_length", "vocab_size"]:
+         if key in state_dict:
+             del state_dict[key]
+
+     convert_weights(model)
+     model.load_state_dict(state_dict)
+     return model.eval()
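`build_model` recovers every hyperparameter from tensor shapes and key names in the checkpoint instead of reading a config file. The sketch below mirrors the ViT branch and the text-layer count using plain tuples in place of tensors. The helper names, the toy shapes and the toy keys are all assumptions made for illustration; they are not read from a real checkpoint.

```python
from typing import Dict, Iterable, Tuple


def infer_vit_geometry(shapes: Dict[str, Tuple[int, ...]]) -> Dict[str, int]:
    """Recover ViT hyperparameters from tensor shapes, following build_model's ViT branch."""
    conv_shape = shapes["visual.conv1.weight"]         # (width, 3, patch, patch)
    pos_shape = shapes["visual.positional_embedding"]  # (grid ** 2 + 1, width)
    patch_size = conv_shape[-1]
    grid_size = round((pos_shape[0] - 1) ** 0.5)
    return {
        "vision_width": conv_shape[0],
        "vision_patch_size": patch_size,
        "grid_size": grid_size,
        "image_resolution": patch_size * grid_size,
    }


def count_text_layers(keys: Iterable[str]) -> int:
    """Count distinct block indices among 'transformer.resblocks.<i>.*' keys."""
    return len({k.split(".")[2] for k in keys if k.startswith("transformer.resblocks")})


# Toy stand-ins for a state dict (hypothetical values).
toy_shapes = {
    "visual.conv1.weight": (768, 3, 32, 32),
    "visual.positional_embedding": (50, 768),
}
toy_keys = [
    "transformer.resblocks.0.ln_1.weight",
    "transformer.resblocks.0.attn.in_proj_weight",
    "transformer.resblocks.1.ln_1.weight",
    "visual.proj",
]

geometry = infer_vit_geometry(toy_shapes)
num_text_layers = count_text_layers(toy_keys)
print(geometry, num_text_layers)
```

Deriving the configuration this way keeps a checkpoint self-describing, at the cost of tying the loader to the exact parameter naming scheme.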
CLIP/clip/simple_tokenizer.py ADDED
@@ -0,0 +1,132 @@
+ import gzip
+ import html
+ import os
+ from functools import lru_cache
+
+ import ftfy
+ import regex as re
+
+
+ @lru_cache()
+ def default_bpe():
+     return os.path.join(os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz")
+
+
+ @lru_cache()
+ def bytes_to_unicode():
+     """
+     Returns a mapping from utf-8 byte values to unicode strings.
+     The reversible bpe codes work on unicode strings.
+     This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
+     When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
+     This is a significant percentage of your normal, say, 32K bpe vocab.
+     To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
+     The mapping also avoids whitespace/control characters, which the bpe code barfs on.
+     """
+     bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
+     cs = bs[:]
+     n = 0
+     for b in range(2**8):
+         if b not in bs:
+             bs.append(b)
+             cs.append(2**8+n)
+             n += 1
+     cs = [chr(n) for n in cs]
+     return dict(zip(bs, cs))
+
+
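A quick way to see what `bytes_to_unicode` provides is to check that the table covers all 256 byte values, that no two bytes share a character, and that text survives a round trip. The sketch below restates the function's logic without the cache decorator so it runs on its own; the sample string is arbitrary.

```python
def bytes_to_unicode():
    """Standalone restatement of the byte -> printable-character table."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            # Bytes without a printable form are shifted past code point 255.
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

text = "héllo wörld"
encoded = "".join(byte_encoder[b] for b in text.encode("utf-8"))
decoded = bytearray(byte_decoder[c] for c in encoded).decode("utf-8")
print(repr(encoded), repr(decoded))
```

Note that the space byte is remapped to a visible character, so the encoded string contains no whitespace for the BPE merges to trip over.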
+ def get_pairs(word):
+     """Return set of symbol pairs in a word.
+     Word is represented as tuple of symbols (symbols being variable-length strings).
+     """
+     pairs = set()
+     prev_char = word[0]
+     for char in word[1:]:
+         pairs.add((prev_char, char))
+         prev_char = char
+     return pairs
+
+
+ def basic_clean(text):
+     text = ftfy.fix_text(text)
+     text = html.unescape(html.unescape(text))
+     return text.strip()
+
+
+ def whitespace_clean(text):
+     text = re.sub(r'\s+', ' ', text)
+     text = text.strip()
+     return text
+
+
+ class SimpleTokenizer(object):
+     def __init__(self, bpe_path: str = default_bpe()):
+         self.byte_encoder = bytes_to_unicode()
+         self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
+         merges = gzip.open(bpe_path).read().decode("utf-8").split('\n')
+         merges = merges[1:49152-256-2+1]
+         merges = [tuple(merge.split()) for merge in merges]
+         vocab = list(bytes_to_unicode().values())
+         vocab = vocab + [v+'</w>' for v in vocab]
+         for merge in merges:
+             vocab.append(''.join(merge))
+         vocab.extend(['<|startoftext|>', '<|endoftext|>'])
+         self.encoder = dict(zip(vocab, range(len(vocab))))
+         self.decoder = {v: k for k, v in self.encoder.items()}
+         self.bpe_ranks = dict(zip(merges, range(len(merges))))
+         self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}
+         self.pat = re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", re.IGNORECASE)
+
+     def bpe(self, token):
+         if token in self.cache:
+             return self.cache[token]
+         word = tuple(token[:-1]) + ( token[-1] + '</w>',)
+         pairs = get_pairs(word)
+
+         if not pairs:
+             return token+'</w>'
+
+         while True:
+             bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
+             if bigram not in self.bpe_ranks:
+                 break
+             first, second = bigram
+             new_word = []
+             i = 0
+             while i < len(word):
+                 try:
+                     j = word.index(first, i)
+                     new_word.extend(word[i:j])
+                     i = j
+                 except ValueError:
+                     # `first` does not occur again; copy the rest of the word unchanged
+                     new_word.extend(word[i:])
+                     break
+
+                 if word[i] == first and i < len(word)-1 and word[i+1] == second:
+                     new_word.append(first+second)
+                     i += 2
+                 else:
+                     new_word.append(word[i])
+                     i += 1
+             new_word = tuple(new_word)
+             word = new_word
+             if len(word) == 1:
+                 break
+             else:
+                 pairs = get_pairs(word)
+         word = ' '.join(word)
+         self.cache[token] = word
+         return word
+
+     def encode(self, text):
+         bpe_tokens = []
+         text = whitespace_clean(basic_clean(text)).lower()
+         for token in re.findall(self.pat, text):
+             token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
+             bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
+         return bpe_tokens
+
+     def decode(self, tokens):
+         text = ''.join([self.decoder[token] for token in tokens])
+         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors="replace").replace('</w>', ' ')
+         return text
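The merge loop in `SimpleTokenizer.bpe` can be hard to follow in one pass, so here is a simplified sketch of the same idea on a two-entry merge table: repeatedly pick the adjacent pair with the lowest rank and fuse it, until no ranked pair remains. The function `toy_bpe` and the table `toy_ranks` are made up for illustration; the real tokenizer loads its ranks from the bundled vocabulary file and caches results.

```python
def get_pairs(word):
    """Set of adjacent symbol pairs in a tuple of symbols."""
    return {(a, b) for a, b in zip(word, word[1:])}


def toy_bpe(token, bpe_ranks):
    """Greedy lowest-rank-first merging, with '</w>' marking the end of the word."""
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    while len(word) > 1:
        bigram = min(get_pairs(word), key=lambda p: bpe_ranks.get(p, float("inf")))
        if bigram not in bpe_ranks:
            break  # no learned merge applies any more
        first, second = bigram
        merged = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return " ".join(word)


# Hypothetical merge table: lower rank means the merge was learned earlier.
toy_ranks = {("l", "o"): 0, ("lo", "w</w>"): 1}

print(toy_bpe("low", toy_ranks))
print(toy_bpe("lot", toy_ranks))
```

With this table, `low` collapses into a single symbol while `lot` stops after the first merge, because no rule joins `lo` with `t</w>`.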
CLIP/data/country211.md ADDED
@@ -0,0 +1,12 @@
+ # The Country211 Dataset
+
+ In the paper, we used an image classification dataset called Country211 to evaluate the model's capability on geolocation. To build it, we filtered the YFCC100M dataset for images that have GPS coordinates corresponding to an [ISO-3166 country code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) and created a balanced dataset by sampling 150 train images, 50 validation images, and 100 test images for each country.
+
+ The following command will download an 11GB archive containing the images and extract it into a subdirectory `country211`:
+
+ ```bash
+ wget https://openaipublic.azureedge.net/clip/data/country211.tgz
+ tar zxvf country211.tgz
+ ```
+
+ These images are a subset of the YFCC100M dataset. Use of the underlying media files is subject to the Creative Commons licenses chosen by their creators/uploaders. For more information about the YFCC100M dataset, visit [the official website](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/).
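For a sense of scale, the per-country sampling described above can be turned into split sizes. The sketch assumes the dataset covers 211 countries, which is inferred from the name Country211 rather than stated in the text; the per-country counts are the ones given in the README.

```python
# Assumption: 211 countries, inferred from the dataset name.
NUM_COUNTRIES = 211

# Per-country sample counts as stated in the README.
PER_COUNTRY = {"train": 150, "valid": 50, "test": 100}

split_sizes = {split: count * NUM_COUNTRIES for split, count in PER_COUNTRY.items()}
total_images = sum(split_sizes.values())

for split, size in split_sizes.items():
    print(f"{split}: {size}")
print(f"total: {total_images}")
```

Under that assumption the archive would hold 63,300 images, split 31,650 / 10,550 / 21,100 across train, validation and test.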
CLIP/data/prompts.md ADDED
@@ -0,0 +1,3401 @@
+ # Prompts for Image Classification
+
+ Below are the class names and templates that are used for collecting the zero-shot classification scores in the paper. Each dataset has two lists, `classes` and `templates`, where the string `{}` in each template is to be replaced with the corresponding class name. For the Facial Emotion Recognition 2013 dataset specifically, we used multiple class names for certain classes.
+
+ This file contains prompt data for 26 of the 27 datasets shown in Table 9 of the paper. The text prompts for ImageNet (as well as other [ImageNet Testbed](https://modestyachts.github.io/imagenet-testbed/) datasets in Figure 13) can be found in [this notebook](https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb), which also shows how to ensemble predictions from multiple prompts using these templates.
+
+ If you are viewing this document on GitHub, use the table of contents icon at the upper left to browse the datasets.
+
+
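The `classes` and `templates` lists described above combine by plain string formatting. A minimal sketch follows, using two Birdsnap class names from the list below and a single template that is assumed here for illustration; `build_prompts` is a hypothetical helper, not something defined in the repository.

```python
from typing import Dict, List


def build_prompts(classes: List[str], templates: List[str]) -> Dict[str, List[str]]:
    """Fill every template's `{}` placeholder with every class name."""
    return {name: [template.format(name) for template in templates] for name in classes}


classes = ["Acadian Flycatcher", "Acorn Woodpecker"]
templates = ["a photo of a {}, a type of bird."]  # assumed template, for illustration

prompts = build_prompts(classes, templates)
for name, texts in prompts.items():
    print(name, "->", texts)
```

Each class then has one prompt per template; with several templates, the resulting text embeddings can be averaged per class, which is the ensembling the notebook refers to.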
+ ## Birdsnap
+
+ ```bash
+ classes = [
+     'Acadian Flycatcher',
+     'Acorn Woodpecker',
+     'Alder Flycatcher',
+     'Allens Hummingbird',
+     'Altamira Oriole',
+     'American Avocet',
+     'American Bittern',
+     'American Black Duck',
+     'American Coot',
+     'American Crow',
+     'American Dipper',
+     'American Golden Plover',
+     'American Goldfinch',
+     'American Kestrel',
+     'American Oystercatcher',
+     'American Pipit',
+     'American Redstart',
+     'American Robin',
+     'American Three toed Woodpecker',
+     'American Tree Sparrow',
+     'American White Pelican',
+     'American Wigeon',
+     'American Woodcock',
+     'Anhinga',
+     'Annas Hummingbird',
+     'Arctic Tern',
+     'Ash throated Flycatcher',
+     'Audubons Oriole',
+     'Bairds Sandpiper',
+     'Bald Eagle',
+     'Baltimore Oriole',
+     'Band tailed Pigeon',
+     'Barn Swallow',
+     'Barred Owl',
+     'Barrows Goldeneye',
+     'Bay breasted Warbler',
+     'Bells Vireo',
+     'Belted Kingfisher',
+     'Bewicks Wren',
+     'Black Guillemot',
+     'Black Oystercatcher',
+     'Black Phoebe',
+     'Black Rosy Finch',
+     'Black Scoter',
+     'Black Skimmer',
+     'Black Tern',
+     'Black Turnstone',
+     'Black Vulture',
+     'Black and white Warbler',
+     'Black backed Woodpecker',
+     'Black bellied Plover',
+     'Black billed Cuckoo',
+     'Black billed Magpie',
+     'Black capped Chickadee',
+     'Black chinned Hummingbird',
+     'Black chinned Sparrow',
+     'Black crested Titmouse',
+     'Black crowned Night Heron',
+     'Black headed Grosbeak',
+     'Black legged Kittiwake',
+     'Black necked Stilt',
+     'Black throated Blue Warbler',
+     'Black throated Gray Warbler',
+     'Black throated Green Warbler',
+     'Black throated Sparrow',
+     'Blackburnian Warbler',
+     'Blackpoll Warbler',
+     'Blue Grosbeak',
+     'Blue Jay',
+     'Blue gray Gnatcatcher',
+     'Blue headed Vireo',
+     'Blue winged Teal',
+     'Blue winged Warbler',
+     'Boat tailed Grackle',
+     'Bobolink',
+     'Bohemian Waxwing',
+     'Bonapartes Gull',
+     'Boreal Chickadee',
+     'Brandts Cormorant',
+     'Brant',
+     'Brewers Blackbird',
+     'Brewers Sparrow',
+     'Bridled Titmouse',
+     'Broad billed Hummingbird',
+     'Broad tailed Hummingbird',
+     'Broad winged Hawk',
+     'Bronzed Cowbird',
+     'Brown Creeper',
+     'Brown Pelican',
+     'Brown Thrasher',
+     'Brown capped Rosy Finch',
+     'Brown crested Flycatcher',
+     'Brown headed Cowbird',
+     'Brown headed Nuthatch',
+     'Bufflehead',
+     'Bullocks Oriole',
+     'Burrowing Owl',
+     'Bushtit',
+     'Cackling Goose',
+     'Cactus Wren',
+     'California Gull',
+     'California Quail',
+     'California Thrasher',
+     'California Towhee',
+     'Calliope Hummingbird',
+     'Canada Goose',
+     'Canada Warbler',
+     'Canvasback',
+     'Canyon Towhee',
+     'Canyon Wren',
+     'Cape May Warbler',
+     'Carolina Chickadee',
+     'Carolina Wren',
+     'Caspian Tern',
+     'Cassins Finch',
+     'Cassins Kingbird',
+     'Cassins Sparrow',
+     'Cassins Vireo',
+     'Cattle Egret',
+     'Cave Swallow',
+     'Cedar Waxwing',
+     'Cerulean Warbler',
+     'Chestnut backed Chickadee',
+     'Chestnut collared Longspur',
+     'Chestnut sided Warbler',
+     'Chihuahuan Raven',
+     'Chimney Swift',
+     'Chipping Sparrow',
+     'Cinnamon Teal',
+     'Clapper Rail',
+     'Clarks Grebe',
+     'Clarks Nutcracker',
+     'Clay colored Sparrow',
+     'Cliff Swallow',
+     'Common Black Hawk',
+     'Common Eider',
+     'Common Gallinule',
+     'Common Goldeneye',
+     'Common Grackle',
+     'Common Ground Dove',
+     'Common Loon',
+     'Common Merganser',
+     'Common Murre',
+     'Common Nighthawk',
+     'Common Raven',
+     'Common Redpoll',
+     'Common Tern',
+     'Common Yellowthroat',
+     'Connecticut Warbler',
+     'Coopers Hawk',
+     'Cordilleran Flycatcher',
+     'Costas Hummingbird',
+     'Couchs Kingbird',
+     'Crested Caracara',
+     'Curve billed Thrasher',
+     'Dark eyed Junco',
+     'Dickcissel',
+     'Double crested Cormorant',
+     'Downy Woodpecker',
+     'Dunlin',
+     'Dusky Flycatcher',
+     'Dusky Grouse',
+     'Eared Grebe',
+     'Eastern Bluebird',
+     'Eastern Kingbird',
+     'Eastern Meadowlark',
+     'Eastern Phoebe',
+     'Eastern Screech Owl',
+     'Eastern Towhee',
+     'Eastern Wood Pewee',
+     'Elegant Trogon',
+     'Elf Owl',
+     'Eurasian Collared Dove',
+     'Eurasian Wigeon',
+     'European Starling',
+     'Evening Grosbeak',
+     'Ferruginous Hawk',
+     'Ferruginous Pygmy Owl',
+     'Field Sparrow',
+     'Fish Crow',
+     'Florida Scrub Jay',
+     'Forsters Tern',
+     'Fox Sparrow',
+     'Franklins Gull',
+     'Fulvous Whistling Duck',
+     'Gadwall',
+     'Gambels Quail',
+     'Gila Woodpecker',
+     'Glaucous Gull',
+     'Glaucous winged Gull',
+     'Glossy Ibis',
+     'Golden Eagle',
+     'Golden crowned Kinglet',
+     'Golden crowned Sparrow',
+     'Golden fronted Woodpecker',
+     'Golden winged Warbler',
+     'Grasshopper Sparrow',
+     'Gray Catbird',
+     'Gray Flycatcher',
+     'Gray Jay',
+     'Gray Kingbird',
+     'Gray cheeked Thrush',
+     'Gray crowned Rosy Finch',
+     'Great Black backed Gull',
+     'Great Blue Heron',
+     'Great Cormorant',
+     'Great Crested Flycatcher',
+     'Great Egret',
+     'Great Gray Owl',
+     'Great Horned Owl',
+     'Great Kiskadee',
+     'Great tailed Grackle',
+     'Greater Prairie Chicken',
+     'Greater Roadrunner',
+     'Greater Sage Grouse',
+     'Greater Scaup',
+     'Greater White fronted Goose',
+     'Greater Yellowlegs',
+     'Green Jay',
+     'Green tailed Towhee',
+     'Green winged Teal',
+     'Groove billed Ani',
+     'Gull billed Tern',
+     'Hairy Woodpecker',
+     'Hammonds Flycatcher',
+     'Harlequin Duck',
+     'Harriss Hawk',
+     'Harriss Sparrow',
+     'Heermanns Gull',
+     'Henslows Sparrow',
+     'Hepatic Tanager',
+     'Hermit Thrush',
+     'Herring Gull',
+     'Hoary Redpoll',
+     'Hooded Merganser',
+     'Hooded Oriole',
+     'Hooded Warbler',
+     'Horned Grebe',
+     'Horned Lark',
+     'House Finch',
+     'House Sparrow',
+     'House Wren',
+     'Huttons Vireo',
+     'Iceland Gull',
+     'Inca Dove',
+     'Indigo Bunting',
+     'Killdeer',
+     'King Rail',
+     'Ladder backed Woodpecker',
+     'Lapland Longspur',
+     'Lark Bunting',
+     'Lark Sparrow',
+     'Laughing Gull',
+     'Lazuli Bunting',
+     'Le Contes Sparrow',
+     'Least Bittern',
+     'Least Flycatcher',
+     'Least Grebe',
+     'Least Sandpiper',
+     'Least Tern',
+     'Lesser Goldfinch',
+     'Lesser Nighthawk',
+     'Lesser Scaup',
+     'Lesser Yellowlegs',
+     'Lewiss Woodpecker',
+     'Limpkin',
+     'Lincolns Sparrow',
+     'Little Blue Heron',
+     'Loggerhead Shrike',
+     'Long billed Curlew',
+     'Long billed Dowitcher',
+     'Long billed Thrasher',
+     'Long eared Owl',
+     'Long tailed Duck',
+     'Louisiana Waterthrush',
+     'Magnificent Frigatebird',
+     'Magnolia Warbler',
+     'Mallard',
+     'Marbled Godwit',
+     'Marsh Wren',
+     'Merlin',
+     'Mew Gull',
+     'Mexican Jay',
+     'Mississippi Kite',
+     'Monk Parakeet',
+     'Mottled Duck',
+     'Mountain Bluebird',
+     'Mountain Chickadee',
+     'Mountain Plover',
+     'Mourning Dove',
+     'Mourning Warbler',
+     'Muscovy Duck',
+     'Mute Swan',
+     'Nashville Warbler',
+     'Nelsons Sparrow',
+     'Neotropic Cormorant',
+     'Northern Bobwhite',
+     'Northern Cardinal',
+     'Northern Flicker',
+     'Northern Gannet',
+     'Northern Goshawk',
+     'Northern Harrier',
+     'Northern Hawk Owl',
+     'Northern Mockingbird',
+     'Northern Parula',
+     'Northern Pintail',
+     'Northern Rough winged Swallow',
+     'Northern Saw whet Owl',
+     'Northern Shrike',
+     'Northern Waterthrush',
+     'Nuttalls Woodpecker',
+     'Oak Titmouse',
+     'Olive Sparrow',
+     'Olive sided Flycatcher',
+     'Orange crowned Warbler',
+     'Orchard Oriole',
+     'Osprey',
+     'Ovenbird',
+     'Pacific Golden Plover',
+     'Pacific Loon',
+     'Pacific Wren',
+     'Pacific slope Flycatcher',
+     'Painted Bunting',
+     'Painted Redstart',
+     'Palm Warbler',
+     'Pectoral Sandpiper',
+     'Peregrine Falcon',
+     'Phainopepla',
+     'Philadelphia Vireo',
+     'Pied billed Grebe',
+     'Pigeon Guillemot',
+     'Pileated Woodpecker',
+     'Pine Grosbeak',
+     'Pine Siskin',
+     'Pine Warbler',
+     'Piping Plover',
+     'Plumbeous Vireo',
+     'Prairie Falcon',
+     'Prairie Warbler',
+     'Prothonotary Warbler',
+     'Purple Finch',
+     'Purple Gallinule',
+     'Purple Martin',
+     'Purple Sandpiper',
+     'Pygmy Nuthatch',
+     'Pyrrhuloxia',
+     'Red Crossbill',
+     'Red Knot',
+     'Red Phalarope',
+     'Red bellied Woodpecker',
+     'Red breasted Merganser',
+     'Red breasted Nuthatch',
+     'Red breasted Sapsucker',
+     'Red cockaded Woodpecker',
+     'Red eyed Vireo',
+     'Red headed Woodpecker',
+     'Red naped Sapsucker',
+     'Red necked Grebe',
+     'Red necked Phalarope',
+     'Red shouldered Hawk',
+     'Red tailed Hawk',
+     'Red throated Loon',
+     'Red winged Blackbird',
+     'Reddish Egret',
+     'Redhead',
+     'Ring billed Gull',
+     'Ring necked Duck',
+     'Ring necked Pheasant',
+     'Rock Pigeon',
+     'Rock Ptarmigan',
+     'Rock Sandpiper',
+     'Rock Wren',
+     'Rose breasted Grosbeak',
+     'Roseate Tern',
+     'Rosss Goose',
+     'Rough legged Hawk',
+     'Royal Tern',
+     'Ruby crowned Kinglet',
+     'Ruby throated Hummingbird',
+     'Ruddy Duck',
+     'Ruddy Turnstone',
+     'Ruffed Grouse',
+     'Rufous Hummingbird',
+     'Rufous crowned Sparrow',
+     'Rusty Blackbird',
+     'Sage Thrasher',
+     'Saltmarsh Sparrow',
+     'Sanderling',
+     'Sandhill Crane',
+     'Sandwich Tern',
+     'Says Phoebe',
+     'Scaled Quail',
+     'Scarlet Tanager',
+     'Scissor tailed Flycatcher',
+     'Scotts Oriole',
+     'Seaside Sparrow',
+     'Sedge Wren',
+     'Semipalmated Plover',
+     'Semipalmated Sandpiper',
+     'Sharp shinned Hawk',
+     'Sharp tailed Grouse',
+     'Short billed Dowitcher',
+     'Short eared Owl',
+     'Snail Kite',
+     'Snow Bunting',
+     'Snow Goose',
+     'Snowy Egret',
+     'Snowy Owl',
+     'Snowy Plover',
+     'Solitary Sandpiper',
424
+ 'Song Sparrow',
425
+ 'Sooty Grouse',
426
+ 'Sora',
427
+ 'Spotted Owl',
428
+ 'Spotted Sandpiper',
429
+ 'Spotted Towhee',
430
+ 'Spruce Grouse',
431
+ 'Stellers Jay',
432
+ 'Stilt Sandpiper',
433
+ 'Summer Tanager',
434
+ 'Surf Scoter',
435
+ 'Surfbird',
436
+ 'Swainsons Hawk',
437
+ 'Swainsons Thrush',
438
+ 'Swallow tailed Kite',
439
+ 'Swamp Sparrow',
440
+ 'Tennessee Warbler',
441
+ 'Thayers Gull',
442
+ 'Townsends Solitaire',
443
+ 'Townsends Warbler',
444
+ 'Tree Swallow',
445
+ 'Tricolored Heron',
446
+ 'Tropical Kingbird',
447
+ 'Trumpeter Swan',
448
+ 'Tufted Titmouse',
449
+ 'Tundra Swan',
450
+ 'Turkey Vulture',
451
+ 'Upland Sandpiper',
452
+ 'Varied Thrush',
453
+ 'Veery',
454
+ 'Verdin',
455
+ 'Vermilion Flycatcher',
456
+ 'Vesper Sparrow',
457
+ 'Violet green Swallow',
458
+ 'Virginia Rail',
459
+ 'Wandering Tattler',
460
+ 'Warbling Vireo',
461
+ 'Western Bluebird',
462
+ 'Western Grebe',
463
+ 'Western Gull',
464
+ 'Western Kingbird',
465
+ 'Western Meadowlark',
466
+ 'Western Sandpiper',
467
+ 'Western Screech Owl',
468
+ 'Western Scrub Jay',
469
+ 'Western Tanager',
470
+ 'Western Wood Pewee',
471
+ 'Whimbrel',
472
+ 'White Ibis',
473
+ 'White breasted Nuthatch',
474
+ 'White crowned Sparrow',
475
+ 'White eyed Vireo',
476
+ 'White faced Ibis',
477
+ 'White headed Woodpecker',
478
+ 'White rumped Sandpiper',
479
+ 'White tailed Hawk',
480
+ 'White tailed Kite',
481
+ 'White tailed Ptarmigan',
482
+ 'White throated Sparrow',
483
+ 'White throated Swift',
484
+ 'White winged Crossbill',
485
+ 'White winged Dove',
486
+ 'White winged Scoter',
487
+ 'Wild Turkey',
488
+ 'Willet',
489
+ 'Williamsons Sapsucker',
490
+ 'Willow Flycatcher',
491
+ 'Willow Ptarmigan',
492
+ 'Wilsons Phalarope',
493
+ 'Wilsons Plover',
494
+ 'Wilsons Snipe',
495
+ 'Wilsons Warbler',
496
+ 'Winter Wren',
497
+ 'Wood Stork',
498
+ 'Wood Thrush',
499
+ 'Worm eating Warbler',
500
+ 'Wrentit',
501
+ 'Yellow Warbler',
502
+ 'Yellow bellied Flycatcher',
503
+ 'Yellow bellied Sapsucker',
504
+ 'Yellow billed Cuckoo',
505
+ 'Yellow billed Magpie',
506
+ 'Yellow breasted Chat',
507
+ 'Yellow crowned Night Heron',
508
+ 'Yellow eyed Junco',
509
+ 'Yellow headed Blackbird',
510
+ 'Yellow rumped Warbler',
511
+ 'Yellow throated Vireo',
512
+ 'Yellow throated Warbler',
513
+ 'Zone tailed Hawk',
514
+ ]
515
+
516
+ templates = [
517
+ 'a photo of a {}, a type of bird.',
518
+ ]
519
+ ```
520
+
521
+
522
+
523
+ ## CIFAR10
524
+
525
+ ```bash
526
+ classes = [
527
+ 'airplane',
528
+ 'automobile',
529
+ 'bird',
530
+ 'cat',
531
+ 'deer',
532
+ 'dog',
533
+ 'frog',
534
+ 'horse',
535
+ 'ship',
536
+ 'truck',
537
+ ]
538
+
539
+ templates = [
540
+ 'a photo of a {}.',
541
+ 'a blurry photo of a {}.',
542
+ 'a black and white photo of a {}.',
543
+ 'a low contrast photo of a {}.',
544
+ 'a high contrast photo of a {}.',
545
+ 'a bad photo of a {}.',
546
+ 'a good photo of a {}.',
547
+ 'a photo of a small {}.',
548
+ 'a photo of a big {}.',
549
+ 'a photo of the {}.',
550
+ 'a blurry photo of the {}.',
551
+ 'a black and white photo of the {}.',
552
+ 'a low contrast photo of the {}.',
553
+ 'a high contrast photo of the {}.',
554
+ 'a bad photo of the {}.',
555
+ 'a good photo of the {}.',
556
+ 'a photo of the small {}.',
557
+ 'a photo of the big {}.',
558
+ ]
559
+ ```
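Each `templates` entry reads as a Python format string with a single `{}` slot for a class name. A minimal sketch of how the two lists could be combined, assuming plain `str.format` expansion; the helper name `build_prompts` is hypothetical, and only a subset of the CIFAR10 lists is repeated here:

```python
# Subsets of the CIFAR10 lists shown above, repeated so the sketch is self-contained.
classes = ['airplane', 'automobile', 'bird']
templates = [
    'a photo of a {}.',
    'a blurry photo of a {}.',
]


def build_prompts(classes, templates):
    """Hypothetical helper: return {class_name: [prompt, ...]}, one prompt per template."""
    return {name: [t.format(name) for t in templates] for name in classes}


prompts = build_prompts(classes, templates)
print(prompts['bird'])
```

With the full lists, the 10 classes and 18 templates would expand to 180 prompts.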
560
+
561
+
562
+
563
+ ## CIFAR100
564
+
565
+ ```bash
566
+ classes = [
567
+ 'apple',
568
+ 'aquarium fish',
569
+ 'baby',
570
+ 'bear',
571
+ 'beaver',
572
+ 'bed',
573
+ 'bee',
574
+ 'beetle',
575
+ 'bicycle',
576
+ 'bottle',
577
+ 'bowl',
578
+ 'boy',
579
+ 'bridge',
580
+ 'bus',
581
+ 'butterfly',
582
+ 'camel',
583
+ 'can',
584
+ 'castle',
585
+ 'caterpillar',
586
+ 'cattle',
587
+ 'chair',
588
+ 'chimpanzee',
589
+ 'clock',
590
+ 'cloud',
591
+ 'cockroach',
592
+ 'couch',
593
+ 'crab',
594
+ 'crocodile',
595
+ 'cup',
596
+ 'dinosaur',
597
+ 'dolphin',
598
+ 'elephant',
599
+ 'flatfish',
600
+ 'forest',
601
+ 'fox',
602
+ 'girl',
603
+ 'hamster',
604
+ 'house',
605
+ 'kangaroo',
606
+ 'keyboard',
607
+ 'lamp',
608
+ 'lawn mower',
609
+ 'leopard',
610
+ 'lion',
611
+ 'lizard',
612
+ 'lobster',
613
+ 'man',
614
+ 'maple tree',
615
+ 'motorcycle',
616
+ 'mountain',
617
+ 'mouse',
618
+ 'mushroom',
619
+ 'oak tree',
620
+ 'orange',
621
+ 'orchid',
622
+ 'otter',
623
+ 'palm tree',
624
+ 'pear',
625
+ 'pickup truck',
626
+ 'pine tree',
627
+ 'plain',
628
+ 'plate',
629
+ 'poppy',
630
+ 'porcupine',
631
+ 'possum',
632
+ 'rabbit',
633
+ 'raccoon',
634
+ 'ray',
635
+ 'road',
636
+ 'rocket',
637
+ 'rose',
638
+ 'sea',
639
+ 'seal',
640
+ 'shark',
641
+ 'shrew',
642
+ 'skunk',
643
+ 'skyscraper',
644
+ 'snail',
645
+ 'snake',
646
+ 'spider',
647
+ 'squirrel',
648
+ 'streetcar',
649
+ 'sunflower',
650
+ 'sweet pepper',
651
+ 'table',
652
+ 'tank',
653
+ 'telephone',
654
+ 'television',
655
+ 'tiger',
656
+ 'tractor',
657
+ 'train',
658
+ 'trout',
659
+ 'tulip',
660
+ 'turtle',
661
+ 'wardrobe',
662
+ 'whale',
663
+ 'willow tree',
664
+ 'wolf',
665
+ 'woman',
666
+ 'worm',
667
+ ]
668
+
669
+ templates = [
670
+ 'a photo of a {}.',
671
+ 'a blurry photo of a {}.',
672
+ 'a black and white photo of a {}.',
673
+ 'a low contrast photo of a {}.',
674
+ 'a high contrast photo of a {}.',
675
+ 'a bad photo of a {}.',
676
+ 'a good photo of a {}.',
677
+ 'a photo of a small {}.',
678
+ 'a photo of a big {}.',
679
+ 'a photo of the {}.',
680
+ 'a blurry photo of the {}.',
681
+ 'a black and white photo of the {}.',
682
+ 'a low contrast photo of the {}.',
683
+ 'a high contrast photo of the {}.',
684
+ 'a bad photo of the {}.',
685
+ 'a good photo of the {}.',
686
+ 'a photo of the small {}.',
687
+ 'a photo of the big {}.',
688
+ ]
689
+ ```
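One common way to use several templates per class is to average the text embeddings of all of a class's prompts into a single class vector. The sketch below only illustrates that idea and is not taken from this file: `toy_encode` is a made-up stand-in for a real text encoder, and `class_embedding` is a hypothetical helper name.

```python
import math


def toy_encode(text):
    """Stand-in for a text encoder (assumption): a tiny deterministic 3-d vector."""
    return [float(len(text)), float(text.count(' ')), float(text.count('a'))]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_embedding(class_name, templates, encode=toy_encode):
    """Average the normalized embeddings of every prompt for one class, then renormalize."""
    vectors = [l2_normalize(encode(t.format(class_name))) for t in templates]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    return l2_normalize(mean)


templates = ['a photo of a {}.', 'a blurry photo of a {}.']  # subset of the list above
vec = class_embedding('maple tree', templates)
print(vec)
```

Swapping `toy_encode` for a real encoder would leave the averaging logic unchanged.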
690
+
691
+
692
+
693
+ ## CLEVRCounts
694
+
695
+ ```bash
696
+ classes = [
697
+ '10',
698
+ '3',
699
+ '4',
700
+ '5',
701
+ '6',
702
+ '7',
703
+ '8',
704
+ '9',
705
+ ]
706
+
707
+ templates = [
708
+ 'a photo of {} objects.',
709
+ ]
710
+ ```
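The class order here is not numeric: `'10'` comes before `'3'`. That matches what sorting the labels as strings would produce, although the file does not say how the order was generated. A small sketch under that assumption, with a hypothetical `index_to_count` mapping for turning a predicted class index back into an integer:

```python
# Sorting the counts 3..10 as strings reproduces the order of the list above.
counts_as_strings = sorted(str(n) for n in range(3, 11))
print(counts_as_strings)

# Hypothetical lookup: predicted class index -> integer object count.
index_to_count = {i: int(label) for i, label in enumerate(counts_as_strings)}
print(index_to_count[0])
```

Keeping an explicit index-to-count mapping avoids assuming that class index `k` means `k + 3` objects.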
711
+
712
+
713
+
714
+ ## Caltech101
715
+
716
+ ```bash
717
+ classes = [
718
+ 'background',
719
+ 'off-center face',
720
+ 'centered face',
721
+ 'leopard',
722
+ 'motorbike',
723
+ 'accordion',
724
+ 'airplane',
725
+ 'anchor',
726
+ 'ant',
727
+ 'barrel',
728
+ 'bass',
729
+ 'beaver',
730
+ 'binocular',
731
+ 'bonsai',
732
+ 'brain',
733
+ 'brontosaurus',
734
+ 'buddha',
735
+ 'butterfly',
736
+ 'camera',
737
+ 'cannon',
738
+ 'side of a car',
739
+ 'ceiling fan',
740
+ 'cellphone',
741
+ 'chair',
742
+ 'chandelier',
743
+ 'body of a cougar cat',
744
+ 'face of a cougar cat',
745
+ 'crab',
746
+ 'crayfish',
747
+ 'crocodile',
748
+ 'head of a crocodile',
749
+ 'cup',
750
+ 'dalmatian',
751
+ 'dollar bill',
752
+ 'dolphin',
753
+ 'dragonfly',
754
+ 'electric guitar',
755
+ 'elephant',
756
+ 'emu',
757
+ 'euphonium',
758
+ 'ewer',
759
+ 'ferry',
760
+ 'flamingo',
761
+ 'head of a flamingo',
762
+ 'garfield',
763
+ 'gerenuk',
764
+ 'gramophone',
765
+ 'grand piano',
766
+ 'hawksbill',
767
+ 'headphone',
768
+ 'hedgehog',
769
+ 'helicopter',
770
+ 'ibis',
771
+ 'inline skate',
772
+ 'joshua tree',
773
+ 'kangaroo',
774
+ 'ketch',
775
+ 'lamp',
776
+ 'laptop',
777
+ 'llama',
778
+ 'lobster',
779
+ 'lotus',
780
+ 'mandolin',
781
+ 'mayfly',
782
+ 'menorah',
783
+ 'metronome',
784
+ 'minaret',
785
+ 'nautilus',
786
+ 'octopus',
787
+ 'okapi',
788
+ 'pagoda',
789
+ 'panda',
790
+ 'pigeon',
791
+ 'pizza',
792
+ 'platypus',
793
+ 'pyramid',
794
+ 'revolver',
795
+ 'rhino',
796
+ 'rooster',
797
+ 'saxophone',
798
+ 'schooner',
799
+ 'scissors',
800
+ 'scorpion',
801
+ 'sea horse',
802
+ 'snoopy (cartoon beagle)',
803
+ 'soccer ball',
804
+ 'stapler',
805
+ 'starfish',
806
+ 'stegosaurus',
807
+ 'stop sign',
808
+ 'strawberry',
809
+ 'sunflower',
810
+ 'tick',
811
+ 'trilobite',
812
+ 'umbrella',
813
+ 'watch',
814
+ 'water lilly',
815
+ 'wheelchair',
816
+ 'wild cat',
817
+ 'windsor chair',
818
+ 'wrench',
819
+ 'yin and yang symbol',
820
+ ]
821
+
822
+ templates = [
823
+ 'a photo of a {}.',
824
+ 'a painting of a {}.',
825
+ 'a plastic {}.',
826
+ 'a sculpture of a {}.',
827
+ 'a sketch of a {}.',
828
+ 'a tattoo of a {}.',
829
+ 'a toy {}.',
830
+ 'a rendition of a {}.',
831
+ 'a embroidered {}.',
832
+ 'a cartoon {}.',
833
+ 'a {} in a video game.',
834
+ 'a plushie {}.',
835
+ 'a origami {}.',
836
+ 'art of a {}.',
837
+ 'graffiti of a {}.',
838
+ 'a drawing of a {}.',
839
+ 'a doodle of a {}.',
840
+ 'a photo of the {}.',
841
+ 'a painting of the {}.',
842
+ 'the plastic {}.',
843
+ 'a sculpture of the {}.',
844
+ 'a sketch of the {}.',
845
+ 'a tattoo of the {}.',
846
+ 'the toy {}.',
847
+ 'a rendition of the {}.',
848
+ 'the embroidered {}.',
849
+ 'the cartoon {}.',
850
+ 'the {} in a video game.',
851
+ 'the plushie {}.',
852
+ 'the origami {}.',
853
+ 'art of the {}.',
854
+ 'graffiti of the {}.',
855
+ 'a drawing of the {}.',
856
+ 'a doodle of the {}.',
857
+ ]
858
+ ```
859
+
860
+
861
+
862
+ ## Country211
863
+
864
+ ```bash
865
+ classes = [
866
+ 'Andorra',
867
+ 'United Arab Emirates',
868
+ 'Afghanistan',
869
+ 'Antigua and Barbuda',
870
+ 'Anguilla',
871
+ 'Albania',
872
+ 'Armenia',
873
+ 'Angola',
874
+ 'Antarctica',
875
+ 'Argentina',
876
+ 'Austria',
877
+ 'Australia',
878
+ 'Aruba',
879
+ 'Aland Islands',
880
+ 'Azerbaijan',
881
+ 'Bosnia and Herzegovina',
882
+ 'Barbados',
883
+ 'Bangladesh',
884
+ 'Belgium',
885
+ 'Burkina Faso',
886
+ 'Bulgaria',
887
+ 'Bahrain',
888
+ 'Benin',
889
+ 'Bermuda',
890
+ 'Brunei Darussalam',
891
+ 'Bolivia',
892
+ 'Bonaire, Saint Eustatius and Saba',
893
+ 'Brazil',
894
+ 'Bahamas',
895
+ 'Bhutan',
896
+ 'Botswana',
897
+ 'Belarus',
898
+ 'Belize',
899
+ 'Canada',
900
+ 'DR Congo',
901
+ 'Central African Republic',
902
+ 'Switzerland',
903
+ "Cote d'Ivoire",
904
+ 'Cook Islands',
905
+ 'Chile',
906
+ 'Cameroon',
907
+ 'China',
908
+ 'Colombia',
909
+ 'Costa Rica',
910
+ 'Cuba',
911
+ 'Cabo Verde',
912
+ 'Curacao',
913
+ 'Cyprus',
914
+ 'Czech Republic',
915
+ 'Germany',
916
+ 'Denmark',
917
+ 'Dominica',
918
+ 'Dominican Republic',
919
+ 'Algeria',
920
+ 'Ecuador',
921
+ 'Estonia',
922
+ 'Egypt',
923
+ 'Spain',
924
+ 'Ethiopia',
925
+ 'Finland',
926
+ 'Fiji',
927
+ 'Falkland Islands',
928
+ 'Faeroe Islands',
929
+ 'France',
930
+ 'Gabon',
931
+ 'United Kingdom',
932
+ 'Grenada',
933
+ 'Georgia',
934
+ 'French Guiana',
935
+ 'Guernsey',
936
+ 'Ghana',
937
+ 'Gibraltar',
938
+ 'Greenland',
939
+ 'Gambia',
940
+ 'Guadeloupe',
941
+ 'Greece',
942
+ 'South Georgia and South Sandwich Is.',
943
+ 'Guatemala',
944
+ 'Guam',
945
+ 'Guyana',
946
+ 'Hong Kong',
947
+ 'Honduras',
948
+ 'Croatia',
949
+ 'Haiti',
950
+ 'Hungary',
951
+ 'Indonesia',
952
+ 'Ireland',
953
+ 'Israel',
954
+ 'Isle of Man',
955
+ 'India',
956
+ 'Iraq',
957
+ 'Iran',
958
+ 'Iceland',
959
+ 'Italy',
960
+ 'Jersey',
961
+ 'Jamaica',
962
+ 'Jordan',
963
+ 'Japan',
964
+ 'Kenya',
965
+ 'Kyrgyz Republic',
966
+ 'Cambodia',
967
+ 'St. Kitts and Nevis',
968
+ 'North Korea',
969
+ 'South Korea',
970
+ 'Kuwait',
971
+ 'Cayman Islands',
972
+ 'Kazakhstan',
973
+ 'Laos',
974
+ 'Lebanon',
975
+ 'St. Lucia',
976
+ 'Liechtenstein',
977
+ 'Sri Lanka',
978
+ 'Liberia',
979
+ 'Lithuania',
980
+ 'Luxembourg',
981
+ 'Latvia',
982
+ 'Libya',
983
+ 'Morocco',
984
+ 'Monaco',
985
+ 'Moldova',
986
+ 'Montenegro',
987
+ 'Saint-Martin',
988
+ 'Madagascar',
989
+ 'Macedonia',
990
+ 'Mali',
991
+ 'Myanmar',
992
+ 'Mongolia',
993
+ 'Macau',
994
+ 'Martinique',
995
+ 'Mauritania',
996
+ 'Malta',
997
+ 'Mauritius',
998
+ 'Maldives',
999
+ 'Malawi',
1000
+ 'Mexico',
1001
+ 'Malaysia',
1002
+ 'Mozambique',
1003
+ 'Namibia',
1004
+ 'New Caledonia',
1005
+ 'Nigeria',
1006
+ 'Nicaragua',
1007
+ 'Netherlands',
1008
+ 'Norway',
1009
+ 'Nepal',
1010
+ 'New Zealand',
1011
+ 'Oman',
1012
+ 'Panama',
1013
+ 'Peru',
1014
+ 'French Polynesia',
1015
+ 'Papua New Guinea',
1016
+ 'Philippines',
1017
+ 'Pakistan',
1018
+ 'Poland',
1019
+ 'Puerto Rico',
1020
+ 'Palestine',
1021
+ 'Portugal',
1022
+ 'Palau',
1023
+ 'Paraguay',
1024
+ 'Qatar',
1025
+ 'Reunion',
1026
+ 'Romania',
1027
+ 'Serbia',
1028
+ 'Russia',
1029
+ 'Rwanda',
1030
+ 'Saudi Arabia',
1031
+ 'Solomon Islands',
1032
+ 'Seychelles',
1033
+ 'Sudan',
1034
+ 'Sweden',
1035
+ 'Singapore',
1036
+ 'St. Helena',
1037
+ 'Slovenia',
1038
+ 'Svalbard and Jan Mayen Islands',
1039
+ 'Slovakia',
1040
+ 'Sierra Leone',
1041
+ 'San Marino',
1042
+ 'Senegal',
1043
+ 'Somalia',
1044
+ 'South Sudan',
1045
+ 'El Salvador',
1046
+ 'Sint Maarten',
1047
+ 'Syria',
1048
+ 'Eswatini',
1049
+ 'Togo',
1050
+ 'Thailand',
1051
+ 'Tajikistan',
1052
+ 'Timor-Leste',
1053
+ 'Turkmenistan',
1054
+ 'Tunisia',
1055
+ 'Tonga',
1056
+ 'Turkey',
1057
+ 'Trinidad and Tobago',
1058
+ 'Taiwan',
1059
+ 'Tanzania',
1060
+ 'Ukraine',
1061
+ 'Uganda',
1062
+ 'United States',
1063
+ 'Uruguay',
1064
+ 'Uzbekistan',
1065
+ 'Vatican',
1066
+ 'Venezuela',
1067
+ 'British Virgin Islands',
1068
+ 'United States Virgin Islands',
1069
+ 'Vietnam',
1070
+ 'Vanuatu',
1071
+ 'Samoa',
1072
+ 'Kosovo',
1073
+ 'Yemen',
1074
+ 'South Africa',
1075
+ 'Zambia',
1076
+ 'Zimbabwe',
1077
+ ]
1078
+
1079
+ templates = [
1080
+ 'a photo i took in {}.',
1081
+ 'a photo i took while visiting {}.',
1082
+ 'a photo from my home country of {}.',
1083
+ 'a photo from my visit to {}.',
1084
+ 'a photo showing the country of {}.',
1085
+ ]
1086
+ ```
1087
+
1088
+
1089
+
1090
+ ## DescribableTextures
1091
+
1092
+ ```bash
1093
+ classes = [
1094
+ 'banded',
1095
+ 'blotchy',
1096
+ 'braided',
1097
+ 'bubbly',
1098
+ 'bumpy',
1099
+ 'chequered',
1100
+ 'cobwebbed',
1101
+ 'cracked',
1102
+ 'crosshatched',
1103
+ 'crystalline',
1104
+ 'dotted',
1105
+ 'fibrous',
1106
+ 'flecked',
1107
+ 'freckled',
1108
+ 'frilly',
1109
+ 'gauzy',
1110
+ 'grid',
1111
+ 'grooved',
1112
+ 'honeycombed',
1113
+ 'interlaced',
1114
+ 'knitted',
1115
+ 'lacelike',
1116
+ 'lined',
1117
+ 'marbled',
1118
+ 'matted',
1119
+ 'meshed',
1120
+ 'paisley',
1121
+ 'perforated',
1122
+ 'pitted',
1123
+ 'pleated',
1124
+ 'polka-dotted',
1125
+ 'porous',
1126
+ 'potholed',
1127
+ 'scaly',
1128
+ 'smeared',
1129
+ 'spiralled',
1130
+ 'sprinkled',
1131
+ 'stained',
1132
+ 'stratified',
1133
+ 'striped',
1134
+ 'studded',
1135
+ 'swirly',
1136
+ 'veined',
1137
+ 'waffled',
1138
+ 'woven',
1139
+ 'wrinkled',
1140
+ 'zigzagged',
1141
+ ]
1142
+
1143
+ templates = [
1144
+ 'a photo of a {} texture.',
1145
+ 'a photo of a {} pattern.',
1146
+ 'a photo of a {} thing.',
1147
+ 'a photo of a {} object.',
1148
+ 'a photo of the {} texture.',
1149
+ 'a photo of the {} pattern.',
1150
+ 'a photo of the {} thing.',
1151
+ 'a photo of the {} object.',
1152
+ ]
1153
+ ```
1154
+
1155
+
1156
+
1157
+ ## EuroSAT
1158
+
1159
+ ```bash
1160
+ classes = [
1161
+ 'forest',
1162
+ 'permanent crop land',
1163
+ 'residential buildings or homes or apartments',
1164
+ 'river',
1165
+ 'pasture land',
1166
+ 'lake or sea',
1167
+ 'brushland or shrubland',
1168
+ 'annual crop land',
1169
+ 'industrial buildings or commercial buildings',
1170
+ 'highway or road',
1171
+ ]
1172
+
1173
+ templates = [
1174
+ 'a centered satellite photo of {}.',
1175
+ 'a centered satellite photo of a {}.',
1176
+ 'a centered satellite photo of the {}.',
1177
+ ]
1178
+ ```
1179
+
1180
+
1181
+
1182
+ ## FGVCAircraft
1183
+
1184
+ ```bash
1185
+ classes = [
1186
+ '707-320',
1187
+ '727-200',
1188
+ '737-200',
1189
+ '737-300',
1190
+ '737-400',
1191
+ '737-500',
1192
+ '737-600',
1193
+ '737-700',
1194
+ '737-800',
1195
+ '737-900',
1196
+ '747-100',
1197
+ '747-200',
1198
+ '747-300',
1199
+ '747-400',
1200
+ '757-200',
1201
+ '757-300',
1202
+ '767-200',
1203
+ '767-300',
1204
+ '767-400',
1205
+ '777-200',
1206
+ '777-300',
1207
+ 'A300B4',
1208
+ 'A310',
1209
+ 'A318',
1210
+ 'A319',
1211
+ 'A320',
1212
+ 'A321',
1213
+ 'A330-200',
1214
+ 'A330-300',
1215
+ 'A340-200',
1216
+ 'A340-300',
1217
+ 'A340-500',
1218
+ 'A340-600',
1219
+ 'A380',
1220
+ 'ATR-42',
1221
+ 'ATR-72',
1222
+ 'An-12',
1223
+ 'BAE 146-200',
1224
+ 'BAE 146-300',
1225
+ 'BAE-125',
1226
+ 'Beechcraft 1900',
1227
+ 'Boeing 717',
1228
+ 'C-130',
1229
+ 'C-47',
1230
+ 'CRJ-200',
1231
+ 'CRJ-700',
1232
+ 'CRJ-900',
1233
+ 'Cessna 172',
1234
+ 'Cessna 208',
1235
+ 'Cessna 525',
1236
+ 'Cessna 560',
1237
+ 'Challenger 600',
1238
+ 'DC-10',
1239
+ 'DC-3',
1240
+ 'DC-6',
1241
+ 'DC-8',
1242
+ 'DC-9-30',
1243
+ 'DH-82',
1244
+ 'DHC-1',
1245
+ 'DHC-6',
1246
+ 'DHC-8-100',
1247
+ 'DHC-8-300',
1248
+ 'DR-400',
1249
+ 'Dornier 328',
1250
+ 'E-170',
1251
+ 'E-190',
1252
+ 'E-195',
1253
+ 'EMB-120',
1254
+ 'ERJ 135',
1255
+ 'ERJ 145',
1256
+ 'Embraer Legacy 600',
1257
+ 'Eurofighter Typhoon',
1258
+ 'F-16A/B',
1259
+ 'F/A-18',
1260
+ 'Falcon 2000',
1261
+ 'Falcon 900',
1262
+ 'Fokker 100',
1263
+ 'Fokker 50',
1264
+ 'Fokker 70',
1265
+ 'Global Express',
1266
+ 'Gulfstream IV',
1267
+ 'Gulfstream V',
1268
+ 'Hawk T1',
1269
+ 'Il-76',
1270
+ 'L-1011',
1271
+ 'MD-11',
1272
+ 'MD-80',
1273
+ 'MD-87',
1274
+ 'MD-90',
1275
+ 'Metroliner',
1276
+ 'Model B200',
1277
+ 'PA-28',
1278
+ 'SR-20',
1279
+ 'Saab 2000',
1280
+ 'Saab 340',
1281
+ 'Spitfire',
1282
+ 'Tornado',
1283
+ 'Tu-134',
1284
+ 'Tu-154',
1285
+ 'Yak-42',
1286
+ ]
1287
+
1288
+ templates = [
1289
+ 'a photo of a {}, a type of aircraft.',
1290
+ 'a photo of the {}, a type of aircraft.',
1291
+ ]
1292
+ ```
1293
+
1294
+
1295
+
1296
+ ## FacialEmotionRecognition2013
1297
+
1298
+ ```bash
1299
+ classes = [
1300
+ ['angry'],
1301
+ ['disgusted'],
1302
+ ['fearful'],
1303
+ ['happy', 'smiling'],
1304
+ ['sad', 'depressed'],
1305
+ ['surprised', 'shocked', 'spooked'],
1306
+ ['neutral', 'bored'],
1307
+ ]
1308
+
1309
+ templates = [
1310
+ 'a photo of a {} looking face.',
1311
+ 'a photo of a face showing the emotion: {}.',
1312
+ 'a photo of a face looking {}.',
1313
+ 'a face that looks {}.',
1314
+ 'they look {}.',
1315
+ 'look at how {} they are.',
1316
+ ]
1317
+ ```
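Unlike the other datasets, each class here is a list of synonyms instead of a single string. A minimal sketch of one way to expand such nested classes, assuming every synonym is paired with every template and the results stay grouped by class index; `prompts_per_class` is a hypothetical helper, and only subsets of the lists are repeated:

```python
# Subsets of the lists above, repeated so the sketch is self-contained.
classes = [
    ['angry'],
    ['happy', 'smiling'],
    ['surprised', 'shocked', 'spooked'],
]
templates = ['a photo of a {} looking face.', 'they look {}.']


def prompts_per_class(classes, templates):
    """Return one prompt list per class index, covering each synonym and each template."""
    return [[t.format(word) for word in group for t in templates] for group in classes]


grouped = prompts_per_class(classes, templates)
print(grouped[1])
```

With the full lists, the 12 synonyms and 6 templates would give 72 prompts spread over 7 classes.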
1318
+
1319
+
1320
+
1321
+ ## Flowers102
1322
+
1323
+ ```bash
1324
+ classes = [
1325
+ 'pink primrose',
1326
+ 'hard-leaved pocket orchid',
1327
+ 'canterbury bells',
1328
+ 'sweet pea',
1329
+ 'english marigold',
1330
+ 'tiger lily',
1331
+ 'moon orchid',
1332
+ 'bird of paradise',
1333
+ 'monkshood',
1334
+ 'globe thistle',
1335
+ 'snapdragon',
1336
+ "colt's foot",
1337
+ 'king protea',
1338
+ 'spear thistle',
1339
+ 'yellow iris',
1340
+ 'globe flower',
1341
+ 'purple coneflower',
1342
+ 'peruvian lily',
1343
+ 'balloon flower',
1344
+ 'giant white arum lily',
1345
+ 'fire lily',
1346
+ 'pincushion flower',
1347
+ 'fritillary',
1348
+ 'red ginger',
1349
+ 'grape hyacinth',
1350
+ 'corn poppy',
1351
+ 'prince of wales feathers',
1352
+ 'stemless gentian',
1353
+ 'artichoke',
1354
+ 'sweet william',
1355
+ 'carnation',
1356
+ 'garden phlox',
1357
+ 'love in the mist',
1358
+ 'mexican aster',
1359
+ 'alpine sea holly',
1360
+ 'ruby-lipped cattleya',
1361
+ 'cape flower',
1362
+ 'great masterwort',
1363
+ 'siam tulip',
1364
+ 'lenten rose',
1365
+ 'barbeton daisy',
1366
+ 'daffodil',
1367
+ 'sword lily',
1368
+ 'poinsettia',
1369
+ 'bolero deep blue',
1370
+ 'wallflower',
1371
+ 'marigold',
1372
+ 'buttercup',
1373
+ 'oxeye daisy',
1374
+ 'common dandelion',
1375
+ 'petunia',
1376
+ 'wild pansy',
1377
+ 'primula',
1378
+ 'sunflower',
1379
+ 'pelargonium',
1380
+ 'bishop of llandaff',
1381
+ 'gaura',
1382
+ 'geranium',
1383
+ 'orange dahlia',
1384
+ 'pink and yellow dahlia',
1385
+ 'cautleya spicata',
1386
+ 'japanese anemone',
1387
+ 'black-eyed susan',
1388
+ 'silverbush',
1389
+ 'californian poppy',
1390
+ 'osteospermum',
1391
+ 'spring crocus',
1392
+ 'bearded iris',
1393
+ 'windflower',
1394
+ 'tree poppy',
1395
+ 'gazania',
1396
+ 'azalea',
1397
+ 'water lily',
1398
+ 'rose',
1399
+ 'thorn apple',
1400
+ 'morning glory',
1401
+ 'passion flower',
1402
+ 'lotus',
1403
+ 'toad lily',
1404
+ 'anthurium',
1405
+ 'frangipani',
1406
+ 'clematis',
1407
+ 'hibiscus',
1408
+ 'columbine',
1409
+ 'desert-rose',
1410
+ 'tree mallow',
1411
+ 'magnolia',
1412
+ 'cyclamen',
1413
+ 'watercress',
1414
+ 'canna lily',
1415
+ 'hippeastrum',
1416
+ 'bee balm',
1417
+ 'air plant',
1418
+ 'foxglove',
1419
+ 'bougainvillea',
1420
+ 'camellia',
1421
+ 'mallow',
1422
+ 'mexican petunia',
1423
+ 'bromelia',
1424
+ 'blanket flower',
1425
+ 'trumpet creeper',
1426
+ 'blackberry lily',
1427
+ ]
1428
+
1429
+ templates = [
1430
+ 'a photo of a {}, a type of flower.',
1431
+ ]
1432
+ ```
1433
+
1434
+
1435
+
1436
+ ## Food101
1437
+
1438
+ ```bash
1439
+ classes = [
1440
+ 'apple pie',
1441
+ 'baby back ribs',
1442
+ 'baklava',
1443
+ 'beef carpaccio',
1444
+ 'beef tartare',
1445
+ 'beet salad',
1446
+ 'beignets',
1447
+ 'bibimbap',
1448
+ 'bread pudding',
1449
+ 'breakfast burrito',
1450
+ 'bruschetta',
1451
+ 'caesar salad',
1452
+ 'cannoli',
1453
+ 'caprese salad',
1454
+ 'carrot cake',
1455
+ 'ceviche',
1456
+ 'cheese plate',
1457
+ 'cheesecake',
1458
+ 'chicken curry',
1459
+ 'chicken quesadilla',
1460
+ 'chicken wings',
1461
+ 'chocolate cake',
1462
+ 'chocolate mousse',
1463
+ 'churros',
1464
+ 'clam chowder',
1465
+ 'club sandwich',
1466
+ 'crab cakes',
1467
+ 'creme brulee',
1468
+ 'croque madame',
1469
+ 'cup cakes',
1470
+ 'deviled eggs',
1471
+ 'donuts',
1472
+ 'dumplings',
1473
+ 'edamame',
1474
+ 'eggs benedict',
1475
+ 'escargots',
1476
+ 'falafel',
1477
+ 'filet mignon',
1478
+ 'fish and chips',
1479
+ 'foie gras',
1480
+ 'french fries',
1481
+ 'french onion soup',
1482
+ 'french toast',
1483
+ 'fried calamari',
1484
+ 'fried rice',
1485
+ 'frozen yogurt',
1486
+ 'garlic bread',
1487
+ 'gnocchi',
1488
+ 'greek salad',
1489
+ 'grilled cheese sandwich',
1490
+ 'grilled salmon',
1491
+ 'guacamole',
1492
+ 'gyoza',
1493
+ 'hamburger',
1494
+ 'hot and sour soup',
1495
+ 'hot dog',
1496
+ 'huevos rancheros',
1497
+ 'hummus',
1498
+ 'ice cream',
1499
+ 'lasagna',
1500
+ 'lobster bisque',
1501
+ 'lobster roll sandwich',
1502
+ 'macaroni and cheese',
1503
+ 'macarons',
1504
+ 'miso soup',
1505
+ 'mussels',
1506
+ 'nachos',
1507
+ 'omelette',
1508
+ 'onion rings',
1509
+ 'oysters',
1510
+ 'pad thai',
1511
+ 'paella',
1512
+ 'pancakes',
1513
+ 'panna cotta',
1514
+ 'peking duck',
1515
+ 'pho',
1516
+ 'pizza',
1517
+ 'pork chop',
1518
+ 'poutine',
1519
+ 'prime rib',
1520
+ 'pulled pork sandwich',
1521
+ 'ramen',
1522
+ 'ravioli',
1523
+ 'red velvet cake',
1524
+ 'risotto',
1525
+ 'samosa',
1526
+ 'sashimi',
1527
+ 'scallops',
1528
+ 'seaweed salad',
1529
+ 'shrimp and grits',
1530
+ 'spaghetti bolognese',
1531
+ 'spaghetti carbonara',
1532
+ 'spring rolls',
1533
+ 'steak',
1534
+ 'strawberry shortcake',
1535
+ 'sushi',
1536
+ 'tacos',
1537
+ 'takoyaki',
1538
+ 'tiramisu',
1539
+ 'tuna tartare',
1540
+ 'waffles',
1541
+ ]
1542
+
1543
+ templates = [
1544
+ 'a photo of {}, a type of food.',
1545
+ ]
1546
+ ```
1547
+
1548
+
1549
+
1550
+ ## GTSRB
1551
+
1552
+ ```bash
1553
+ classes = [
1554
+ 'red and white circle 20 kph speed limit',
1555
+ 'red and white circle 30 kph speed limit',
1556
+ 'red and white circle 50 kph speed limit',
1557
+ 'red and white circle 60 kph speed limit',
1558
+ 'red and white circle 70 kph speed limit',
1559
+ 'red and white circle 80 kph speed limit',
1560
+ 'end / de-restriction of 80 kph speed limit',
1561
+ 'red and white circle 100 kph speed limit',
1562
+ 'red and white circle 120 kph speed limit',
1563
+ 'red and white circle red car and black car no passing',
1564
+ 'red and white circle red truck and black car no passing',
1565
+ 'red and white triangle road intersection warning',
1566
+ 'white and yellow diamond priority road',
1567
+ 'red and white upside down triangle yield right-of-way',
1568
+ 'stop',
1569
+ 'empty red and white circle',
1570
+ 'red and white circle no truck entry',
1571
+ 'red circle with white horizonal stripe no entry',
1572
+ 'red and white triangle with exclamation mark warning',
1573
+ 'red and white triangle with black left curve approaching warning',
1574
+ 'red and white triangle with black right curve approaching warning',
1575
+ 'red and white triangle with black double curve approaching warning',
1576
+ 'red and white triangle rough / bumpy road warning',
1577
+ 'red and white triangle car skidding / slipping warning',
1578
+ 'red and white triangle with merging / narrow lanes warning',
1579
+ 'red and white triangle with person digging / construction / road work warning',
1580
+ 'red and white triangle with traffic light approaching warning',
1581
+ 'red and white triangle with person walking warning',
1582
+ 'red and white triangle with child and person walking warning',
1583
+ 'red and white triangle with bicyle warning',
1584
+ 'red and white triangle with snowflake / ice warning',
1585
+ 'red and white triangle with deer warning',
1586
+ 'white circle with gray strike bar no speed limit',
1587
+ 'blue circle with white right turn arrow mandatory',
1588
+ 'blue circle with white left turn arrow mandatory',
1589
+ 'blue circle with white forward arrow mandatory',
1590
+ 'blue circle with white forward or right turn arrow mandatory',
1591
+ 'blue circle with white forward or left turn arrow mandatory',
1592
+ 'blue circle with white keep right arrow mandatory',
1593
+ 'blue circle with white keep left arrow mandatory',
1594
+ 'blue circle with white arrows indicating a traffic circle',
1595
+ 'white circle with gray strike bar indicating no passing for cars has ended',
1596
+ 'white circle with gray strike bar indicating no passing for trucks has ended',
1597
+ ]
1598
+
1599
+ templates = [
1600
+ 'a zoomed in photo of a "{}" traffic sign.',
1601
+ 'a centered photo of a "{}" traffic sign.',
1602
+ 'a close up photo of a "{}" traffic sign.',
1603
+ ]
1604
+ ```
1605
+
1606
+
1607
+
1608
+ ## HatefulMemes
1609
+
1610
+ ```bash
1611
+ classes = [
1612
+ 'meme',
1613
+ 'hatespeech meme',
1614
+ ]
1615
+
1616
+ templates = [
1617
+ 'a {}.',
1618
+ ]
1619
+ ```
1620
+
1621
+
1622
+
1623
+ ## KITTI
1624
+
1625
+ ```bash
1626
+ classes = [
1627
+ 'a photo i took of a car on my left or right side.',
1628
+ 'a photo i took with a car nearby.',
1629
+ 'a photo i took with a car in the distance.',
1630
+ 'a photo i took with no car.',
1631
+ ]
1632
+
1633
+ templates = [
1634
+ '{}',
1635
+ ]
1636
+ ```
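Here the class entries are already complete sentences and the only template is the bare placeholder `'{}'`, so expansion leaves them unchanged. A short sketch, assuming the same `str.format` expansion as for the other datasets and repeating only two of the classes:

```python
classes = [
    'a photo i took with a car nearby.',
    'a photo i took with no car.',
]
templates = ['{}']

# Formatting through '{}' is the identity, so each prompt equals its class string.
prompts = [t.format(c) for c in classes for t in templates]
print(prompts == classes)
```

This keeps the data shape identical to the other datasets, so the same expansion code can run on all of them.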
1637
+
1638
+
1639
+
1640
+ ## Kinetics700
1641
+
1642
+ ```bash
1643
+ classes = [
1644
+ 'abseiling',
1645
+ 'acting in play',
1646
+ 'adjusting glasses',
1647
+ 'air drumming',
1648
+ 'alligator wrestling',
1649
+ 'answering questions',
1650
+ 'applauding',
1651
+ 'applying cream',
1652
+ 'archaeological excavation',
1653
+ 'archery',
1654
+ 'arguing',
1655
+ 'arm wrestling',
1656
+ 'arranging flowers',
1657
+ 'arresting',
1658
+ 'assembling bicycle',
1659
+ 'assembling computer',
1660
+ 'attending conference',
1661
+ 'auctioning',
1662
+ 'baby waking up',
1663
+ 'backflip (human)',
1664
+ 'baking cookies',
1665
+ 'bandaging',
1666
+ 'barbequing',
1667
+ 'bartending',
1668
+ 'base jumping',
1669
+ 'bathing dog',
1670
+ 'battle rope training',
1671
+ 'beatboxing',
1672
+ 'bee keeping',
1673
+ 'being excited',
1674
+ 'being in zero gravity',
1675
+ 'belly dancing',
1676
+ 'bench pressing',
1677
+ 'bending back',
1678
+ 'bending metal',
1679
+ 'biking through snow',
1680
+ 'blasting sand',
1681
+ 'blending fruit',
1682
+ 'blowdrying hair',
1683
+ 'blowing bubble gum',
1684
+ 'blowing glass',
1685
+ 'blowing leaves',
1686
+ 'blowing nose',
1687
+ 'blowing out candles',
1688
+ 'bobsledding',
1689
+ 'bodysurfing',
1690
+ 'bookbinding',
1691
+ 'bottling',
1692
+ 'bouncing ball (not juggling)',
1693
+ 'bouncing on bouncy castle',
1694
+ 'bouncing on trampoline',
1695
+ 'bowling',
1696
+ 'braiding hair',
1697
+ 'breading or breadcrumbing',
1698
+ 'breakdancing',
1699
+ 'breaking boards',
1700
+ 'breaking glass',
1701
+ 'breathing fire',
1702
+ 'brush painting',
1703
+ 'brushing floor',
1704
+ 'brushing hair',
1705
+ 'brushing teeth',
1706
+ 'building cabinet',
1707
+ 'building lego',
1708
+ 'building sandcastle',
1709
+ 'building shed',
1710
+ 'bulldozing',
1711
+ 'bungee jumping',
1712
+ 'burping',
1713
+ 'busking',
1714
+ 'calculating',
1715
+ 'calligraphy',
1716
+ 'canoeing or kayaking',
1717
+ 'capoeira',
1718
+ 'capsizing',
1719
+ 'card stacking',
1720
+ 'card throwing',
1721
+ 'carrying baby',
1722
+ 'carrying weight',
1723
+ 'cartwheeling',
1724
+ 'carving ice',
1725
+ 'carving marble',
1726
+ 'carving pumpkin',
1727
+ 'carving wood with a knife',
1728
+ 'casting fishing line',
1729
+ 'catching fish',
1730
+ 'catching or throwing baseball',
1731
+ 'catching or throwing frisbee',
1732
+ 'catching or throwing softball',
1733
+ 'celebrating',
1734
+ 'changing gear in car',
1735
+ 'changing oil',
1736
+ 'changing wheel (not on bike)',
1737
+ 'chasing',
1738
+ 'checking tires',
1739
+ 'checking watch',
1740
+ 'cheerleading',
1741
+ 'chewing gum',
1742
+ 'chiseling stone',
1743
+ 'chiseling wood',
1744
+ 'chopping meat',
1745
+ 'chopping wood',
1746
+ 'clam digging',
1747
+ 'clapping',
1748
+ 'clay pottery making',
1749
+ 'clean and jerk',
1750
+ 'cleaning gutters',
1751
+ 'cleaning pool',
1752
+ 'cleaning shoes',
1753
+ 'cleaning toilet',
1754
+ 'cleaning windows',
1755
+ 'climbing a rope',
1756
+ 'climbing ladder',
1757
+ 'climbing tree',
1758
+ 'closing door',
1759
+ 'coloring in',
1760
+ 'combing hair',
1761
+ 'contact juggling',
1762
+ 'contorting',
1763
+ 'cooking chicken',
1764
+ 'cooking egg',
1765
+ 'cooking on campfire',
1766
+ 'cooking sausages (not on barbeque)',
1767
+ 'cooking scallops',
1768
+ 'cosplaying',
1769
+ 'coughing',
1770
+ 'counting money',
1771
+ 'country line dancing',
1772
+ 'cracking back',
1773
+ 'cracking knuckles',
1774
+ 'cracking neck',
1775
+ 'crawling baby',
1776
+ 'crocheting',
1777
+ 'crossing eyes',
1778
+ 'crossing river',
1779
+ 'crying',
1780
+ 'cumbia',
1781
+ 'curling (sport)',
1782
+ 'curling eyelashes',
1783
+ 'curling hair',
1784
+ 'cutting apple',
1785
+ 'cutting cake',
1786
+ 'cutting nails',
1787
+ 'cutting orange',
1788
+ 'cutting pineapple',
1789
+ 'cutting watermelon',
1790
+ 'dancing ballet',
1791
+ 'dancing charleston',
1792
+ 'dancing gangnam style',
1793
+ 'dancing macarena',
1794
+ 'deadlifting',
1795
+ 'dealing cards',
1796
+ 'decorating the christmas tree',
1797
+ 'decoupage',
1798
+ 'delivering mail',
1799
+ 'digging',
1800
+ 'dining',
1801
+ 'directing traffic',
1802
+ 'disc golfing',
1803
+ 'diving cliff',
1804
+ 'docking boat',
1805
+ 'dodgeball',
1806
+ 'doing aerobics',
1807
+ 'doing jigsaw puzzle',
1808
+ 'doing laundry',
1809
+ 'doing nails',
1810
+ 'doing sudoku',
1811
+ 'drawing',
1812
+ 'dribbling basketball',
1813
+ 'drinking shots',
1814
+ 'driving car',
1815
+ 'driving tractor',
1816
+ 'drooling',
1817
+ 'drop kicking',
1818
+ 'drumming fingers',
1819
+ 'dumpster diving',
1820
+ 'dunking basketball',
1821
+ 'dyeing eyebrows',
1822
+ 'dyeing hair',
1823
+ 'eating burger',
1824
+ 'eating cake',
1825
+ 'eating carrots',
1826
+ 'eating chips',
1827
+ 'eating doughnuts',
1828
+ 'eating hotdog',
1829
+ 'eating ice cream',
1830
+ 'eating nachos',
1831
+ 'eating spaghetti',
1832
+ 'eating watermelon',
1833
+ 'egg hunting',
1834
+ 'embroidering',
1835
+ 'entering church',
1836
+ 'exercising arm',
1837
+ 'exercising with an exercise ball',
1838
+ 'extinguishing fire',
1839
+ 'faceplanting',
1840
+ 'falling off bike',
1841
+ 'falling off chair',
1842
+ 'feeding birds',
1843
+ 'feeding fish',
1844
+ 'feeding goats',
1845
+ 'fencing (sport)',
1846
+ 'fidgeting',
1847
+ 'filling cake',
1848
+ 'filling eyebrows',
1849
+ 'finger snapping',
1850
+ 'fixing bicycle',
1851
+ 'fixing hair',
1852
+ 'flint knapping',
1853
+ 'flipping bottle',
1854
+ 'flipping pancake',
1855
+ 'fly tying',
1856
+ 'flying kite',
1857
+ 'folding clothes',
1858
+ 'folding napkins',
1859
+ 'folding paper',
1860
+ 'front raises',
1861
+ 'frying vegetables',
1862
+ 'gargling',
1863
+ 'geocaching',
1864
+ 'getting a haircut',
1865
+ 'getting a piercing',
1866
+ 'getting a tattoo',
1867
+ 'giving or receiving award',
1868
+ 'gold panning',
1869
+ 'golf chipping',
1870
+ 'golf driving',
1871
+ 'golf putting',
1872
+ 'gospel singing in church',
1873
+ 'grinding meat',
1874
+ 'grooming cat',
1875
+ 'grooming dog',
1876
+ 'grooming horse',
1877
+ 'gymnastics tumbling',
1878
+ 'hammer throw',
1879
+ 'hand washing clothes',
1880
+ 'head stand',
1881
+ 'headbanging',
1882
+ 'headbutting',
1883
+ 'helmet diving',
1884
+ 'herding cattle',
1885
+ 'high fiving',
1886
+ 'high jump',
1887
+ 'high kick',
1888
+ 'historical reenactment',
1889
+ 'hitting baseball',
1890
+ 'hockey stop',
1891
+ 'holding snake',
1892
+ 'home roasting coffee',
1893
+ 'hopscotch',
1894
+ 'hoverboarding',
1895
+ 'huddling',
1896
+ 'hugging (not baby)',
1897
+ 'hugging baby',
1898
+ 'hula hooping',
1899
+ 'hurdling',
1900
+ 'hurling (sport)',
1901
+ 'ice climbing',
1902
+ 'ice fishing',
1903
+ 'ice skating',
1904
+ 'ice swimming',
1905
+ 'inflating balloons',
1906
+ 'installing carpet',
1907
+ 'ironing',
1908
+ 'ironing hair',
1909
+ 'javelin throw',
1910
+ 'jaywalking',
1911
+ 'jetskiing',
1912
+ 'jogging',
1913
+ 'juggling balls',
1914
+ 'juggling fire',
1915
+ 'juggling soccer ball',
1916
+ 'jumping bicycle',
1917
+ 'jumping into pool',
1918
+ 'jumping jacks',
1919
+ 'jumping sofa',
1920
+ 'jumpstyle dancing',
1921
+ 'karaoke',
1922
+ 'kicking field goal',
1923
+ 'kicking soccer ball',
1924
+ 'kissing',
1925
+ 'kitesurfing',
1926
+ 'knitting',
1927
+ 'krumping',
1928
+ 'land sailing',
1929
+ 'laughing',
1930
+ 'lawn mower racing',
1931
+ 'laying bricks',
1932
+ 'laying concrete',
1933
+ 'laying decking',
1934
+ 'laying stone',
1935
+ 'laying tiles',
1936
+ 'leatherworking',
1937
+ 'letting go of balloon',
1938
+ 'licking',
1939
+ 'lifting hat',
1940
+ 'lighting candle',
1941
+ 'lighting fire',
1942
+ 'listening with headphones',
1943
+ 'lock picking',
1944
+ 'long jump',
1945
+ 'longboarding',
1946
+ 'looking at phone',
1947
+ 'looking in mirror',
1948
+ 'luge',
1949
+ 'lunge',
1950
+ 'making a cake',
1951
+ 'making a sandwich',
1952
+ 'making balloon shapes',
1953
+ 'making bubbles',
1954
+ 'making cheese',
1955
+ 'making horseshoes',
1956
+ 'making jewelry',
1957
+ 'making latte art',
1958
+ 'making paper aeroplanes',
1959
+ 'making pizza',
1960
+ 'making slime',
1961
+ 'making snowman',
1962
+ 'making sushi',
1963
+ 'making tea',
1964
+ 'making the bed',
1965
+ 'marching',
1966
+ 'marriage proposal',
1967
+ 'massaging back',
1968
+ 'massaging feet',
1969
+ 'massaging legs',
1970
+ 'massaging neck',
1971
+ "massaging person's head",
1972
+ 'metal detecting',
1973
+ 'milking cow',
1974
+ 'milking goat',
1975
+ 'mixing colours',
1976
+ 'moon walking',
1977
+ 'mopping floor',
1978
+ 'mosh pit dancing',
1979
+ 'motorcycling',
1980
+ 'mountain climber (exercise)',
1981
+ 'moving baby',
1982
+ 'moving child',
1983
+ 'moving furniture',
1984
+ 'mowing lawn',
1985
+ 'mushroom foraging',
1986
+ 'needle felting',
1987
+ 'news anchoring',
1988
+ 'opening bottle (not wine)',
1989
+ 'opening coconuts',
1990
+ 'opening door',
1991
+ 'opening present',
1992
+ 'opening refrigerator',
1993
+ 'opening wine bottle',
1994
+ 'packing',
1995
+ 'paragliding',
1996
+ 'parasailing',
1997
+ 'parkour',
1998
+ 'passing American football (in game)',
1999
+ 'passing American football (not in game)',
2000
+ 'passing soccer ball',
2001
+ 'peeling apples',
2002
+ 'peeling banana',
2003
+ 'peeling potatoes',
2004
+ 'person collecting garbage',
2005
+ 'petting animal (not cat)',
2006
+ 'petting cat',
2007
+ 'petting horse',
2008
+ 'photobombing',
2009
+ 'photocopying',
2010
+ 'picking apples',
2011
+ 'picking blueberries',
2012
+ 'pillow fight',
2013
+ 'pinching',
2014
+ 'pirouetting',
2015
+ 'planing wood',
2016
+ 'planting trees',
2017
+ 'plastering',
2018
+ 'playing accordion',
2019
+ 'playing american football',
2020
+ 'playing badminton',
2021
+ 'playing bagpipes',
2022
+ 'playing basketball',
2023
+ 'playing bass guitar',
2024
+ 'playing beer pong',
2025
+ 'playing billiards',
2026
+ 'playing blackjack',
2027
+ 'playing cards',
2028
+ 'playing cello',
2029
+ 'playing checkers',
2030
+ 'playing chess',
2031
+ 'playing clarinet',
2032
+ 'playing controller',
2033
+ 'playing cricket',
2034
+ 'playing cymbals',
2035
+ 'playing darts',
2036
+ 'playing didgeridoo',
2037
+ 'playing dominoes',
2038
+ 'playing drums',
2039
+ 'playing field hockey',
2040
+ 'playing flute',
2041
+ 'playing gong',
2042
+ 'playing guitar',
2043
+ 'playing hand clapping games',
2044
+ 'playing harmonica',
2045
+ 'playing harp',
2046
+ 'playing ice hockey',
2047
+ 'playing keyboard',
2048
+ 'playing kickball',
2049
+ 'playing laser tag',
2050
+ 'playing lute',
2051
+ 'playing mahjong',
2052
+ 'playing maracas',
2053
+ 'playing marbles',
2054
+ 'playing monopoly',
2055
+ 'playing netball',
2056
+ 'playing nose flute',
2057
+ 'playing oboe',
2058
+ 'playing ocarina',
2059
+ 'playing organ',
2060
+ 'playing paintball',
2061
+ 'playing pan pipes',
2062
+ 'playing piano',
2063
+ 'playing piccolo',
2064
+ 'playing pinball',
2065
+ 'playing ping pong',
2066
+ 'playing poker',
2067
+ 'playing polo',
2068
+ 'playing recorder',
2069
+ 'playing road hockey',
2070
+ 'playing rounders',
2071
+ 'playing rubiks cube',
2072
+ 'playing saxophone',
2073
+ 'playing scrabble',
2074
+ 'playing shuffleboard',
2075
+ 'playing slot machine',
2076
+ 'playing squash or racquetball',
2077
+ 'playing tennis',
2078
+ 'playing trombone',
2079
+ 'playing trumpet',
2080
+ 'playing ukulele',
2081
+ 'playing violin',
2082
+ 'playing volleyball',
2083
+ 'playing with trains',
2084
+ 'playing xylophone',
2085
+ 'poaching eggs',
2086
+ 'poking bellybutton',
2087
+ 'pole vault',
2088
+ 'polishing furniture',
2089
+ 'polishing metal',
2090
+ 'popping balloons',
2091
+ 'pouring beer',
2092
+ 'pouring milk',
2093
+ 'pouring wine',
2094
+ 'preparing salad',
2095
+ 'presenting weather forecast',
2096
+ 'pretending to be a statue',
2097
+ 'pull ups',
2098
+ 'pulling espresso shot',
2099
+ 'pulling rope (game)',
2100
+ 'pumping fist',
2101
+ 'pumping gas',
2102
+ 'punching bag',
2103
+ 'punching person (boxing)',
2104
+ 'push up',
2105
+ 'pushing car',
2106
+ 'pushing cart',
2107
+ 'pushing wheelbarrow',
2108
+ 'pushing wheelchair',
2109
+ 'putting in contact lenses',
2110
+ 'putting on eyeliner',
2111
+ 'putting on foundation',
2112
+ 'putting on lipstick',
2113
+ 'putting on mascara',
2114
+ 'putting on sari',
2115
+ 'putting on shoes',
2116
+ 'putting wallpaper on wall',
2117
+ 'raising eyebrows',
2118
+ 'reading book',
2119
+ 'reading newspaper',
2120
+ 'recording music',
2121
+ 'repairing puncture',
2122
+ 'riding a bike',
2123
+ 'riding camel',
2124
+ 'riding elephant',
2125
+ 'riding mechanical bull',
2126
+ 'riding mule',
2127
+ 'riding or walking with horse',
2128
+ 'riding scooter',
2129
+ 'riding snow blower',
2130
+ 'riding unicycle',
2131
+ 'ripping paper',
2132
+ 'roasting marshmallows',
2133
+ 'roasting pig',
2134
+ 'robot dancing',
2135
+ 'rock climbing',
2136
+ 'rock scissors paper',
2137
+ 'roller skating',
2138
+ 'rolling eyes',
2139
+ 'rolling pastry',
2140
+ 'rope pushdown',
2141
+ 'running on treadmill',
2142
+ 'sailing',
2143
+ 'salsa dancing',
2144
+ 'saluting',
2145
+ 'sanding floor',
2146
+ 'sanding wood',
2147
+ 'sausage making',
2148
+ 'sawing wood',
2149
+ 'scrambling eggs',
2150
+ 'scrapbooking',
2151
+ 'scrubbing face',
2152
+ 'scuba diving',
2153
+ 'seasoning food',
2154
+ 'separating eggs',
2155
+ 'setting table',
2156
+ 'sewing',
2157
+ 'shaking hands',
2158
+ 'shaking head',
2159
+ 'shaping bread dough',
2160
+ 'sharpening knives',
2161
+ 'sharpening pencil',
2162
+ 'shaving head',
2163
+ 'shaving legs',
2164
+ 'shearing sheep',
2165
+ 'shining flashlight',
2166
+ 'shining shoes',
2167
+ 'shoot dance',
2168
+ 'shooting basketball',
2169
+ 'shooting goal (soccer)',
2170
+ 'shooting off fireworks',
2171
+ 'shopping',
2172
+ 'shot put',
2173
+ 'shouting',
2174
+ 'shoveling snow',
2175
+ 'shredding paper',
2176
+ 'shucking oysters',
2177
+ 'shuffling cards',
2178
+ 'shuffling feet',
2179
+ 'side kick',
2180
+ 'sieving',
2181
+ 'sign language interpreting',
2182
+ 'silent disco',
2183
+ 'singing',
2184
+ 'sipping cup',
2185
+ 'situp',
2186
+ 'skateboarding',
2187
+ 'ski ballet',
2188
+ 'ski jumping',
2189
+ 'skiing crosscountry',
2190
+ 'skiing mono',
2191
+ 'skiing slalom',
2192
+ 'skipping rope',
2193
+ 'skipping stone',
2194
+ 'skydiving',
2195
+ 'slacklining',
2196
+ 'slapping',
2197
+ 'sled dog racing',
2198
+ 'sleeping',
2199
+ 'slicing onion',
2200
+ 'smashing',
2201
+ 'smelling feet',
2202
+ 'smoking',
2203
+ 'smoking hookah',
2204
+ 'smoking pipe',
2205
+ 'snatch weight lifting',
2206
+ 'sneezing',
2207
+ 'snorkeling',
2208
+ 'snowboarding',
2209
+ 'snowkiting',
2210
+ 'snowmobiling',
2211
+ 'somersaulting',
2212
+ 'spelunking',
2213
+ 'spinning plates',
2214
+ 'spinning poi',
2215
+ 'splashing water',
2216
+ 'spray painting',
2217
+ 'spraying',
2218
+ 'springboard diving',
2219
+ 'square dancing',
2220
+ 'squat',
2221
+ 'squeezing orange',
2222
+ 'stacking cups',
2223
+ 'stacking dice',
2224
+ 'standing on hands',
2225
+ 'staring',
2226
+ 'steer roping',
2227
+ 'steering car',
2228
+ 'sticking tongue out',
2229
+ 'stomping grapes',
2230
+ 'stretching arm',
2231
+ 'stretching leg',
2232
+ 'sucking lolly',
2233
+ 'surfing crowd',
2234
+ 'surfing water',
2235
+ 'surveying',
2236
+ 'sweeping floor',
2237
+ 'swimming backstroke',
2238
+ 'swimming breast stroke',
2239
+ 'swimming butterfly stroke',
2240
+ 'swimming front crawl',
2241
+ 'swimming with dolphins',
2242
+ 'swimming with sharks',
2243
+ 'swing dancing',
2244
+ 'swinging baseball bat',
2245
+ 'swinging on something',
2246
+ 'sword fighting',
2247
+ 'sword swallowing',
2248
+ 'tackling',
2249
+ 'tagging graffiti',
2250
+ 'tai chi',
2251
+ 'taking photo',
2252
+ 'talking on cell phone',
2253
+ 'tango dancing',
2254
+ 'tap dancing',
2255
+ 'tapping guitar',
2256
+ 'tapping pen',
2257
+ 'tasting beer',
2258
+ 'tasting food',
2259
+ 'tasting wine',
2260
+ 'testifying',
2261
+ 'texting',
2262
+ 'threading needle',
2263
+ 'throwing axe',
2264
+ 'throwing ball (not baseball or American football)',
2265
+ 'throwing discus',
2266
+ 'throwing knife',
2267
+ 'throwing snowballs',
2268
+ 'throwing tantrum',
2269
+ 'throwing water balloon',
2270
+ 'tickling',
2271
+ 'tie dying',
2272
+ 'tightrope walking',
2273
+ 'tiptoeing',
2274
+ 'tobogganing',
2275
+ 'tossing coin',
2276
+ 'tossing salad',
2277
+ 'training dog',
2278
+ 'trapezing',
2279
+ 'treating wood',
2280
+ 'trimming or shaving beard',
2281
+ 'trimming shrubs',
2282
+ 'trimming trees',
2283
+ 'triple jump',
2284
+ 'twiddling fingers',
2285
+ 'tying bow tie',
2286
+ 'tying knot (not on a tie)',
2287
+ 'tying necktie',
2288
+ 'tying shoe laces',
2289
+ 'unboxing',
2290
+ 'uncorking champagne',
2291
+ 'unloading truck',
2292
+ 'using a microscope',
2293
+ 'using a paint roller',
2294
+ 'using a power drill',
2295
+ 'using a sledge hammer',
2296
+ 'using a wrench',
2297
+ 'using atm',
2298
+ 'using bagging machine',
2299
+ 'using circular saw',
2300
+ 'using inhaler',
2301
+ 'using megaphone',
2302
+ 'using puppets',
2303
+ 'using remote controller (not gaming)',
2304
+ 'using segway',
2305
+ 'vacuuming car',
2306
+ 'vacuuming floor',
2307
+ 'visiting the zoo',
2308
+ 'wading through mud',
2309
+ 'wading through water',
2310
+ 'waiting in line',
2311
+ 'waking up',
2312
+ 'walking on stilts',
2313
+ 'walking the dog',
2314
+ 'walking through snow',
2315
+ 'walking with crutches',
2316
+ 'washing dishes',
2317
+ 'washing feet',
2318
+ 'washing hair',
2319
+ 'washing hands',
2320
+ 'watching tv',
2321
+ 'water skiing',
2322
+ 'water sliding',
2323
+ 'watering plants',
2324
+ 'waving hand',
2325
+ 'waxing armpits',
2326
+ 'waxing back',
2327
+ 'waxing chest',
2328
+ 'waxing eyebrows',
2329
+ 'waxing legs',
2330
+ 'weaving basket',
2331
+ 'weaving fabric',
2332
+ 'welding',
2333
+ 'whistling',
2334
+ 'windsurfing',
2335
+ 'winking',
2336
+ 'wood burning (art)',
2337
+ 'wrapping present',
2338
+ 'wrestling',
2339
+ 'writing',
2340
+ 'yarn spinning',
2341
+ 'yawning',
2342
+ 'yoga',
2343
+ 'zumba'
2344
+ ]
2345
+
2346
+ templates = [
2347
+ 'a photo of {}.',
2348
+ 'a photo of a person {}.',
2349
+ 'a photo of a person using {}.',
2350
+ 'a photo of a person doing {}.',
2351
+ 'a photo of a person during {}.',
2352
+ 'a photo of a person performing {}.',
2353
+ 'a photo of a person practicing {}.',
2354
+ 'a video of {}.',
2355
+ 'a video of a person {}.',
2356
+ 'a video of a person using {}.',
2357
+ 'a video of a person doing {}.',
2358
+ 'a video of a person during {}.',
2359
+ 'a video of a person performing {}.',
2360
+ 'a video of a person practicing {}.',
2361
+ 'a example of {}.',
2362
+ 'a example of a person {}.',
2363
+ 'a example of a person using {}.',
2364
+ 'a example of a person doing {}.',
2365
+ 'a example of a person during {}.',
2366
+ 'a example of a person performing {}.',
2367
+ 'a example of a person practicing {}.',
2368
+ 'a demonstration of {}.',
2369
+ 'a demonstration of a person {}.',
2370
+ 'a demonstration of a person using {}.',
2371
+ 'a demonstration of a person doing {}.',
2372
+ 'a demonstration of a person during {}.',
2373
+ 'a demonstration of a person performing {}.',
2374
+ 'a demonstration of a person practicing {}.',
2375
+ ]
2376
+ ```
2377
+
2378
+
2379
+
2380
+ ## MNIST
2381
+
2382
+ ```bash
2383
+ classes = [
2384
+ '0',
2385
+ '1',
2386
+ '2',
2387
+ '3',
2388
+ '4',
2389
+ '5',
2390
+ '6',
2391
+ '7',
2392
+ '8',
2393
+ '9',
2394
+ ]
2395
+
2396
+ templates = [
2397
+ 'a photo of the number: "{}".',
2398
+ ]
2399
+ ```
2400
+
2401
+
2402
+
2403
+ ## OxfordPets
2404
+
2405
+ ```bash
2406
+ classes = [
2407
+ 'Abyssinian',
2408
+ 'Bengal',
2409
+ 'Birman',
2410
+ 'Bombay',
2411
+ 'British Shorthair',
2412
+ 'Egyptian Mau',
2413
+ 'Maine Coon',
2414
+ 'Persian',
2415
+ 'Ragdoll',
2416
+ 'Russian Blue',
2417
+ 'Siamese',
2418
+ 'Sphynx',
2419
+ 'american bulldog',
2420
+ 'american pit bull terrier',
2421
+ 'basset hound',
2422
+ 'beagle',
2423
+ 'boxer',
2424
+ 'chihuahua',
2425
+ 'english cocker spaniel',
2426
+ 'english setter',
2427
+ 'german shorthaired',
2428
+ 'great pyrenees',
2429
+ 'havanese',
2430
+ 'japanese chin',
2431
+ 'keeshond',
2432
+ 'leonberger',
2433
+ 'miniature pinscher',
2434
+ 'newfoundland',
2435
+ 'pomeranian',
2436
+ 'pug',
2437
+ 'saint bernard',
2438
+ 'samoyed',
2439
+ 'scottish terrier',
2440
+ 'shiba inu',
2441
+ 'staffordshire bull terrier',
2442
+ 'wheaten terrier',
2443
+ 'yorkshire terrier',
2444
+ ]
2445
+
2446
+ templates = [
2447
+ 'a photo of a {}, a type of pet.',
2448
+ ]
2449
+ ```
2450
+
2451
+
2452
+
2453
+ ## PascalVOC2007
2454
+
2455
+ ```bash
2456
+ classes = [
2457
+ 'aeroplane',
2458
+ 'bicycle',
2459
+ 'bird',
2460
+ 'boat',
2461
+ 'bottle',
2462
+ 'bus',
2463
+ 'car',
2464
+ 'cat',
2465
+ 'chair',
2466
+ 'cow',
2467
+ 'dog',
2468
+ 'horse',
2469
+ 'motorbike',
2470
+ 'person',
2471
+ 'sheep',
2472
+ 'sofa',
2473
+ 'diningtable',
2474
+ 'pottedplant',
2475
+ 'train',
2476
+ 'tvmonitor',
2477
+ ]
2478
+
2479
+ templates = [
2480
+ 'a photo of a {}.',
2481
+ ]
2482
+ ```
2483
+
2484
+
2485
+
2486
+ ## PatchCamelyon
2487
+
2488
+ ```bash
2489
+ classes = [
2490
+ 'lymph node',
2491
+ 'lymph node containing metastatic tumor tissue',
2492
+ ]
2493
+
2494
+ templates = [
2495
+ 'this is a photo of {}',
2496
+ ]
2497
+ ```
2498
+
2499
+
2500
+
2501
+ ## RESISC45
2502
+
2503
+ ```bash
2504
+ classes = [
2505
+ 'airplane',
2506
+ 'airport',
2507
+ 'baseball diamond',
2508
+ 'basketball court',
2509
+ 'beach',
2510
+ 'bridge',
2511
+ 'chaparral',
2512
+ 'church',
2513
+ 'circular farmland',
2514
+ 'cloud',
2515
+ 'commercial area',
2516
+ 'dense residential',
2517
+ 'desert',
2518
+ 'forest',
2519
+ 'freeway',
2520
+ 'golf course',
2521
+ 'ground track field',
2522
+ 'harbor',
2523
+ 'industrial area',
2524
+ 'intersection',
2525
+ 'island',
2526
+ 'lake',
2527
+ 'meadow',
2528
+ 'medium residential',
2529
+ 'mobile home park',
2530
+ 'mountain',
2531
+ 'overpass',
2532
+ 'palace',
2533
+ 'parking lot',
2534
+ 'railway',
2535
+ 'railway station',
2536
+ 'rectangular farmland',
2537
+ 'river',
2538
+ 'roundabout',
2539
+ 'runway',
2540
+ 'sea ice',
2541
+ 'ship',
2542
+ 'snowberg',
2543
+ 'sparse residential',
2544
+ 'stadium',
2545
+ 'storage tank',
2546
+ 'tennis court',
2547
+ 'terrace',
2548
+ 'thermal power station',
2549
+ 'wetland',
2550
+ ]
2551
+
2552
+ templates = [
2553
+ 'satellite imagery of {}.',
2554
+ 'aerial imagery of {}.',
2555
+ 'satellite photo of {}.',
2556
+ 'aerial photo of {}.',
2557
+ 'satellite view of {}.',
2558
+ 'aerial view of {}.',
2559
+ 'satellite imagery of a {}.',
2560
+ 'aerial imagery of a {}.',
2561
+ 'satellite photo of a {}.',
2562
+ 'aerial photo of a {}.',
2563
+ 'satellite view of a {}.',
2564
+ 'aerial view of a {}.',
2565
+ 'satellite imagery of the {}.',
2566
+ 'aerial imagery of the {}.',
2567
+ 'satellite photo of the {}.',
2568
+ 'aerial photo of the {}.',
2569
+ 'satellite view of the {}.',
2570
+ 'aerial view of the {}.',
2571
+ ]
2572
+ ```
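The 18 RESISC45 templates above follow a regular pattern: two image sources, three nouns, and three article variants. The sketch below is only an illustration of that pattern, not how the file was actually produced; the names `sources`, `nouns`, and `articles` are hypothetical and do not appear in the repository.

```python
from itertools import product

# Hypothetical building blocks inferred from the listed templates.
articles = ["", "a ", "the "]          # no article, indefinite, definite
nouns = ["imagery", "photo", "view"]
sources = ["satellite", "aerial"]

# product() varies the last iterable fastest, which matches the order
# of the list above: source changes first, then noun, then article.
templates = [
    f"{source} {noun} of {article}{{}}."
    for article, noun, source in product(articles, nouns, sources)
]

for template in templates:
    print(template)
```

Writing the list this way makes it easier to see which axis a new variant (for example, another noun) would extend.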
2573
+
2574
+
2575
+
2576
+ ## SST2
2577
+
2578
+ ```bash
2579
+ classes = [
2580
+ 'negative',
2581
+ 'positive',
2582
+ ]
2583
+
2584
+ templates = [
2585
+ 'a {} review of a movie.',
2586
+ ]
2587
+ ```
2588
+
2589
+
2590
+
2591
+ ## STL10
2592
+
2593
+ ```bash
2594
+ classes = [
2595
+ 'airplane',
2596
+ 'bird',
2597
+ 'car',
2598
+ 'cat',
2599
+ 'deer',
2600
+ 'dog',
2601
+ 'horse',
2602
+ 'monkey',
2603
+ 'ship',
2604
+ 'truck',
2605
+ ]
2606
+
2607
+ templates = [
2608
+ 'a photo of a {}.',
2609
+ 'a photo of the {}.',
2610
+ ]
2611
+ ```
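As a minimal sketch of how a `classes` list and a `templates` list like the ones above might be combined, the snippet below fills every template with every class name using `str.format`. It assumes simple string substitution; the helper `build_prompts` is hypothetical and is not defined in this file.

```python
classes = [
    "airplane", "bird", "car", "cat", "deer",
    "dog", "horse", "monkey", "ship", "truck",
]

templates = [
    "a photo of a {}.",
    "a photo of the {}.",
]


def build_prompts(classes, templates):
    """Map each class name to its list of filled-in prompt strings."""
    return {
        name: [template.format(name) for template in templates]
        for name in classes
    }


prompts = build_prompts(classes, templates)
print(prompts["dog"])
```

Each class ends up with one prompt per template, so this example yields 10 classes with 2 prompts each.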
2612
+
2613
+
2614
+
2615
+ ## SUN397
2616
+
2617
+ ```bash
2618
+ classes = [
2619
+ 'abbey',
2620
+ 'airplane cabin',
2621
+ 'airport terminal',
2622
+ 'alley',
2623
+ 'amphitheater',
2624
+ 'amusement arcade',
2625
+ 'amusement park',
2626
+ 'anechoic chamber',
2627
+ 'apartment building outdoor',
2628
+ 'apse indoor',
2629
+ 'aquarium',
2630
+ 'aqueduct',
2631
+ 'arch',
2632
+ 'archive',
2633
+ 'arrival gate outdoor',
2634
+ 'art gallery',
2635
+ 'art school',
2636
+ 'art studio',
2637
+ 'assembly line',
2638
+ 'athletic field outdoor',
2639
+ 'atrium public',
2640
+ 'attic',
2641
+ 'auditorium',
2642
+ 'auto factory',
2643
+ 'badlands',
2644
+ 'badminton court indoor',
2645
+ 'baggage claim',
2646
+ 'bakery shop',
2647
+ 'balcony exterior',
2648
+ 'balcony interior',
2649
+ 'ball pit',
2650
+ 'ballroom',
2651
+ 'bamboo forest',
2652
+ 'banquet hall',
2653
+ 'bar',
2654
+ 'barn',
2655
+ 'barndoor',
2656
+ 'baseball field',
2657
+ 'basement',
2658
+ 'basilica',
2659
+ 'basketball court outdoor',
2660
+ 'bathroom',
2661
+ 'batters box',
2662
+ 'bayou',
2663
+ 'bazaar indoor',
2664
+ 'bazaar outdoor',
2665
+ 'beach',
2666
+ 'beauty salon',
2667
+ 'bedroom',
2668
+ 'berth',
2669
+ 'biology laboratory',
2670
+ 'bistro indoor',
2671
+ 'boardwalk',
2672
+ 'boat deck',
2673
+ 'boathouse',
2674
+ 'bookstore',
2675
+ 'booth indoor',
2676
+ 'botanical garden',
2677
+ 'bow window indoor',
2678
+ 'bow window outdoor',
2679
+ 'bowling alley',
2680
+ 'boxing ring',
2681
+ 'brewery indoor',
2682
+ 'bridge',
2683
+ 'building facade',
2684
+ 'bullring',
2685
+ 'burial chamber',
2686
+ 'bus interior',
2687
+ 'butchers shop',
2688
+ 'butte',
2689
+ 'cabin outdoor',
2690
+ 'cafeteria',
2691
+ 'campsite',
2692
+ 'campus',
2693
+ 'canal natural',
2694
+ 'canal urban',
2695
+ 'candy store',
2696
+ 'canyon',
2697
+ 'car interior backseat',
2698
+ 'car interior frontseat',
2699
+ 'carrousel',
2700
+ 'casino indoor',
2701
+ 'castle',
2702
+ 'catacomb',
2703
+ 'cathedral indoor',
2704
+ 'cathedral outdoor',
2705
+ 'cavern indoor',
2706
+ 'cemetery',
2707
+ 'chalet',
2708
+ 'cheese factory',
2709
+ 'chemistry lab',
2710
+ 'chicken coop indoor',
2711
+ 'chicken coop outdoor',
2712
+ 'childs room',
2713
+ 'church indoor',
2714
+ 'church outdoor',
2715
+ 'classroom',
2716
+ 'clean room',
2717
+ 'cliff',
2718
+ 'cloister indoor',
2719
+ 'closet',
2720
+ 'clothing store',
2721
+ 'coast',
2722
+ 'cockpit',
2723
+ 'coffee shop',
2724
+ 'computer room',
2725
+ 'conference center',
2726
+ 'conference room',
2727
+ 'construction site',
2728
+ 'control room',
2729
+ 'control tower outdoor',
2730
+ 'corn field',
2731
+ 'corral',
2732
+ 'corridor',
2733
+ 'cottage garden',
2734
+ 'courthouse',
2735
+ 'courtroom',
2736
+ 'courtyard',
2737
+ 'covered bridge exterior',
2738
+ 'creek',
2739
+ 'crevasse',
2740
+ 'crosswalk',
2741
+ 'cubicle office',
2742
+ 'dam',
2743
+ 'delicatessen',
2744
+ 'dentists office',
2745
+ 'desert sand',
2746
+ 'desert vegetation',
2747
+ 'diner indoor',
2748
+ 'diner outdoor',
2749
+ 'dinette home',
2750
+ 'dinette vehicle',
2751
+ 'dining car',
2752
+ 'dining room',
2753
+ 'discotheque',
2754
+ 'dock',
2755
+ 'doorway outdoor',
2756
+ 'dorm room',
2757
+ 'driveway',
2758
+ 'driving range outdoor',
2759
+ 'drugstore',
2760
+ 'electrical substation',
2761
+ 'elevator door',
2762
+ 'elevator interior',
2763
+ 'elevator shaft',
2764
+ 'engine room',
2765
+ 'escalator indoor',
2766
+ 'excavation',
2767
+ 'factory indoor',
2768
+ 'fairway',
2769
+ 'fastfood restaurant',
2770
+ 'field cultivated',
2771
+ 'field wild',
2772
+ 'fire escape',
2773
+ 'fire station',
2774
+ 'firing range indoor',
2775
+ 'fishpond',
2776
+ 'florist shop indoor',
2777
+ 'food court',
2778
+ 'forest broadleaf',
2779
+ 'forest needleleaf',
2780
+ 'forest path',
2781
+ 'forest road',
2782
+ 'formal garden',
2783
+ 'fountain',
2784
+ 'galley',
2785
+ 'game room',
2786
+ 'garage indoor',
2787
+ 'garbage dump',
2788
+ 'gas station',
2789
+ 'gazebo exterior',
2790
+ 'general store indoor',
2791
+ 'general store outdoor',
2792
+ 'gift shop',
2793
+ 'golf course',
2794
+ 'greenhouse indoor',
2795
+ 'greenhouse outdoor',
2796
+ 'gymnasium indoor',
2797
+ 'hangar indoor',
2798
+ 'hangar outdoor',
2799
+ 'harbor',
2800
+ 'hayfield',
2801
+ 'heliport',
2802
+ 'herb garden',
2803
+ 'highway',
2804
+ 'hill',
2805
+ 'home office',
2806
+ 'hospital',
2807
+ 'hospital room',
2808
+ 'hot spring',
2809
+ 'hot tub outdoor',
2810
+ 'hotel outdoor',
2811
+ 'hotel room',
2812
+ 'house',
2813
+ 'hunting lodge outdoor',
2814
+ 'ice cream parlor',
2815
+ 'ice floe',
2816
+ 'ice shelf',
2817
+ 'ice skating rink indoor',
2818
+ 'ice skating rink outdoor',
2819
+ 'iceberg',
2820
+ 'igloo',
2821
+ 'industrial area',
2822
+ 'inn outdoor',
2823
+ 'islet',
2824
+ 'jacuzzi indoor',
2825
+ 'jail cell',
2826
+ 'jail indoor',
2827
+ 'jewelry shop',
2828
+ 'kasbah',
2829
+ 'kennel indoor',
2830
+ 'kennel outdoor',
2831
+ 'kindergarden classroom',
2832
+ 'kitchen',
2833
+ 'kitchenette',
2834
+ 'labyrinth outdoor',
2835
+ 'lake natural',
2836
+ 'landfill',
2837
+ 'landing deck',
2838
+ 'laundromat',
2839
+ 'lecture room',
2840
+ 'library indoor',
2841
+ 'library outdoor',
2842
+ 'lido deck outdoor',
2843
+ 'lift bridge',
2844
+ 'lighthouse',
2845
+ 'limousine interior',
2846
+ 'living room',
2847
+ 'lobby',
2848
+ 'lock chamber',
2849
+ 'locker room',
2850
+ 'mansion',
2851
+ 'manufactured home',
2852
+ 'market indoor',
2853
+ 'market outdoor',
2854
+ 'marsh',
2855
+ 'martial arts gym',
2856
+ 'mausoleum',
2857
+ 'medina',
2858
+ 'moat water',
2859
+ 'monastery outdoor',
2860
+ 'mosque indoor',
2861
+ 'mosque outdoor',
2862
+ 'motel',
2863
+ 'mountain',
2864
+ 'mountain snowy',
2865
+ 'movie theater indoor',
2866
+ 'museum indoor',
2867
+ 'music store',
2868
+ 'music studio',
2869
+ 'nuclear power plant outdoor',
2870
+ 'nursery',
2871
+ 'oast house',
2872
+ 'observatory outdoor',
2873
+ 'ocean',
2874
+ 'office',
2875
+ 'office building',
2876
+ 'oil refinery outdoor',
2877
+ 'oilrig',
2878
+ 'operating room',
2879
+ 'orchard',
2880
+ 'outhouse outdoor',
2881
+ 'pagoda',
2882
+ 'palace',
2883
+ 'pantry',
2884
+ 'park',
2885
+ 'parking garage indoor',
2886
+ 'parking garage outdoor',
2887
+ 'parking lot',
2888
+ 'parlor',
2889
+ 'pasture',
2890
+ 'patio',
2891
+ 'pavilion',
2892
+ 'pharmacy',
2893
+ 'phone booth',
2894
+ 'physics laboratory',
2895
+ 'picnic area',
2896
+ 'pilothouse indoor',
2897
+ 'planetarium outdoor',
2898
+ 'playground',
2899
+ 'playroom',
2900
+ 'plaza',
2901
+ 'podium indoor',
2902
+ 'podium outdoor',
2903
+ 'pond',
2904
+ 'poolroom establishment',
2905
+ 'poolroom home',
2906
+ 'power plant outdoor',
2907
+ 'promenade deck',
2908
+ 'pub indoor',
2909
+ 'pulpit',
2910
+ 'putting green',
2911
+ 'racecourse',
2912
+ 'raceway',
2913
+ 'raft',
2914
+ 'railroad track',
2915
+ 'rainforest',
2916
+ 'reception',
2917
+ 'recreation room',
2918
+ 'residential neighborhood',
2919
+ 'restaurant',
2920
+ 'restaurant kitchen',
2921
+ 'restaurant patio',
2922
+ 'rice paddy',
2923
+ 'riding arena',
2924
+ 'river',
2925
+ 'rock arch',
2926
+ 'rope bridge',
2927
+ 'ruin',
2928
+ 'runway',
2929
+ 'sandbar',
2930
+ 'sandbox',
2931
+ 'sauna',
2932
+ 'schoolhouse',
2933
+ 'sea cliff',
2934
+ 'server room',
2935
+ 'shed',
2936
+ 'shoe shop',
2937
+ 'shopfront',
2938
+ 'shopping mall indoor',
2939
+ 'shower',
2940
+ 'skatepark',
2941
+ 'ski lodge',
2942
+ 'ski resort',
2943
+ 'ski slope',
2944
+ 'sky',
2945
+ 'skyscraper',
2946
+ 'slum',
2947
+ 'snowfield',
2948
+ 'squash court',
2949
+ 'stable',
2950
+ 'stadium baseball',
2951
+ 'stadium football',
2952
+ 'stage indoor',
2953
+ 'staircase',
2954
+ 'street',
2955
+ 'subway interior',
2956
+ 'subway station platform',
2957
+ 'supermarket',
2958
+ 'sushi bar',
2959
+ 'swamp',
2960
+ 'swimming pool indoor',
2961
+ 'swimming pool outdoor',
2962
+ 'synagogue indoor',
2963
+ 'synagogue outdoor',
2964
+ 'television studio',
2965
+ 'temple east asia',
2966
+ 'temple south asia',
2967
+ 'tennis court indoor',
2968
+ 'tennis court outdoor',
2969
+ 'tent outdoor',
2970
+ 'theater indoor procenium',
2971
+ 'theater indoor seats',
2972
+ 'thriftshop',
2973
+ 'throne room',
2974
+ 'ticket booth',
2975
+ 'toll plaza',
2976
+ 'topiary garden',
2977
+ 'tower',
2978
+ 'toyshop',
2979
+ 'track outdoor',
2980
+ 'train railway',
2981
+ 'train station platform',
2982
+ 'tree farm',
2983
+ 'tree house',
2984
+ 'trench',
2985
+ 'underwater coral reef',
2986
+ 'utility room',
2987
+ 'valley',
2988
+ 'van interior',
2989
+ 'vegetable garden',
2990
+ 'veranda',
2991
+ 'veterinarians office',
2992
+ 'viaduct',
2993
+ 'videostore',
2994
+ 'village',
2995
+ 'vineyard',
2996
+ 'volcano',
2997
+ 'volleyball court indoor',
2998
+ 'volleyball court outdoor',
2999
+ 'waiting room',
3000
+ 'warehouse indoor',
3001
+ 'water tower',
3002
+ 'waterfall block',
3003
+ 'waterfall fan',
3004
+ 'waterfall plunge',
3005
+ 'watering hole',
3006
+ 'wave',
3007
+ 'wet bar',
3008
+ 'wheat field',
3009
+ 'wind farm',
3010
+ 'windmill',
3011
+ 'wine cellar barrel storage',
3012
+ 'wine cellar bottle storage',
3013
+ 'wrestling ring indoor',
3014
+ 'yard',
3015
+ 'youth hostel',
3016
+ ]
3017
+
3018
+ templates = [
3019
+ 'a photo of a {}.',
3020
+ 'a photo of the {}.',
3021
+ ]
3022
+ ```
3023
+
3024
+
3025
+
3026
+ ## StanfordCars
3027
+
3028
+ ```bash
3029
+ classes = [
3030
+ 'AM General Hummer SUV 2000',
3031
+ 'Acura RL Sedan 2012',
3032
+ 'Acura TL Sedan 2012',
3033
+ 'Acura TL Type-S 2008',
3034
+ 'Acura TSX Sedan 2012',
3035
+ 'Acura Integra Type R 2001',
3036
+ 'Acura ZDX Hatchback 2012',
3037
+ 'Aston Martin V8 Vantage Convertible 2012',
3038
+ 'Aston Martin V8 Vantage Coupe 2012',
3039
+ 'Aston Martin Virage Convertible 2012',
3040
+ 'Aston Martin Virage Coupe 2012',
3041
+ 'Audi RS 4 Convertible 2008',
3042
+ 'Audi A5 Coupe 2012',
3043
+ 'Audi TTS Coupe 2012',
3044
+ 'Audi R8 Coupe 2012',
3045
+ 'Audi V8 Sedan 1994',
3046
+ 'Audi 100 Sedan 1994',
3047
+ 'Audi 100 Wagon 1994',
3048
+ 'Audi TT Hatchback 2011',
3049
+ 'Audi S6 Sedan 2011',
3050
+ 'Audi S5 Convertible 2012',
3051
+ 'Audi S5 Coupe 2012',
3052
+ 'Audi S4 Sedan 2012',
3053
+ 'Audi S4 Sedan 2007',
3054
+ 'Audi TT RS Coupe 2012',
3055
+ 'BMW ActiveHybrid 5 Sedan 2012',
3056
+ 'BMW 1 Series Convertible 2012',
3057
+ 'BMW 1 Series Coupe 2012',
3058
+ 'BMW 3 Series Sedan 2012',
3059
+ 'BMW 3 Series Wagon 2012',
3060
+ 'BMW 6 Series Convertible 2007',
3061
+ 'BMW X5 SUV 2007',
3062
+ 'BMW X6 SUV 2012',
3063
+ 'BMW M3 Coupe 2012',
3064
+ 'BMW M5 Sedan 2010',
3065
+ 'BMW M6 Convertible 2010',
3066
+ 'BMW X3 SUV 2012',
3067
+ 'BMW Z4 Convertible 2012',
3068
+ 'Bentley Continental Supersports Conv. Convertible 2012',
3069
+ 'Bentley Arnage Sedan 2009',
3070
+ 'Bentley Mulsanne Sedan 2011',
3071
+ 'Bentley Continental GT Coupe 2012',
3072
+ 'Bentley Continental GT Coupe 2007',
3073
+ 'Bentley Continental Flying Spur Sedan 2007',
3074
+ 'Bugatti Veyron 16.4 Convertible 2009',
3075
+ 'Bugatti Veyron 16.4 Coupe 2009',
3076
+ 'Buick Regal GS 2012',
3077
+ 'Buick Rainier SUV 2007',
3078
+ 'Buick Verano Sedan 2012',
3079
+ 'Buick Enclave SUV 2012',
3080
+ 'Cadillac CTS-V Sedan 2012',
3081
+ 'Cadillac SRX SUV 2012',
3082
+ 'Cadillac Escalade EXT Crew Cab 2007',
3083
+ 'Chevrolet Silverado 1500 Hybrid Crew Cab 2012',
3084
+ 'Chevrolet Corvette Convertible 2012',
3085
+ 'Chevrolet Corvette ZR1 2012',
3086
+ 'Chevrolet Corvette Ron Fellows Edition Z06 2007',
3087
+ 'Chevrolet Traverse SUV 2012',
3088
+ 'Chevrolet Camaro Convertible 2012',
3089
+ 'Chevrolet HHR SS 2010',
3090
+ 'Chevrolet Impala Sedan 2007',
3091
+ 'Chevrolet Tahoe Hybrid SUV 2012',
3092
+ 'Chevrolet Sonic Sedan 2012',
3093
+ 'Chevrolet Express Cargo Van 2007',
3094
+ 'Chevrolet Avalanche Crew Cab 2012',
3095
+ 'Chevrolet Cobalt SS 2010',
3096
+ 'Chevrolet Malibu Hybrid Sedan 2010',
3097
+ 'Chevrolet TrailBlazer SS 2009',
3098
+ 'Chevrolet Silverado 2500HD Regular Cab 2012',
3099
+ 'Chevrolet Silverado 1500 Classic Extended Cab 2007',
3100
+ 'Chevrolet Express Van 2007',
3101
+ 'Chevrolet Monte Carlo Coupe 2007',
3102
+ 'Chevrolet Malibu Sedan 2007',
3103
+ 'Chevrolet Silverado 1500 Extended Cab 2012',
3104
+ 'Chevrolet Silverado 1500 Regular Cab 2012',
3105
+ 'Chrysler Aspen SUV 2009',
3106
+ 'Chrysler Sebring Convertible 2010',
3107
+ 'Chrysler Town and Country Minivan 2012',
3108
+ 'Chrysler 300 SRT-8 2010',
3109
+ 'Chrysler Crossfire Convertible 2008',
3110
+ 'Chrysler PT Cruiser Convertible 2008',
3111
+ 'Daewoo Nubira Wagon 2002',
3112
+ 'Dodge Caliber Wagon 2012',
3113
+ 'Dodge Caliber Wagon 2007',
3114
+ 'Dodge Caravan Minivan 1997',
3115
+ 'Dodge Ram Pickup 3500 Crew Cab 2010',
3116
+ 'Dodge Ram Pickup 3500 Quad Cab 2009',
3117
+ 'Dodge Sprinter Cargo Van 2009',
3118
+ 'Dodge Journey SUV 2012',
3119
+ 'Dodge Dakota Crew Cab 2010',
3120
+ 'Dodge Dakota Club Cab 2007',
3121
+ 'Dodge Magnum Wagon 2008',
3122
+ 'Dodge Challenger SRT8 2011',
3123
+ 'Dodge Durango SUV 2012',
3124
+ 'Dodge Durango SUV 2007',
3125
+ 'Dodge Charger Sedan 2012',
3126
+ 'Dodge Charger SRT-8 2009',
3127
+ 'Eagle Talon Hatchback 1998',
3128
+ 'FIAT 500 Abarth 2012',
3129
+ 'FIAT 500 Convertible 2012',
3130
+ 'Ferrari FF Coupe 2012',
3131
+ 'Ferrari California Convertible 2012',
3132
+ 'Ferrari 458 Italia Convertible 2012',
3133
+ 'Ferrari 458 Italia Coupe 2012',
3134
+ 'Fisker Karma Sedan 2012',
3135
+ 'Ford F-450 Super Duty Crew Cab 2012',
3136
+ 'Ford Mustang Convertible 2007',
3137
+ 'Ford Freestar Minivan 2007',
3138
+ 'Ford Expedition EL SUV 2009',
3139
+ 'Ford Edge SUV 2012',
3140
+ 'Ford Ranger SuperCab 2011',
3141
+ 'Ford GT Coupe 2006',
3142
+ 'Ford F-150 Regular Cab 2012',
3143
+ 'Ford F-150 Regular Cab 2007',
3144
+ 'Ford Focus Sedan 2007',
3145
+ 'Ford E-Series Wagon Van 2012',
3146
+ 'Ford Fiesta Sedan 2012',
3147
+ 'GMC Terrain SUV 2012',
3148
+ 'GMC Savana Van 2012',
3149
+ 'GMC Yukon Hybrid SUV 2012',
3150
+ 'GMC Acadia SUV 2012',
3151
+ 'GMC Canyon Extended Cab 2012',
3152
+ 'Geo Metro Convertible 1993',
3153
+ 'HUMMER H3T Crew Cab 2010',
3154
+ 'HUMMER H2 SUT Crew Cab 2009',
3155
+ 'Honda Odyssey Minivan 2012',
3156
+ 'Honda Odyssey Minivan 2007',
3157
+ 'Honda Accord Coupe 2012',
3158
+ 'Honda Accord Sedan 2012',
3159
+ 'Hyundai Veloster Hatchback 2012',
3160
+ 'Hyundai Santa Fe SUV 2012',
3161
+ 'Hyundai Tucson SUV 2012',
3162
+ 'Hyundai Veracruz SUV 2012',
3163
+ 'Hyundai Sonata Hybrid Sedan 2012',
3164
+ 'Hyundai Elantra Sedan 2007',
3165
+ 'Hyundai Accent Sedan 2012',
3166
+ 'Hyundai Genesis Sedan 2012',
3167
+ 'Hyundai Sonata Sedan 2012',
3168
+ 'Hyundai Elantra Touring Hatchback 2012',
3169
+ 'Hyundai Azera Sedan 2012',
3170
+ 'Infiniti G Coupe IPL 2012',
3171
+ 'Infiniti QX56 SUV 2011',
3172
+ 'Isuzu Ascender SUV 2008',
3173
+ 'Jaguar XK XKR 2012',
3174
+ 'Jeep Patriot SUV 2012',
3175
+ 'Jeep Wrangler SUV 2012',
3176
+ 'Jeep Liberty SUV 2012',
3177
+ 'Jeep Grand Cherokee SUV 2012',
3178
+ 'Jeep Compass SUV 2012',
3179
+ 'Lamborghini Reventon Coupe 2008',
3180
+ 'Lamborghini Aventador Coupe 2012',
3181
+ 'Lamborghini Gallardo LP 570-4 Superleggera 2012',
3182
+ 'Lamborghini Diablo Coupe 2001',
3183
+ 'Land Rover Range Rover SUV 2012',
3184
+ 'Land Rover LR2 SUV 2012',
3185
+ 'Lincoln Town Car Sedan 2011',
3186
+ 'MINI Cooper Roadster Convertible 2012',
3187
+ 'Maybach Landaulet Convertible 2012',
3188
+ 'Mazda Tribute SUV 2011',
3189
+ 'McLaren MP4-12C Coupe 2012',
3190
+ 'Mercedes-Benz 300-Class Convertible 1993',
3191
+ 'Mercedes-Benz C-Class Sedan 2012',
3192
+ 'Mercedes-Benz SL-Class Coupe 2009',
3193
+ 'Mercedes-Benz E-Class Sedan 2012',
3194
+ 'Mercedes-Benz S-Class Sedan 2012',
3195
+ 'Mercedes-Benz Sprinter Van 2012',
3196
+ 'Mitsubishi Lancer Sedan 2012',
3197
+ 'Nissan Leaf Hatchback 2012',
3198
+ 'Nissan NV Passenger Van 2012',
3199
+ 'Nissan Juke Hatchback 2012',
3200
+ 'Nissan 240SX Coupe 1998',
3201
+ 'Plymouth Neon Coupe 1999',
3202
+ 'Porsche Panamera Sedan 2012',
3203
+ 'Ram C/V Cargo Van Minivan 2012',
3204
+ 'Rolls-Royce Phantom Drophead Coupe Convertible 2012',
3205
+ 'Rolls-Royce Ghost Sedan 2012',
3206
+ 'Rolls-Royce Phantom Sedan 2012',
3207
+ 'Scion xD Hatchback 2012',
3208
+ 'Spyker C8 Convertible 2009',
3209
+ 'Spyker C8 Coupe 2009',
3210
+ 'Suzuki Aerio Sedan 2007',
3211
+ 'Suzuki Kizashi Sedan 2012',
3212
+ 'Suzuki SX4 Hatchback 2012',
3213
+ 'Suzuki SX4 Sedan 2012',
3214
+ 'Tesla Model S Sedan 2012',
3215
+ 'Toyota Sequoia SUV 2012',
3216
+ 'Toyota Camry Sedan 2012',
3217
+ 'Toyota Corolla Sedan 2012',
3218
+ 'Toyota 4Runner SUV 2012',
3219
+ 'Volkswagen Golf Hatchback 2012',
3220
+ 'Volkswagen Golf Hatchback 1991',
3221
+ 'Volkswagen Beetle Hatchback 2012',
3222
+ 'Volvo C30 Hatchback 2012',
3223
+ 'Volvo 240 Sedan 1993',
3224
+ 'Volvo XC90 SUV 2007',
3225
+ 'smart fortwo Convertible 2012',
3226
+ ]
3227
+
3228
+ templates = [
3229
+ 'a photo of a {}.',
3230
+ 'a photo of the {}.',
3231
+ 'a photo of my {}.',
3232
+ 'i love my {}!',
3233
+ 'a photo of my dirty {}.',
3234
+ 'a photo of my clean {}.',
3235
+ 'a photo of my new {}.',
3236
+ 'a photo of my old {}.',
3237
+ ]
3238
+ ```
3239
+
3240
+
3241
+
3242
+ ## UCF101
3243
+
3244
+ ```bash
3245
+ classes = [
3246
+ 'Apply Eye Makeup',
3247
+ 'Apply Lipstick',
3248
+ 'Archery',
3249
+ 'Baby Crawling',
3250
+ 'Balance Beam',
3251
+ 'Band Marching',
3252
+ 'Baseball Pitch',
3253
+ 'Basketball',
3254
+ 'Basketball Dunk',
3255
+ 'Bench Press',
3256
+ 'Biking',
3257
+ 'Billiards',
3258
+ 'Blow Dry Hair',
3259
+ 'Blowing Candles',
3260
+ 'Body Weight Squats',
3261
+ 'Bowling',
3262
+ 'Boxing Punching Bag',
3263
+ 'Boxing Speed Bag',
3264
+ 'Breast Stroke',
3265
+ 'Brushing Teeth',
3266
+ 'Clean And Jerk',
3267
+ 'Cliff Diving',
3268
+ 'Cricket Bowling',
3269
+ 'Cricket Shot',
3270
+ 'Cutting In Kitchen',
3271
+ 'Diving',
3272
+ 'Drumming',
3273
+ 'Fencing',
3274
+ 'Field Hockey Penalty',
3275
+ 'Floor Gymnastics',
3276
+ 'Frisbee Catch',
3277
+ 'Front Crawl',
3278
+ 'Golf Swing',
3279
+ 'Haircut',
3280
+ 'Hammer Throw',
3281
+ 'Hammering',
3282
+ 'Hand Stand Pushups',
3283
+ 'Handstand Walking',
3284
+ 'Head Massage',
3285
+ 'High Jump',
3286
+ 'Horse Race',
3287
+ 'Horse Riding',
3288
+ 'Hula Hoop',
3289
+ 'Ice Dancing',
3290
+ 'Javelin Throw',
3291
+ 'Juggling Balls',
3292
+ 'Jump Rope',
3293
+ 'Jumping Jack',
3294
+ 'Kayaking',
3295
+ 'Knitting',
3296
+ 'Long Jump',
3297
+ 'Lunges',
3298
+ 'Military Parade',
3299
+ 'Mixing',
3300
+ 'Mopping Floor',
3301
+ 'Nunchucks',
3302
+ 'Parallel Bars',
3303
+ 'Pizza Tossing',
3304
+ 'Playing Cello',
3305
+ 'Playing Daf',
3306
+ 'Playing Dhol',
3307
+ 'Playing Flute',
3308
+ 'Playing Guitar',
3309
+ 'Playing Piano',
3310
+ 'Playing Sitar',
3311
+ 'Playing Tabla',
3312
+ 'Playing Violin',
3313
+ 'Pole Vault',
3314
+ 'Pommel Horse',
3315
+ 'Pull Ups',
3316
+ 'Punch',
3317
+ 'Push Ups',
3318
+ 'Rafting',
3319
+ 'Rock Climbing Indoor',
3320
+ 'Rope Climbing',
3321
+ 'Rowing',
3322
+ 'Salsa Spin',
3323
+ 'Shaving Beard',
3324
+ 'Shotput',
3325
+ 'Skate Boarding',
3326
+ 'Skiing',
3327
+ 'Skijet',
3328
+ 'Sky Diving',
3329
+ 'Soccer Juggling',
3330
+ 'Soccer Penalty',
3331
+ 'Still Rings',
3332
+ 'Sumo Wrestling',
3333
+ 'Surfing',
3334
+ 'Swing',
3335
+ 'Table Tennis Shot',
3336
+ 'Tai Chi',
3337
+ 'Tennis Swing',
3338
+ 'Throw Discus',
3339
+ 'Trampoline Jumping',
3340
+ 'Typing',
3341
+ 'Uneven Bars',
3342
+ 'Volleyball Spiking',
3343
+ 'Walking With Dog',
3344
+ 'Wall Pushups',
3345
+ 'Writing On Board',
3346
+ 'Yo Yo',
3347
+ ]
3348
+
3349
+ templates = [
3350
+ 'a photo of a person {}.',
3351
+ 'a video of a person {}.',
3352
+ 'a example of a person {}.',
3353
+ 'a demonstration of a person {}.',
3354
+ 'a photo of the person {}.',
3355
+ 'a video of the person {}.',
3356
+ 'a example of the person {}.',
3357
+ 'a demonstration of the person {}.',
3358
+ 'a photo of a person using {}.',
3359
+ 'a video of a person using {}.',
3360
+ 'a example of a person using {}.',
3361
+ 'a demonstration of a person using {}.',
3362
+ 'a photo of the person using {}.',
3363
+ 'a video of the person using {}.',
3364
+ 'a example of the person using {}.',
3365
+ 'a demonstration of the person using {}.',
3366
+ 'a photo of a person doing {}.',
3367
+ 'a video of a person doing {}.',
3368
+ 'a example of a person doing {}.',
3369
+ 'a demonstration of a person doing {}.',
3370
+ 'a photo of the person doing {}.',
3371
+ 'a video of the person doing {}.',
3372
+ 'a example of the person doing {}.',
3373
+ 'a demonstration of the person doing {}.',
3374
+ 'a photo of a person during {}.',
3375
+ 'a video of a person during {}.',
3376
+ 'a example of a person during {}.',
3377
+ 'a demonstration of a person during {}.',
3378
+ 'a photo of the person during {}.',
3379
+ 'a video of the person during {}.',
3380
+ 'a example of the person during {}.',
3381
+ 'a demonstration of the person during {}.',
3382
+ 'a photo of a person performing {}.',
3383
+ 'a video of a person performing {}.',
3384
+ 'a example of a person performing {}.',
3385
+ 'a demonstration of a person performing {}.',
3386
+ 'a photo of the person performing {}.',
3387
+ 'a video of the person performing {}.',
3388
+ 'a example of the person performing {}.',
3389
+ 'a demonstration of the person performing {}.',
3390
+ 'a photo of a person practicing {}.',
3391
+ 'a video of a person practicing {}.',
3392
+ 'a example of a person practicing {}.',
3393
+ 'a demonstration of a person practicing {}.',
3394
+ 'a photo of the person practicing {}.',
3395
+ 'a video of the person practicing {}.',
3396
+ 'a example of the person practicing {}.',
3397
+ 'a demonstration of the person practicing {}.',
3398
+ ]
3399
+ ```
3400
+
3401
+
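The `classes` and `templates` lists in this file are meant to be combined: each class name is substituted into every template to produce the text prompts for that class. The sketch below shows only that expansion step. It assumes plain `str.format` substitution, the helper name `build_prompts` is hypothetical, and the two lists are abbreviated copies of the Stanford Cars entries above.

```python
# Hypothetical helper: expand every class name through every prompt template.
def build_prompts(classes, templates):
    return {name: [t.format(name) for t in templates] for name in classes}


# Abbreviated copies of the Stanford Cars lists shown above.
classes = ['BMW M3 Coupe 2012', 'Tesla Model S Sedan 2012']
templates = ['a photo of a {}.', 'i love my {}!']

prompts = build_prompts(classes, templates)
for name, texts in prompts.items():
    print(name, '->', texts)
```

In a zero-shot evaluation the resulting strings would then be tokenized and encoded by the text encoder; that step is not shown here.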
CLIP/data/rendered-sst2.md ADDED
@@ -0,0 +1,11 @@
1
+ # The Rendered SST2 Dataset
2
+
3
+ In the paper, we used an image classification dataset called Rendered SST2 to evaluate the model's capability on optical character recognition. To do so, we rendered the sentences in the [Stanford Sentiment Treebank v2](https://nlp.stanford.edu/sentiment/treebank.html) dataset and used those as the input to the CLIP image encoder.
4
+
5
+ The following commands will download a 131MB archive containing the images and extract it into a subdirectory `rendered-sst2`:
6
+
7
+ ```bash
8
+ wget https://openaipublic.azureedge.net/clip/data/rendered-sst2.tgz
9
+ tar zxvf rendered-sst2.tgz
10
+ ```
11
+
CLIP/data/yfcc100m.md ADDED
@@ -0,0 +1,14 @@
1
+ # The YFCC100M Subset
2
+
3
+ In the paper, we performed a dataset ablation using a subset of the YFCC100M dataset and showed that the performance remained largely similar.
4
+
5
+ The subset contains 14,829,396 images, about 15% of the full dataset, which have been filtered to only keep those with natural language titles and/or descriptions in English.
6
+
7
+ We provide the list of (line number, photo identifier, photo hash) of each image contained in this subset. These correspond to the first three columns in the dataset's metadata TSV file.
8
+
9
+ ```bash
10
+ wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
11
+ bunzip2 yfcc100m_subset_data.tsv.bz2
12
+ ```
13
+
14
+ Use of the underlying media files is subject to the Creative Commons licenses chosen by their creators/uploaders. For more information about the YFCC100M dataset, visit [the official website](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/).
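Once decompressed, the subset list can be used to filter the full YFCC100M metadata. The sketch below collects the photo identifiers into a set. It assumes the file is tab-separated with the three columns in the order stated above (line number, photo identifier, photo hash) and has no header row; the helper name `load_subset_ids` is hypothetical and the sample rows are made up for illustration.

```python
import csv
import io


def load_subset_ids(tsv_file):
    """Return the set of photo identifiers (second column) in the subset file."""
    reader = csv.reader(tsv_file, delimiter='\t')
    return {row[1] for row in reader if len(row) >= 3}


# Made-up rows standing in for the contents of yfcc100m_subset_data.tsv.
sample = io.StringIO("0\t1001\tabc123\n7\t1002\tdef456\n")
subset_ids = load_subset_ids(sample)
print(sorted(subset_ids))
```

With the real file, the same helper could be called on `open('yfcc100m_subset_data.tsv', newline='')`.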
CLIP/hubconf.py ADDED
@@ -0,0 +1,42 @@
+ from clip.clip import tokenize as _tokenize, load as _load, available_models as _available_models
+ import re
+ import string
+
+ dependencies = ["torch", "torchvision", "ftfy", "regex", "tqdm"]
+
+ # For compatibility (cannot include special characters in function name)
+ model_functions = { model: re.sub(f'[{string.punctuation}]', '_', model) for model in _available_models()}
+
+ def _create_hub_entrypoint(model):
+     def entrypoint(**kwargs):
+         return _load(model, **kwargs)
+
+     entrypoint.__doc__ = f"""Loads the {model} CLIP model
+
+         Parameters
+         ----------
+         device : Union[str, torch.device]
+             The device to put the loaded model
+
+         jit : bool
+             Whether to load the optimized JIT model or more hackable non-JIT model (default).
+
+         download_root: str
+             path to download the model files; by default, it uses "~/.cache/clip"
+
+         Returns
+         -------
+         model : torch.nn.Module
+             The {model} CLIP model
+
+         preprocess : Callable[[PIL.Image], torch.Tensor]
+             A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
+         """
+     return entrypoint
+
+ def tokenize():
+     return _tokenize
+
+ _entrypoints = {model_functions[model]: _create_hub_entrypoint(model) for model in _available_models()}
+
+ globals().update(_entrypoints)
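The `model_functions` mapping in `hubconf.py` exists because model names such as `ViT-B/32` are not valid Python identifiers. The sketch below applies the same `re.sub` call in isolation so the resulting entrypoint names are visible. The wrapper name `hub_name` is hypothetical; the model names are taken from the model card in this commit.

```python
import re
import string


def hub_name(model):
    # Same substitution as in hubconf.py: every punctuation character becomes "_".
    return re.sub(f'[{string.punctuation}]', '_', model)


for model in ['RN50', 'ViT-B/32', 'ViT-L/14@336px']:
    print(model, '->', hub_name(model))
```

Names without punctuation pass through unchanged, so only the ViT variants are renamed.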
CLIP/model-card.md ADDED
@@ -0,0 +1,120 @@
1
+ # Model Card: CLIP
2
+
3
+ Inspired by [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993) and [Lessons from Archives (Jo & Gebru)](https://arxiv.org/pdf/1912.10389.pdf), we’re providing some accompanying information about the multimodal model.
4
+
5
+ ## Model Details
6
+
7
+ The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within.
8
+
9
+ ### Model Date
10
+
11
+ January 2021
12
+
13
+ ### Model Type
14
+
15
+ The base model uses a ResNet50 with several modifications as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.
16
+
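To make the contrastive objective concrete, here is a dependency-free sketch of one common form: cosine similarities between every image and every text in a batch, scaled by a temperature, with a cross-entropy loss applied in both directions so that the matching pair in each row and column scores highest. This is an illustration under stated assumptions, not the training code: the fixed `temperature=0.07`, the function names, and the toy two-dimensional embeddings are all chosen for the example.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in texts] for img in images]
    # Image-to-text direction: row i should pick column i.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: column j should pick row j.
    loss_texts = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                     for j in range(n)) / n
    return (loss_images + loss_texts) / 2


# Toy embeddings (assumed): texts aligned with their images vs. swapped.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(matched < swapped)
```

The loss is small when each image is most similar to its own caption and grows when the pairing is wrong, which is the behavior the encoders are trained toward.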
17
+ ### Model Versions
18
+
19
+ Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.
20
+
21
+ As part of the staged release process, we have also released the RN101 model, as well as RN50x4, a RN50 scaled up 4x according to the [EfficientNet](https://arxiv.org/abs/1905.11946) scaling rule. In July 2021, we additionally released the RN50x16 and ViT-B/16 models, and in January 2022, the RN50x64 and ViT-L/14 models were released. Lastly, the ViT-L/14@336px model was released in April 2022.
22
+
23
+ Please see the paper linked below for further details about their specification.
24
+
25
+ ### Documents
26
+
27
+ - [Blog Post](https://openai.com/blog/clip/)
28
+ - [CLIP Paper](https://arxiv.org/abs/2103.00020)
29
+
30
+
31
+
32
+ ## Model Use
33
+
34
+ ### Intended Use
35
+
36
+ The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.
37
+
38
+ #### Primary intended uses
39
+
40
+ The primary intended users of these models are AI researchers.
41
+
42
+ We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
43
+
44
+ ### Out-of-Scope Use Cases
45
+
46
+ **Any** deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful.
47
+
48
+ Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use.
49
+
50
+ Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.
51
+
52
+
53
+
54
+ ## Data
55
+
56
+ The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). A large portion of the data comes from our crawling of the internet. This means that the data is more representative of the people and societies most connected to the internet, which tend to skew towards more developed nations and younger, male users.
57
+
58
+ ### Data Mission Statement
59
+
60
+ Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset.
61
+
62
+
63
+
64
+ ## Performance and Limitations
65
+
66
+ ### Performance
67
+
68
+ We have evaluated the performance of CLIP on a wide range of benchmarks across a variety of computer vision datasets, with tasks ranging from OCR to texture recognition to fine-grained classification. The paper describes model performance on the following datasets:
69
+
70
+ - Food101
71
+ - CIFAR10
72
+ - CIFAR100
73
+ - Birdsnap
74
+ - SUN397
75
+ - Stanford Cars
76
+ - FGVC Aircraft
77
+ - VOC2007
78
+ - DTD
79
+ - Oxford-IIIT Pet dataset
80
+ - Caltech101
81
+ - Flowers102
82
+ - MNIST
83
+ - SVHN
84
+ - IIIT5K
85
+ - Hateful Memes
86
+ - SST-2
87
+ - UCF101
88
+ - Kinetics700
89
+ - Country211
90
+ - CLEVR Counting
91
+ - KITTI Distance
92
+ - STL-10
93
+ - RareAct
94
+ - Flickr30
95
+ - MSCOCO
96
+ - ImageNet
97
+ - ImageNet-A
98
+ - ImageNet-R
99
+ - ImageNet Sketch
100
+ - ObjectNet (ImageNet Overlap)
101
+ - Youtube-BB
102
+ - ImageNet-Vid
103
+
104
+ ### Limitations
105
+
106
+ CLIP and our analysis of it have a number of limitations. CLIP currently struggles with certain tasks such as fine-grained classification and counting objects. CLIP also poses issues with regard to fairness and bias, which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP has an important limitation: in many cases we have used linear probes to evaluate the performance of CLIP, and there is evidence suggesting that linear probes can underestimate model performance.
107
+
108
+ ### Bias and Fairness
109
+
110
+ We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from [Fairface](https://arxiv.org/abs/1908.04913) into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details are captured in the Broader Impacts section of the paper.)
111
+
112
+ We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (we default to using race categories as they are constructed in the Fairface dataset) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification, with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks, and not to demonstrate an endorsement of or enthusiasm for such tasks.
113
+
114
+
115
+
116
+ ## Feedback
117
+
118
+ ### Where to send questions or comments about the model
119
+
120
+ Please use [this Google Form](https://forms.gle/Uv7afRH5dvY34ZEs9)
CLIP/requirements.txt ADDED
@@ -0,0 +1,5 @@
+ ftfy
+ regex
+ tqdm
+ torch
+ torchvision
CLIP/setup.py ADDED
@@ -0,0 +1,21 @@
+ import os
+
+ import pkg_resources
+ from setuptools import setup, find_packages
+
+ setup(
+     name="clip",
+     py_modules=["clip"],
+     version="1.0",
+     description="",
+     author="OpenAI",
+     packages=find_packages(exclude=["tests*"]),
+     install_requires=[
+         str(r)
+         for r in pkg_resources.parse_requirements(
+             open(os.path.join(os.path.dirname(__file__), "requirements.txt"))
+         )
+     ],
+     include_package_data=True,
+     extras_require={'dev': ['pytest']},
+ )
CLIP/tests/test_consistency.py ADDED
@@ -0,0 +1,25 @@
+ import numpy as np
+ import pytest
+ import torch
+ from PIL import Image
+
+ import clip
+
+
+ @pytest.mark.parametrize('model_name', clip.available_models())
+ def test_consistency(model_name):
+     device = "cpu"
+     jit_model, transform = clip.load(model_name, device=device, jit=True)
+     py_model, _ = clip.load(model_name, device=device, jit=False)
+
+     image = transform(Image.open("CLIP.png")).unsqueeze(0).to(device)
+     text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
+
+     with torch.no_grad():
+         logits_per_image, _ = jit_model(image, text)
+         jit_probs = logits_per_image.softmax(dim=-1).cpu().numpy()
+
+         logits_per_image, _ = py_model(image, text)
+         py_probs = logits_per_image.softmax(dim=-1).cpu().numpy()
+
+     assert np.allclose(jit_probs, py_probs, atol=0.01, rtol=0.1)
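The final assertion in this test is fairly permissive, and it helps to see what `atol=0.01, rtol=0.1` allow. `numpy.allclose` is documented as checking `|a - b| <= atol + rtol * |b|` element-wise; the sketch below re-implements that criterion in plain Python for illustration. The helper name `within_tolerance` and the probability values are made up for the example.

```python
def within_tolerance(a, b, atol=0.01, rtol=0.1):
    # Element-wise criterion documented for numpy.allclose:
    # |a - b| <= atol + rtol * |b|
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# Made-up softmax outputs standing in for the JIT and non-JIT models.
jit_probs = [0.90, 0.07, 0.03]
py_probs = [0.88, 0.08, 0.04]

print(within_tolerance(jit_probs, py_probs))
print(within_tolerance([0.5], [0.3]))
```

Because the relative term scales with the reference value, large probabilities may differ by several percentage points while small ones are held close to the absolute tolerance.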
taming-transformers DELETED
@@ -1 +0,0 @@
1
- Subproject commit 3ba01b241669f5ade541ce990f7650a3b8f65318
 
taming-transformers/.gitignore ADDED
@@ -0,0 +1 @@
+ *.ipynb
taming-transformers/License.txt ADDED
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2020 Patrick Esser and Robin Rombach and Björn Ommer
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
14
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
15
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
16
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
17
+ DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
18
+ OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
19
+ OR OTHER DEALINGS IN THE SOFTWARE.
taming-transformers/configs/coco_cond_stage.yaml ADDED
@@ -0,0 +1,49 @@
+ model:
+   base_learning_rate: 4.5e-06
+   target: taming.models.vqgan.VQSegmentationModel
+   params:
+     embed_dim: 256
+     n_embed: 1024
+     image_key: "segmentation"
+     n_labels: 183
+     ddconfig:
+       double_z: false
+       z_channels: 256
+       resolution: 256
+       in_channels: 183
+       out_ch: 183
+       ch: 128
+       ch_mult:
+       - 1
+       - 1
+       - 2
+       - 2
+       - 4
+       num_res_blocks: 2
+       attn_resolutions:
+       - 16
+       dropout: 0.0
+
+     lossconfig:
+       target: taming.modules.losses.segmentation.BCELossWithQuant
+       params:
+         codebook_weight: 1.0
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 12
+     train:
+       target: taming.data.coco.CocoImagesAndCaptionsTrain
+       params:
+         size: 296
+         crop_size: 256
+         onehot_segmentation: true
+         use_stuffthing: true
+     validation:
+       target: taming.data.coco.CocoImagesAndCaptionsValidation
+       params:
+         size: 256
+         crop_size: 256
+         onehot_segmentation: true
+         use_stuffthing: true
taming-transformers/configs/coco_scene_images_transformer.yaml ADDED
@@ -0,0 +1,80 @@
+ model:
+   base_learning_rate: 4.5e-06
+   target: taming.models.cond_transformer.Net2NetTransformer
+   params:
+     cond_stage_key: objects_bbox
+     transformer_config:
+       target: taming.modules.transformer.mingpt.GPT
+       params:
+         vocab_size: 8192
+         block_size: 348 # = 256 + 92 = dim(vqgan_latent_space,16x16) + dim(conditional_builder.embedding_dim)
+         n_layer: 40
+         n_head: 16
+         n_embd: 1408
+         embd_pdrop: 0.1
+         resid_pdrop: 0.1
+         attn_pdrop: 0.1
+     first_stage_config:
+       target: taming.models.vqgan.VQModel
+       params:
+         ckpt_path: /path/to/coco_epoch117.ckpt # https://heibox.uni-heidelberg.de/f/78dea9589974474c97c1/
+         embed_dim: 256
+         n_embed: 8192
+         ddconfig:
+           double_z: false
+           z_channels: 256
+           resolution: 256
+           in_channels: 3
+           out_ch: 3
+           ch: 128
+           ch_mult:
+           - 1
+           - 1
+           - 2
+           - 2
+           - 4
+           num_res_blocks: 2
+           attn_resolutions:
+           - 16
+           dropout: 0.0
+         lossconfig:
+           target: taming.modules.losses.DummyLoss
+     cond_stage_config:
+       target: taming.models.dummy_cond_stage.DummyCondStage
+       params:
+         conditional_key: objects_bbox
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 6
+     train:
+       target: taming.data.annotated_objects_coco.AnnotatedObjectsCoco
+       params:
+         data_path: data/coco_annotations_100 # substitute with path to full dataset
+         split: train
+         keys: [image, objects_bbox, file_name, annotations]
+         no_tokens: 8192
+         target_image_size: 256
+         min_object_area: 0.00001
+         min_objects_per_image: 2
+         max_objects_per_image: 30
+         crop_method: random-1d
+         random_flip: true
+         use_group_parameter: true
+         encode_crop: true
+     validation:
+       target: taming.data.annotated_objects_coco.AnnotatedObjectsCoco
+       params:
+         data_path: data/coco_annotations_100 # substitute with path to full dataset
+         split: validation
+         keys: [image, objects_bbox, file_name, annotations]
+         no_tokens: 8192
+         target_image_size: 256
+         min_object_area: 0.00001
+         min_objects_per_image: 2
+         max_objects_per_image: 30
+         crop_method: center
+         random_flip: false
+         use_group_parameter: true
+         encode_crop: true
taming-transformers/configs/custom_vqgan.yaml ADDED
@@ -0,0 +1,43 @@
+ model:
+   base_learning_rate: 4.5e-6
+   target: taming.models.vqgan.VQModel
+   params:
+     embed_dim: 256
+     n_embed: 1024
+     ddconfig:
+       double_z: False
+       z_channels: 256
+       resolution: 256
+       in_channels: 3
+       out_ch: 3
+       ch: 128
+       ch_mult: [ 1,1,2,2,4] # num_down = len(ch_mult)-1
+       num_res_blocks: 2
+       attn_resolutions: [16]
+       dropout: 0.0
+
+     lossconfig:
+       target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
+       params:
+         disc_conditional: False
+         disc_in_channels: 3
+         disc_start: 10000
+         disc_weight: 0.8
+         codebook_weight: 1.0
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 5
+     num_workers: 8
+     train:
+       target: taming.data.custom.CustomTrain
+       params:
+         training_images_list_file: some/training.txt
+         size: 256
+     validation:
+       target: taming.data.custom.CustomTest
+       params:
+         test_images_list_file: some/test.txt
+         size: 256
+
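Two comments in these configs carry the arithmetic that ties the VQGAN settings to the transformer's `block_size`: `num_down = len(ch_mult)-1` here, and `348 = 256 + 92` in `coco_scene_images_transformer.yaml`. The sketch below works through it, assuming each downsampling step halves the height and width; the helper name `latent_tokens` is hypothetical.

```python
def latent_tokens(resolution, ch_mult):
    # Per the config comment: num_down = len(ch_mult) - 1.
    # Assumption: each downsampling step halves height and width.
    num_down = len(ch_mult) - 1
    side = resolution // (2 ** num_down)
    return side, side * side


side, tokens = latent_tokens(256, [1, 1, 2, 2, 4])
print(side, tokens)

# Image tokens plus the 92 conditioning tokens noted in the COCO config comment.
print(tokens + 92)
```

With `resolution: 256` and five `ch_mult` entries, the latent grid is 16x16, which agrees with the `16x16` noted in the `block_size` comment and with `attn_resolutions: [16]`.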
taming-transformers/configs/drin_transformer.yaml ADDED
@@ -0,0 +1,77 @@
+ model:
+   base_learning_rate: 4.5e-06
+   target: taming.models.cond_transformer.Net2NetTransformer
+   params:
+     cond_stage_key: depth
+     transformer_config:
+       target: taming.modules.transformer.mingpt.GPT
+       params:
+         vocab_size: 1024
+         block_size: 512
+         n_layer: 24
+         n_head: 16
+         n_embd: 1024
+     first_stage_config:
+       target: taming.models.vqgan.VQModel
+       params:
+         ckpt_path: logs/2020-09-23T17-56-33_imagenet_vqgan/checkpoints/last.ckpt
+         embed_dim: 256
+         n_embed: 1024
+         ddconfig:
+           double_z: false
+           z_channels: 256
+           resolution: 256
+           in_channels: 3
+           out_ch: 3
+           ch: 128
+           ch_mult:
+           - 1
+           - 1
+           - 2
+           - 2
+           - 4
+           num_res_blocks: 2
+           attn_resolutions:
+           - 16
+           dropout: 0.0
+         lossconfig:
+           target: taming.modules.losses.DummyLoss
+     cond_stage_config:
+       target: taming.models.vqgan.VQModel
+       params:
+         ckpt_path: logs/2020-11-03T15-34-24_imagenetdepth_vqgan/checkpoints/last.ckpt
+         embed_dim: 256
+         n_embed: 1024
+         ddconfig:
+           double_z: false
+           z_channels: 256
+           resolution: 256
+           in_channels: 1
+           out_ch: 1
+           ch: 128
+           ch_mult:
+           - 1
+           - 1
+           - 2
+           - 2
+           - 4
+           num_res_blocks: 2
+           attn_resolutions:
+           - 16
+           dropout: 0.0
+         lossconfig:
+           target: taming.modules.losses.DummyLoss
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 2
+     num_workers: 8
+     train:
+       target: taming.data.imagenet.RINTrainWithDepth
+       params:
+         size: 256
+     validation:
+       target: taming.data.imagenet.RINValidationWithDepth
+       params:
+         size: 256
taming-transformers/configs/faceshq_transformer.yaml ADDED
@@ -0,0 +1,61 @@
+ model:
+   base_learning_rate: 4.5e-06
+   target: taming.models.cond_transformer.Net2NetTransformer
+   params:
+     cond_stage_key: coord
+     transformer_config:
+       target: taming.modules.transformer.mingpt.GPT
+       params:
+         vocab_size: 1024
+         block_size: 512
+         n_layer: 24
+         n_head: 16
+         n_embd: 1024
+     first_stage_config:
+       target: taming.models.vqgan.VQModel
+       params:
+         ckpt_path: logs/2020-11-09T13-33-36_faceshq_vqgan/checkpoints/last.ckpt
+         embed_dim: 256
+         n_embed: 1024
+         ddconfig:
+           double_z: false
+           z_channels: 256
+           resolution: 256
+           in_channels: 3
+           out_ch: 3
+           ch: 128
+           ch_mult:
+           - 1
+           - 1
+           - 2
+           - 2
+           - 4
+           num_res_blocks: 2
+           attn_resolutions:
+           - 16
+           dropout: 0.0
+         lossconfig:
+           target: taming.modules.losses.DummyLoss
+     cond_stage_config:
+       target: taming.modules.misc.coord.CoordStage
+       params:
+         n_embed: 1024
+         down_factor: 16
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 2
+     num_workers: 8
+     train:
+       target: taming.data.faceshq.FacesHQTrain
+       params:
+         size: 256
+         crop_size: 256
+         coord: True
+     validation:
+       target: taming.data.faceshq.FacesHQValidation
+       params:
+         size: 256
+         crop_size: 256
+         coord: True
taming-transformers/configs/faceshq_vqgan.yaml ADDED
@@ -0,0 +1,42 @@
+ model:
+   base_learning_rate: 4.5e-6
+   target: taming.models.vqgan.VQModel
+   params:
+     embed_dim: 256
+     n_embed: 1024
+     ddconfig:
+       double_z: False
+       z_channels: 256
+       resolution: 256
+       in_channels: 3
+       out_ch: 3
+       ch: 128
+       ch_mult: [ 1,1,2,2,4] # num_down = len(ch_mult)-1
+       num_res_blocks: 2
+       attn_resolutions: [16]
+       dropout: 0.0
+
+     lossconfig:
+       target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
+       params:
+         disc_conditional: False
+         disc_in_channels: 3
+         disc_start: 30001
+         disc_weight: 0.8
+         codebook_weight: 1.0
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 3
+     num_workers: 8
+     train:
+       target: taming.data.faceshq.FacesHQTrain
+       params:
+         size: 256
+         crop_size: 256
+     validation:
+       target: taming.data.faceshq.FacesHQValidation
+       params:
+         size: 256
+         crop_size: 256
taming-transformers/configs/imagenet_vqgan.yaml ADDED
@@ -0,0 +1,42 @@
+ model:
+   base_learning_rate: 4.5e-6
+   target: taming.models.vqgan.VQModel
+   params:
+     embed_dim: 256
+     n_embed: 1024
+     ddconfig:
+       double_z: False
+       z_channels: 256
+       resolution: 256
+       in_channels: 3
+       out_ch: 3
+       ch: 128
+       ch_mult: [ 1,1,2,2,4] # num_down = len(ch_mult)-1
+       num_res_blocks: 2
+       attn_resolutions: [16]
+       dropout: 0.0
+
+     lossconfig:
+       target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
+       params:
+         disc_conditional: False
+         disc_in_channels: 3
+         disc_start: 250001
+         disc_weight: 0.8
+         codebook_weight: 1.0
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 12
+     num_workers: 24
+     train:
+       target: taming.data.imagenet.ImageNetTrain
+       params:
+         config:
+           size: 256
+     validation:
+       target: taming.data.imagenet.ImageNetValidation
+       params:
+         config:
+           size: 256
taming-transformers/configs/imagenetdepth_vqgan.yaml ADDED
@@ -0,0 +1,41 @@
+ model:
+   base_learning_rate: 4.5e-6
+   target: taming.models.vqgan.VQModel
+   params:
+     embed_dim: 256
+     n_embed: 1024
+     image_key: depth
+     ddconfig:
+       double_z: False
+       z_channels: 256
+       resolution: 256
+       in_channels: 1
+       out_ch: 1
+       ch: 128
+       ch_mult: [ 1,1,2,2,4] # num_down = len(ch_mult)-1
+       num_res_blocks: 2
+       attn_resolutions: [16]
+       dropout: 0.0
+
+     lossconfig:
+       target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
+       params:
+         disc_conditional: False
+         disc_in_channels: 1
+         disc_start: 50001
+         disc_weight: 0.75
+         codebook_weight: 1.0
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 3
+     num_workers: 8
+     train:
+       target: taming.data.imagenet.ImageNetTrainWithDepth
+       params:
+         size: 256
+     validation:
+       target: taming.data.imagenet.ImageNetValidationWithDepth
+       params:
+         size: 256
taming-transformers/configs/open_images_scene_images_transformer.yaml ADDED
@@ -0,0 +1,86 @@
+ model:
+   base_learning_rate: 4.5e-06
+   target: taming.models.cond_transformer.Net2NetTransformer
+   params:
+     cond_stage_key: objects_bbox
+     transformer_config:
+       target: taming.modules.transformer.mingpt.GPT
+       params:
+         vocab_size: 8192
+         block_size: 348 # = 256 + 92 = dim(vqgan_latent_space,16x16) + dim(conditional_builder.embedding_dim)
+         n_layer: 36
+         n_head: 16
+         n_embd: 1536
+         embd_pdrop: 0.1
+         resid_pdrop: 0.1
+         attn_pdrop: 0.1
+     first_stage_config:
+       target: taming.models.vqgan.VQModel
+       params:
+         ckpt_path: /path/to/coco_oi_epoch12.ckpt # https://heibox.uni-heidelberg.de/f/461d9a9f4fcf48ab84f4/
+         embed_dim: 256
+         n_embed: 8192
+         ddconfig:
+           double_z: false
+           z_channels: 256
+           resolution: 256
+           in_channels: 3
+           out_ch: 3
+           ch: 128
+           ch_mult:
+           - 1
+           - 1
+           - 2
+           - 2
+           - 4
+           num_res_blocks: 2
+           attn_resolutions:
+           - 16
+           dropout: 0.0
+         lossconfig:
+           target: taming.modules.losses.DummyLoss
+     cond_stage_config:
+       target: taming.models.dummy_cond_stage.DummyCondStage
+       params:
+         conditional_key: objects_bbox
+
+ data:
+   target: main.DataModuleFromConfig
+   params:
+     batch_size: 6
+     train:
+       target: taming.data.annotated_objects_open_images.AnnotatedObjectsOpenImages
+       params:
+         data_path: data/open_images_annotations_100 # substitute with path to full dataset
+         split: train
+         keys: [image, objects_bbox, file_name, annotations]
+         no_tokens: 8192
+         target_image_size: 256
+         category_allow_list_target: taming.data.open_images_helper.top_300_classes_plus_coco_compatibility
+         category_mapping_target: taming.data.open_images_helper.open_images_unify_categories_for_coco
+         min_object_area: 0.0001
+         min_objects_per_image: 2
+         max_objects_per_image: 30
+         crop_method: random-2d
+         random_flip: true
+         use_group_parameter: true
+         use_additional_parameters: true
+         encode_crop: true
+     validation:
+       target: taming.data.annotated_objects_open_images.AnnotatedObjectsOpenImages
+       params:
+         data_path: data/open_images_annotations_100 # substitute with path to full dataset
+         split: validation
+         keys: [image, objects_bbox, file_name, annotations]
+         no_tokens: 8192
+         target_image_size: 256
+         category_allow_list_target: taming.data.open_images_helper.top_300_classes_plus_coco_compatibility
+         category_mapping_target: taming.data.open_images_helper.open_images_unify_categories_for_coco
+         min_object_area: 0.0001
+         min_objects_per_image: 2
+         max_objects_per_image: 30
+         crop_method: center
+         random_flip: false
+         use_group_parameter: true
+         use_additional_parameters: true
+         encode_crop: true
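Note on `block_size` in the config above: its inline comment gives 348 = 256 + 92, and the VQGAN configs in this commit note that `num_down = len(ch_mult)-1`. The sketch below reproduces that arithmetic. It assumes each downsampling stage halves the spatial resolution, which the comment's "16x16" latent space implies for a 256-pixel input. The helper name `latent_tokens` is hypothetical and not part of the repository, and the value 92 is taken directly from the config comment rather than derived.

```python
from typing import Sequence


def latent_tokens(resolution: int, ch_mult: Sequence[int]) -> int:
    """Number of codebook tokens per image for a square input.

    Assumes each of the len(ch_mult) - 1 downsampling stages halves
    the spatial resolution (hypothetical helper, for illustration only).
    """
    num_down = len(ch_mult) - 1           # as stated in the config comment
    side = resolution // (2 ** num_down)  # 256 // 16 = 16
    return side * side


# Values copied from open_images_scene_images_transformer.yaml
image_tokens = latent_tokens(resolution=256, ch_mult=[1, 1, 2, 2, 4])
cond_tokens = 92  # conditional_builder embedding length, per the config comment
block_size = image_tokens + cond_tokens

print(image_tokens, block_size)
```

With these values the latent grid side is 16, which also matches `attn_resolutions: [16]` in the same `ddconfig`.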
taming-transformers/configs/sflckr_cond_stage.yaml ADDED
@@ -0,0 +1,43 @@
+ model:
+   base_learning_rate: 4.5e-06
+   target: taming.models.vqgan.VQSegmentationModel
+   params:
+     embed_dim: 256
+     n_embed: 1024
+     image_key: "segmentation"
+     n_labels: 182
+     ddconfig:
+       double_z: false
+       z_channels: 256
+       resolution: 256
+       in_channels: 182
+       out_ch: 182
+       ch: 128
+       ch_mult:
+       - 1
+       - 1
+       - 2
+       - 2
+       - 4
+       num_res_blocks: 2
+       attn_resolutions:
+       - 16
+       dropout: 0.0
+
+     lossconfig:
+       target: taming.modules.losses.segmentation.BCELossWithQuant
+       params:
+         codebook_weight: 1.0
+
+ data:
+   target: cutlit.DataModuleFromConfig
+   params:
+     batch_size: 12
+     train:
+       target: taming.data.sflckr.Examples # adjust
+       params:
+         size: 256
+     validation:
+       target: taming.data.sflckr.Examples # adjust
+       params:
+         size: 256
taming-transformers/data/ade20k_examples.txt ADDED
@@ -0,0 +1,30 @@
+ ADE_val_00000636.jpg
+ ADE_val_00000126.jpg
+ ADE_val_00001412.jpg
+ ADE_val_00001845.jpg
+ ADE_val_00001200.jpg
+ ADE_val_00001578.jpg
+ ADE_val_00000880.jpg
+ ADE_val_00000875.jpg
+ ADE_val_00000123.jpg
+ ADE_val_00001209.jpg
+ ADE_val_00000203.jpg
+ ADE_val_00001851.jpg
+ ADE_val_00001583.jpg
+ ADE_val_00000287.jpg
+ ADE_val_00001947.jpg
+ ADE_val_00000262.jpg
+ ADE_val_00000603.jpg
+ ADE_val_00000125.jpg
+ ADE_val_00001698.jpg
+ ADE_val_00001966.jpg
+ ADE_val_00000532.jpg
+ ADE_val_00001177.jpg
+ ADE_val_00000734.jpg
+ ADE_val_00001498.jpg
+ ADE_val_00001766.jpg
+ ADE_val_00000303.jpg
+ ADE_val_00000509.jpg
+ ADE_val_00000573.jpg
+ ADE_val_00000289.jpg
+ ADE_val_00001388.jpg
taming-transformers/data/ade20k_images/ADE_val_00000123.jpg ADDED
taming-transformers/data/ade20k_images/ADE_val_00000125.jpg ADDED
taming-transformers/data/ade20k_images/ADE_val_00000126.jpg ADDED
taming-transformers/data/ade20k_images/ADE_val_00000203.jpg ADDED
taming-transformers/data/ade20k_images/ADE_val_00000262.jpg ADDED
taming-transformers/data/ade20k_images/ADE_val_00000287.jpg ADDED
taming-transformers/data/ade20k_images/ADE_val_00000289.jpg ADDED
taming-transformers/data/ade20k_images/ADE_val_00000303.jpg ADDED