rwightman HF staff committed on
Commit
4dcdb8e
1 Parent(s): 474aacb
Files changed (4)
  1. README.md +175 -0
  2. config.json +34 -0
  3. model.safetensors +3 -0
  4. pytorch_model.bin +3 -0
README.md ADDED
@@ -0,0 +1,175 @@
+ ---
+ tags:
+ - image-classification
+ - timm
+ library_name: timm
+ license: mit
+ datasets:
+ - imagenet-1k
+ ---
+ # Model card for vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k
+
+ A Vision Transformer (ViT) image classification model. This is a `timm`-specific variation of the architecture with rotary position embeddings (RoPE), registers, and global average pooling.
+
+ A number of models at the lower end of the model scale originate in `timm`:
+
+ | variant | width | mlp width (mult) | heads | depth | timm orig |
+ | ------- | ----- | ---------------- | ----- | ----- | --------- |
+ | tiny | 192 | 768 (4) | 3 | 12 | n |
+ | wee | 256 | 1280 (5) | 4 | 14 | y |
+ | pwee | 256 | 1280 (5) | 4 | 16 (parallel) | y |
+ | small | 384 | 1536 (4) | 6 | 12 | n |
+ | little | 320 | 1792 (5.6) | 5 | 14 | y |
+ | medium | 512 | 2048 (4) | 8 | 12 | y |
+ | mediumd | 512 | 2048 (4) | 8 | 20 | y |
+ | betwixt | 640 | 2560 (4) | 10 | 12 | y |
+ | base | 768 | 3072 (4) | 12 | 12 | n |
+
+ Trained on ImageNet-1k in `timm` using the recipe template described below.
+
+ Recipe details:
+ * Searching for better baselines. Influenced by Swin/DeiT/DeiT-III but with increased weight decay and moderate (in12k) to high (in1k) augmentation. Layer decay was used for fine-tuning. Some runs used BCE and/or NAdamW instead of AdamW.
+ * See [train_hparams.yaml](./train_hparams.yaml) for specifics of each model.
+
+
+ ## Model Details
+ - **Model Type:** Image classification / feature backbone
+ - **Model Stats:**
+   - Params (M): 63.9
+   - GMACs: 16.3
+   - Activations (M): 23.8
+   - Image size: 256 x 256
+ - **Papers:**
+   - EVA-02: A Visual Representation for Neon Genesis: https://arxiv.org/abs/2303.11331
+   - Vision Transformers Need Registers: https://arxiv.org/abs/2309.16588
+   - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
+ - **Dataset:** ImageNet-1k
+ - **Original:** https://github.com/huggingface/pytorch-image-models
+
+ ## Model Usage
+ ### Image Classification
+ ```python
+ from urllib.request import urlopen
+ from PIL import Image
+ import timm
+ import torch  # needed for torch.topk below
+
+ img = Image.open(urlopen(
+     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
+ ))
+
+ model = timm.create_model('vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k', pretrained=True)
+ model = model.eval()
+
+ # get model specific transforms (normalization, resize)
+ data_config = timm.data.resolve_model_data_config(model)
+ transforms = timm.data.create_transform(**data_config, is_training=False)
+
+ output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1
+
+ top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
+ ```
+
+ ### Feature Map Extraction
+ ```python
+ from urllib.request import urlopen
+ from PIL import Image
+ import timm
+
+ img = Image.open(urlopen(
+     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
+ ))
+
+ model = timm.create_model(
+     'vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k',
+     pretrained=True,
+     features_only=True,
+ )
+ model = model.eval()
+
+ # get model specific transforms (normalization, resize)
+ data_config = timm.data.resolve_model_data_config(model)
+ transforms = timm.data.create_transform(**data_config, is_training=False)
+
+ output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1
+
+ for o in output:
+     # print shape of each feature map in output
+     # e.g.:
+     #  torch.Size([1, 512, 16, 16])
+     #  torch.Size([1, 512, 16, 16])
+     #  torch.Size([1, 512, 16, 16])
+
+     print(o.shape)
+ ```
+
+ ### Image Embeddings
+ ```python
+ from urllib.request import urlopen
+ from PIL import Image
+ import timm
+
+ img = Image.open(urlopen(
+     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
+ ))
+
+ model = timm.create_model(
+     'vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k',
+     pretrained=True,
+     num_classes=0,  # remove classifier nn.Linear
+ )
+ model = model.eval()
+
+ # get model specific transforms (normalization, resize)
+ data_config = timm.data.resolve_model_data_config(model)
+ transforms = timm.data.create_transform(**data_config, is_training=False)
+
+ output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor
+
+ # or equivalently (without needing to set num_classes=0)
+
+ output = model.forward_features(transforms(img).unsqueeze(0))
+ # output is unpooled, a (1, 257, 512) shaped tensor
+
+ output = model.forward_head(output, pre_logits=True)
+ # output is a (1, num_features) shaped tensor
+ ```
+
+ ## Model Comparison
+ Explore the dataset and runtime metrics of this model in timm [model results](https://github.com/huggingface/pytorch-image-models/tree/main/results).
+
+ ## Citation
+ ```bibtex
+ @misc{rw2019timm,
+   author = {Ross Wightman},
+   title = {PyTorch Image Models},
+   year = {2019},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   doi = {10.5281/zenodo.4414861},
+   howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
+ }
+ ```
+ ```bibtex
+ @article{EVA02,
+   title={EVA-02: A Visual Representation for Neon Genesis},
+   author={Fang, Yuxin and Sun, Quan and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
+   journal={arXiv preprint arXiv:2303.11331},
+   year={2023}
+ }
+ ```
+ ```bibtex
+ @article{darcet2023vision,
+   title={Vision Transformers Need Registers},
+   author={Darcet, Timoth{\'e}e and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
+   journal={arXiv preprint arXiv:2309.16588},
+   year={2023}
+ }
+ ```
+ ```bibtex
+ @article{dosovitskiy2020vit,
+   title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
+   author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
+   journal={ICLR},
+   year={2021}
+ }
+ ```
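As a rough cross-check of the figures in the README above, the following sketch estimates the parameter count and token count for the `mediumd` row of the variant table. It is a back-of-envelope calculation, not how `timm` computes its reported stats: it counts only the large weight matrices (patch embedding, attention, MLP, classifier) and ignores biases, norm layers, and the register token's own parameters. The helper names are made up for illustration, and the absence of a class token is an assumption based on the `gap` and `reg1` parts of the model name.

```python
def estimate_vit_params(width, mlp_width, depth, patch=16, in_chans=3, num_classes=1000):
    """Approximate ViT weight count, ignoring biases, norms, and extra tokens."""
    attn = 4 * width * width          # q, k, v and output projection matrices
    mlp = 2 * width * mlp_width       # the two linear layers of the MLP
    patch_embed = in_chans * patch * patch * width
    head = width * num_classes
    return depth * (attn + mlp) + patch_embed + head


def token_count(img_size=256, patch=16, num_reg_tokens=1):
    """Patch tokens on a square grid plus register tokens (no class token assumed)."""
    grid = img_size // patch
    return grid, grid * grid + num_reg_tokens


# Values for the mediumd row: width 512, mlp width 2048, depth 20.
params = estimate_vit_params(width=512, mlp_width=2048, depth=20)
grid, tokens = token_count()
print(f"~{params / 1e6:.1f}M weights, {grid}x{grid} grid, {tokens} tokens")
```

The estimate comes to roughly 63.8M, close to the 63.9M listed under Model Stats, and the 257 tokens and 16x16 grid agree with the `(1, 257, 512)` and `[1, 512, 16, 16]` shapes noted in the usage examples.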
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "architecture": "vit_mediumd_patch16_rope_reg1_gap_256",
+   "num_classes": 1000,
+   "num_features": 512,
+   "global_pool": "avg",
+   "pretrained_cfg": {
+     "tag": "sbb_in1k",
+     "custom_load": false,
+     "input_size": [
+       3,
+       256,
+       256
+     ],
+     "fixed_input_size": true,
+     "interpolation": "bicubic",
+     "crop_pct": 0.95,
+     "crop_mode": "center",
+     "mean": [
+       0.5,
+       0.5,
+       0.5
+     ],
+     "std": [
+       0.5,
+       0.5,
+       0.5
+     ],
+     "num_classes": 1000,
+     "pool_size": null,
+     "first_conv": "patch_embed.proj",
+     "classifier": "head",
+     "license": "mit"
+   }
+ }
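The `pretrained_cfg` values above determine the evaluation preprocessing. The sketch below works through what they imply numerically. It assumes the common convention of resizing to `floor(input_size / crop_pct)` before the center crop and of normalizing pixel values in [0, 1] as `(x - mean) / std`; the function names are hypothetical, and the actual pipeline is the one built by `timm.data.create_transform` in the README examples.

```python
import math

# Values copied from the pretrained_cfg in config.json above.
cfg = {
    "input_size": (3, 256, 256),
    "crop_pct": 0.95,
    "mean": (0.5, 0.5, 0.5),
    "std": (0.5, 0.5, 0.5),
}


def resize_target(cfg):
    """Size to resize to before center cropping (assumed floor(size / crop_pct))."""
    return math.floor(cfg["input_size"][-1] / cfg["crop_pct"])


def normalize(pixel, cfg, channel=0):
    """Normalize one pixel value in [0, 1] with the per-channel mean and std."""
    return (pixel - cfg["mean"][channel]) / cfg["std"][channel]


print(resize_target(cfg))                        # 269
print(normalize(0.0, cfg), normalize(1.0, cfg))  # -1.0 1.0
```

Under these assumptions the image is resized to 269 pixels, center-cropped to 256 x 256, and mapped from [0, 1] to [-1, 1].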
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:def8b495e3751dbdd06ef4a1a6a7dd33f10d8cbb5b948f8f737ec7fd62e9349b
+ size 255807376
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a7b8dbe08a012140abbcf4f12542c161594f1e6079114d0fb5c1d29fa5594ff1
+ size 255879094
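Both weight files are committed as Git LFS pointers: three-line text files recording the spec version, a SHA-256 object id, and the size in bytes. A minimal sketch of how a downloaded file could be checked against such a pointer follows; the helper names are hypothetical and the local file path in the usage note is an assumption, not something this commit defines.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and hash recorded in the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Hash in chunks so large weight files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


# Pointer contents for model.safetensors, copied from the diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:def8b495e3751dbdd06ef4a1a6a7dd33f10d8cbb5b948f8f737ec7fd62e9349b\n"
    "size 255807376\n"
)
print(pointer["algo"], pointer["size"])
```

After downloading the weights, a call such as `matches_pointer("model.safetensors", pointer)` (with the path adjusted to wherever the file was saved) would confirm that the local copy matches what this commit recorded.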