rwightman committed
Commit 56afc98
1 Parent(s): 0c5e347

Update model config and README

Files changed (3)
  1. README.md +106 -2
  2. config.json +1 -1
  3. model.safetensors +3 -0
README.md CHANGED
@@ -2,6 +2,110 @@
  tags:
  - image-classification
  - timm
- library_tag: timm
+ library_name: timm
+ license: apache-2.0
+ datasets:
+ - imagenet-1k
+ - imagenet-12k
  ---
- # Model card for timm/vit_medium_patch16_gap_384.in12k_ft_in1k
+ # Model card for vit_medium_patch16_gap_384.sw_in12k_ft_in1k
+
+ A Vision Transformer (ViT) image classification model. This is a `timm`-specific variation of the architecture with token global average pooling. Pretrained on ImageNet-12k and fine-tuned on ImageNet-1k by Ross Wightman in `timm` using the recipe template described below.
+
+ Recipe details:
+ * Based on the Swin Transformer train / pretrain recipe with modifications (related to both the DeiT and ConvNeXt recipes); a rough sketch of how these ingredients fit together follows this list
+ * AdamW optimizer, gradient clipping, EMA weight averaging
+ * Cosine LR schedule with warmup
+
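+ These bullets summarize the recipe rather than fully specify it. As a rough sketch only (all hyperparameter values below are placeholder assumptions, not the values actually used for this checkpoint), the listed ingredients combine like this with PyTorch and `timm` utilities:
+
+ ```python
+ import torch
+ import timm
+ from timm.scheduler import CosineLRScheduler
+ from timm.utils import ModelEmaV2
+
+ model = timm.create_model('vit_medium_patch16_gap_384', num_classes=1000)
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)  # placeholder values
+ scheduler = CosineLRScheduler(optimizer, t_initial=300, warmup_t=5, warmup_lr_init=1e-6)  # cosine with warmup; epoch counts assumed
+ ema = ModelEmaV2(model, decay=0.9998)  # EMA weight averaging; decay is a placeholder
+
+ def train_one_epoch(loader, epoch):
+     for images, targets in loader:
+         optimizer.zero_grad()
+         loss = torch.nn.functional.cross_entropy(model(images), targets)
+         loss.backward()
+         torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
+         optimizer.step()
+         ema.update(model)  # update the EMA copy after each optimizer step
+     scheduler.step(epoch + 1)  # advance the cosine schedule once per epoch
+ ```
+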
+ ## Model Details
+ - **Model Type:** Image classification / feature backbone
+ - **Model Stats:**
+   - Params (M): 39.0
+   - GMACs: 22.0
+   - Activations (M): 32.1
+   - Image size: 384 x 384
+ - **Papers:**
+   - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
+ - **Dataset:** ImageNet-1k
+ - **Pretrain Dataset:** ImageNet-12k
+ - **Original:** https://github.com/huggingface/pytorch-image-models
+
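+ The parameter count above can be sanity checked directly (a small usage sketch, not part of the published stats):
+
+ ```python
+ import timm
+
+ # instantiate without pretrained weights just to count parameters
+ model = timm.create_model('vit_medium_patch16_gap_384.sw_in12k_ft_in1k')
+ print(f'{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M params')  # ~39.0M
+ ```
+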
+ ## Model Usage
+ ### Image Classification
+ ```python
+ from urllib.request import urlopen
+ from PIL import Image
+ import torch
+ import timm
+
+ img = Image.open(urlopen(
+     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
+ ))
+
+ model = timm.create_model('vit_medium_patch16_gap_384.sw_in12k_ft_in1k', pretrained=True)
+ model = model.eval()
+
+ # get model specific transforms (normalization, resize)
+ data_config = timm.data.resolve_model_data_config(model)
+ transforms = timm.data.create_transform(**data_config, is_training=False)
+
+ output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1
+
+ top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
+ ```
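+
+ The result tensors can be inspected directly; a small usage continuation of the block above (class ids only, since the card does not include a label-name mapping):
+
+ ```python
+ for prob, idx in zip(top5_probabilities[0], top5_class_indices[0]):
+     # probabilities were scaled by 100 above, so these print as percentages
+     print(f'class {idx.item()}: {prob.item():.2f}%')
+ ```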
+
+ ### Image Embeddings
+ ```python
+ from urllib.request import urlopen
+ from PIL import Image
+ import timm
+
+ img = Image.open(urlopen(
+     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
+ ))
+
+ model = timm.create_model(
+     'vit_medium_patch16_gap_384.sw_in12k_ft_in1k',
+     pretrained=True,
+     num_classes=0,  # remove classifier nn.Linear
+ )
+ model = model.eval()
+
+ # get model specific transforms (normalization, resize)
+ data_config = timm.data.resolve_model_data_config(model)
+ transforms = timm.data.create_transform(**data_config, is_training=False)
+
+ output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor
+
+ # or equivalently (without needing to set num_classes=0)
+
+ output = model.forward_features(transforms(img).unsqueeze(0))
+ # output is unpooled, a (1, 576, 512) shaped tensor (576 patch tokens, 512-dim embeddings)
+
+ output = model.forward_head(output, pre_logits=True)
+ # output is a (1, num_features) shaped tensor
+ ```
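+
+ The 576 tokens come from the patch grid: (384 / 16)² = 24 × 24 patches, each embedded in 512 dimensions. As an illustrative sketch continuing the block above (not from the original card), the unpooled tokens can be averaged to perform this model's `avg` token pooling; note `forward_head` also applies a final norm after pooling, so this reproduces only the pooling step:
+
+ ```python
+ tokens = model.forward_features(transforms(img).unsqueeze(0))  # (1, 576, 512) patch tokens
+ pooled = tokens.mean(dim=1)  # (1, 512) global average over the 576 tokens
+ ```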
+
+ ## Model Comparison
+ Explore the dataset and runtime metrics of this model in timm [model results](https://github.com/huggingface/pytorch-image-models/tree/main/results).
+
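+ As a convenience sketch (assuming the `results-imagenet.csv` file in that directory keeps its current name and layout, which is an assumption here), the published validation results can be filtered for this model with pandas:
+
+ ```python
+ import pandas as pd
+
+ # assumed location of timm's published ImageNet-1k validation results
+ url = 'https://raw.githubusercontent.com/huggingface/pytorch-image-models/main/results/results-imagenet.csv'
+ df = pd.read_csv(url)
+ print(df[df['model'].str.contains('vit_medium_patch16_gap_384')])
+ ```
+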
+ ## Citation
+ ```bibtex
+ @misc{rw2019timm,
+   author = {Ross Wightman},
+   title = {PyTorch Image Models},
+   year = {2019},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   doi = {10.5281/zenodo.4414861},
+   howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
+ }
+ ```
+ ```bibtex
+ @article{dosovitskiy2020vit,
+   title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
+   author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
+   journal={ICLR},
+   year={2021}
+ }
+ ```
config.json CHANGED
@@ -4,7 +4,7 @@
   "num_features": 512,
   "global_pool": "avg",
   "pretrained_cfg": {
-    "tag": "in12k_ft_in1k",
+    "tag": "sw_in12k_ft_in1k",
     "custom_load": false,
     "input_size": [
       3,
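The tag rename above is user-visible: `timm` resolves pretrained weights by `<architecture>.<tag>`, so this commit moves the weights to the new name used throughout the README. A minimal check, assuming a current `timm` install:

```python
import timm

# loads via the renamed pretrained tag introduced in this commit
model = timm.create_model('vit_medium_patch16_gap_384.sw_in12k_ft_in1k', pretrained=True)
print(model.pretrained_cfg['tag'])  # 'sw_in12k_ft_in1k'
```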
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9d7a9d530dfc6a6d0dd34139f8badec95f833b10e07a8a751dee45d5bc7f53b4
+ size 156115396