# Global Context Vision Transformer (GC ViT)

This repository contains the official PyTorch implementation of **Global Context Vision Transformers** (ICML 2023).

[Global Context Vision Transformers](https://arxiv.org/pdf/2206.09959.pdf) \
[Ali Hatamizadeh](https://research.nvidia.com/person/ali-hatamizadeh),
[Hongxu (Danny) Yin](https://scholar.princeton.edu/hongxu),
[Greg Heinrich](https://developer.nvidia.com/blog/author/gheinrich/),
[Jan Kautz](https://jankautz.com/),
and [Pavlo Molchanov](https://www.pmolchanov.com/).

GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K classification benchmark, GC ViT variants with `51M`, `90M`, and `201M` parameters achieve `84.3`, `85.0`, and `85.7` Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer.

<p align="center">
<img src="https://github.com/NVlabs/GCVit/assets/26806394/d1820d6d-3aef-470e-a1d3-af370f1c1f77" width=63% height=63% 
class="center">
</p>


The overall architecture of GC ViT is illustrated in the figure below:

![gc_vit](https://github.com/NVlabs/GCVit/assets/26806394/86ca853e-56bc-4907-b3e3-0c4611ef9073)


## Introduction

**GC ViT** leverages global context self-attention modules, jointly with standard local self-attention, to effectively and efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows.


<p align="center">
<img src="https://github.com/NVlabs/GCVit/assets/26806394/da64f22a-e7af-4577-8884-b08ba4e24e49" width=72% height=72% 
class="center">
</p>
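
At each stage, GC ViT alternates local window self-attention with global window attention, in which the query tokens are derived from a globally pooled representation of the full feature map while keys and values remain local to each window. The following is a minimal, simplified PyTorch sketch of that global-query attention step; the module name and tensor shapes are illustrative assumptions, and the official blocks additionally use a fused-MBConv global query generator, relative position bias, and the usual normalization/MLP layers.

```python
import torch
import torch.nn as nn


class SimplifiedGlobalWindowAttention(nn.Module):
    """Hypothetical, stripped-down sketch of GC ViT-style global query attention.

    Keys/values come from each local window; queries come from a single
    "global" token map shared by all windows of the same image.
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.kv = nn.Linear(dim, dim * 2)    # keys/values from local window tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, q_global):
        # x:        (B * num_windows, N, C) local window tokens, N = window_size**2
        # q_global: (B, N, C)               pre-computed global query tokens
        Bw, N, C = x.shape
        B = q_global.shape[0]
        kv = self.kv(x).reshape(Bw, N, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]                                    # (Bw, heads, N, head_dim)
        q = q_global.reshape(B, N, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        q = q.repeat_interleave(Bw // B, dim=0)                # broadcast to every window
        attn = (q * self.scale) @ k.transpose(-2, -1)          # (Bw, heads, N, N)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(Bw, N, C)
        return self.proj(out)


if __name__ == "__main__":
    B, H, W, C, ws, heads = 2, 14, 14, 96, 7, 3
    num_windows = (H // ws) * (W // ws)
    x_windows = torch.randn(B * num_windows, ws * ws, C)       # partitioned local windows
    q_global = torch.randn(B, ws * ws, C)                      # e.g. pooled feature map
    attn = SimplifiedGlobalWindowAttention(C, heads)
    print(attn(x_windows, q_global).shape)                     # torch.Size([8, 49, 96])
```

Because the global queries are computed once per stage and reused by every window, this step keeps the quadratic attention cost confined to the window size rather than the full image resolution.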


## ImageNet Benchmarks


**ImageNet-1K Pretrained Models**

<table>
  <tr>
    <th>Model Variant</th>
    <th>Acc@1</th>
    <th>#Params(M)</th>
    <th>FLOPs(G)</th>
    <th>Download</th>
  </tr>
<tr>
    <td>GC ViT-XXT</td>
    <th>79.9</th>
    <td>12</td>
    <td>2.1</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1apSIWQCa5VhWLJws8ugMTuyKzyayw4Eh">model</a></td>
</tr>
<tr>
    <td>GC ViT-XT</td>
    <th>82.0</th>
    <td>20</td>
    <td>2.6</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1OgSbX73AXmE0beStoJf2Jtda1yin9t9m">model</a></td>
</tr>
<tr>
    <td>GC ViT-T</td>
    <th>83.5</th>
    <td>28</td>
    <td>4.7</td>
    <td><a href="https://drive.google.com/uc?export=download&id=11M6AsxKLhfOpD12Nm_c7lOvIIAn9cljy">model</a></td>
</tr>
<tr>
    <td>GC ViT-T2</td>
    <th>83.7</th>
    <td>34</td>
    <td>5.5</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1cTD8VemWFiwAx0FB9cRMT-P4vRuylvmQ">model</a></td>
</tr>
<tr>
    <td>GC ViT-S</td>
    <th>84.3</th>
    <td>51</td>
    <td>8.5</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1Nn6ABKmYjylyWC0I41Q3oExrn4fTzO9Y">model</a></td>
</tr>
<tr>
    <td>GC ViT-S2</td>
    <th>84.8</th>
    <td>68</td>
    <td>10.7</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1E5TtYpTqILznjBLLBTlO5CGq343RbEan">model</a></td>
</tr>
<tr>
    <td>GC ViT-B</td>
    <th>85.0</th>
    <td>90</td>
    <td>14.8</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1PF7qfxKLcv_ASOMetDP75n8lC50gaqyH">model</a></td>
</tr>

<tr>
    <td>GC ViT-L</td>
    <th>85.7</th>
    <td>201</td>
    <td>32.6</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1Lkz1nWKTwCCUR7yQJM6zu_xwN1TR0mxS">model</a></td>
</tr>

</table>

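To experiment with a pretrained variant, GC ViT models are also registered in `timm`; a minimal single-image inference sketch is shown below. The `gcvit_tiny` name and the example image path are assumptions (other registered variants such as `gcvit_xxtiny`, `gcvit_xtiny`, `gcvit_small`, and `gcvit_base` follow the same pattern); the Google Drive checkpoints linked above instead load through the model definitions in this repository.

```python
import torch
import timm                      # pip install timm
from PIL import Image

# 'gcvit_tiny' is assumed to be the timm-registered name for GC ViT-T.
model = timm.create_model("gcvit_tiny", pretrained=True).eval()

# Build preprocessing that matches the model's pretrained configuration.
config = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**config)

img = Image.open("example.jpg").convert("RGB")   # placeholder image path
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # (1, 1000) ImageNet-1K logits
print(logits.softmax(dim=-1).topk(5))
```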

**ImageNet-21K Pretrained Models**

<table>
  <tr>
    <th>Model Variant</th>
    <th>Resolution</th>
    <th>Acc@1</th>
    <th>#Params(M)</th>
    <th>FLOPs(G)</th>
    <th>Download</th>
  </tr>
<tr>
    <td>GC ViT-L</td>
    <td>224 x 224</td>
    <th>86.6</th>
    <td>201</td>
    <td>32.6</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1maGDr6mJkLyRTUkspMzCgSlhDzNRFGEf">model</a></td>
</tr>
<tr>
    <td>GC ViT-L</td>
    <td>384 x 384</td>
    <th>87.4</th>
    <td>201</td>
    <td>120.4</td>
    <td><a href="https://drive.google.com/uc?export=download&id=1P-IEhvQbJ3FjnunVkM1Z9dEpKw-tsuWv">model</a></td>
</tr>

</table>


## Citation

Please consider citing the GC ViT paper if it is useful for your work:

```bibtex
@inproceedings{hatamizadeh2023global,
  title={Global context vision transformers},
  author={Hatamizadeh, Ali and Yin, Hongxu and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
  booktitle={International Conference on Machine Learning},
  pages={12633--12646},
  year={2023},
  organization={PMLR}
}
```

## Licenses

Copyright © 2023, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC. Click [here](LICENSE) to view a copy of this license.

The pre-trained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

For license information regarding the timm library, please refer to its [repository](https://github.com/rwightman/pytorch-image-models).

For license information regarding the ImageNet dataset, please refer to the ImageNet [official website](https://www.image-net.org/).