---
tags:
- vision
datasets:
- kakaobrain/coyo-labeled-300m
annotations_creators:
- machine-generated
license:
- apache-2.0
pretty_name: COYO-Labeled-300M
task_categories:
- image-classification
task_ids:
- multi-label-image-classification
- multi-class-image-classification
inference: false
---

# Vision Transformer (large-sized model) 

Vision Transformer (ViT) model pre-trained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M) (300 million images, 21,841 classes) at resolution 224x224. The architecture was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al., where it was pre-trained on JFT-300M. Since JFT-300M is a private dataset, we reproduce that pre-training using the publicly available [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M) dataset.

Thanks to the Hugging Face team for converting the TensorFlow-trained ViT weights so that they can be used with PyTorch, JAX/Flax, and TensorFlow on Hugging Face.

## Model description

The Vision Transformer (ViT) is a transformer model pretrained on a large collection of images in a supervised fashion, namely [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M), at a resolution of 224x224 pixels. 

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence so it can be used for classification tasks, and absolute position embeddings are added before the sequence is fed to the Transformer layers.
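
As a quick illustration of what this means for ViT-L/16 at 224x224 (a back-of-the-envelope calculation, not code from the original repository): the image is split into 14x14 = 196 patches, so the Transformer processes 197 tokens per image including the [CLS] token.

```python
# Illustrative token count for ViT-L/16 at 224x224 (not from the original repository).
image_size = 224                              # input resolution
patch_size = 16                               # each patch is 16x16 pixels
patches_per_side = image_size // patch_size   # 14
num_patches = patches_per_side ** 2           # 196 patch tokens
sequence_length = num_patches + 1             # +1 for the [CLS] token
print(num_patches, sequence_length)           # 196 197
```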

This ViT model is pretrained on COYO-Labeled-300M at resolution 224x224. Please see the details [here](https://github.com/kakaobrain/coyo-vit).

## Intended uses & limitations

You can use the pretrained weights for downstream image classification. Code for reproducing the results is also provided; please see this [github repository](https://github.com/kakaobrain/coyo-vit) for the pretraining and finetuning code.

### How to use

The official usage snippets are still a work in progress. Below is a minimal sketch of how the converted weights could be loaded with the `transformers` library in PyTorch, assuming they are published on the Hugging Face Hub in a `transformers`-compatible ViT format (the model ID in the snippet is a placeholder for this repository's ID):
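```python
# Minimal sketch, assuming a transformers-compatible ViT checkpoint on the Hugging Face Hub.
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

model_id = "<this-repository-id>"  # placeholder: replace with this model's Hub ID

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits                 # shape: (1, num_labels)
predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```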

A JAX/Flax version of the same sketch (same assumptions):
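```python
# Minimal JAX/Flax sketch, under the same assumptions as the PyTorch example above.
import jax.numpy as jnp
from transformers import ViTImageProcessor, FlaxViTForImageClassification
from PIL import Image
import requests

model_id = "<this-repository-id>"  # placeholder: replace with this model's Hub ID

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained(model_id)
model = FlaxViTForImageClassification.from_pretrained(model_id)

inputs = processor(images=image, return_tensors="np")
logits = model(**inputs).logits
predicted_class = int(jnp.argmax(logits, axis=-1)[0])
print(model.config.id2label[predicted_class])
```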

And a TensorFlow version:
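```python
# Minimal TensorFlow sketch, under the same assumptions as the PyTorch example above.
import tensorflow as tf
from transformers import ViTImageProcessor, TFViTForImageClassification
from PIL import Image
import requests

model_id = "<this-repository-id>"  # placeholder: replace with this model's Hub ID

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained(model_id)
model = TFViTForImageClassification.from_pretrained(model_id)

inputs = processor(images=image, return_tensors="tf")
logits = model(**inputs).logits
predicted_class = int(tf.argmax(logits, axis=-1)[0])
print(model.config.id2label[predicted_class])
```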

## Training data

The ViT model was pretrained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M), a dataset consisting of 300 million machine-labeled images covering 21,841 classes.

## Training procedure

### Preprocessing

The exact details of image preprocessing during training/validation can be found [here](https://github.com/kakaobrain/coyo-vit).

Images are inception-cropped to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
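
For intuition, the crop and normalization described above correspond roughly to the following torchvision sketch (the original training code is TensorFlow-based; the exact crop parameters are defined in the repository linked above):

```python
# Rough torchvision equivalent of the described preprocessing (the original pipeline is TensorFlow-based).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),  # Inception-style random crop to 224x224
    transforms.ToTensor(),              # pixel values scaled to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),  # maps values to roughly [-1, 1]
])
```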

### Pretraining

The model was trained on TPUv3 hardware. All model variants were trained with a batch size of 4096 and a learning rate warmup of 10k steps. The pre-training resolution is 224.
For more details, please see [here](https://github.com/kakaobrain/coyo-vit).
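
As an illustration of the warmup mentioned above (a sketch only; the peak learning rate and the post-warmup decay schedule are defined in the repository linked above):

```python
# Sketch of a linear learning-rate warmup over 10k steps (post-warmup decay omitted; see the repo).
def learning_rate(step, peak_lr, warmup_steps=10_000):
    """Ramp the learning rate linearly from 0 to peak_lr over the warmup steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr  # placeholder: the actual schedule decays after warmup

print(learning_rate(5_000, peak_lr=1e-3))  # 0.0005, halfway through warmup
```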

### Evaluation results

| Model    | Upstream Dataset  | Resolution | ImageNet (downstream) | ImageNet-ReaL (downstream) | Public |
|----------|-------------------|------------|-----------------------|----------------------------|--------|
| ViT-L/16 | JFT-300M          | 512        | 87.76                 | 90.54                      | X      |
| ViT-L/16 | COYO-Labeled-300M | 512        | 87.24 (-0.52)         | 90.03 (-0.51)              | O      |
| ViT-L/16 | JFT-300M          | 384        | 87.12                 | 89.99                      | X      |
| ViT-L/16 | COYO-Labeled-300M | 384        | 86.72 (-0.4)          | 89.84 (-0.15)              | O      |

## Citation
```bibtex
@misc{kakaobrain2022coyo-vit,
  title         = {COYO-ViT},
  author        = {Lee, Sungjun and Park, Beomhee},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/coyo-vit}},
}
```
```bibtex
@misc{kakaobrain2022coyo-700m,
  title         = {COYO-700M: Image-Text Pair Dataset},
  author        = {Byeon, Minwoo and Park, Beomhee and Kim, Haecheon and Lee, Sungjun and Baek, Woonhyuk and Kim, Saehoon},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/coyo-dataset}},
}
```
```bibtex
@misc{dosovitskiy2020image,
    title   = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author  = {Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
    year    = {2020},
    eprint  = {2010.11929},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
```

## License
The source code is licensed under the Apache 2.0 License.