jun-untitled committed on
Commit 9b8b722
1 Parent(s): ecd6e6e

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -19,13 +19,13 @@ inference: false
 
 # Vision Transformer (large-sized model)
 
-Vision Transformer (ViT) model pre-trained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/coyo-labeled-300m/) (300 million images, 21,841 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. However, since the JFT-300M is a private dataset, we tried to reproduce it using the publicly available [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/coyo-labeled-300m/) dataset.
+Vision Transformer (ViT) model pre-trained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M) (300 million images, 21,841 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. However, since the JFT-300M is a private dataset, we tried to reproduce it using the publicly available [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M) dataset.
 
 Thanks to Hugging Face team for converting weights of ViT trained in Tensorflow to be used on Pytorch, JAX/Flax and Tensorflow in Hugging Face.
 
 ## Model description
 
-The Vision Transformer (ViT) is a transformer model pretrained on a large collection of images in a supervised fashion, namely [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/coyo-labeled-300m/), at a resolution of 224x224 pixels.
+The Vision Transformer (ViT) is a transformer model pretrained on a large collection of images in a supervised fashion, namely [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M), at a resolution of 224x224 pixels.
 
 Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer.
 
@@ -57,7 +57,7 @@ WIP
 
 ## Training data
 
-The ViT model was pretrained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/coyo-labeled-300m/), a dataset consisting of 300 million images and 21k classes.
+The ViT model was pretrained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M), a dataset consisting of 300 million images and 21k classes.
 
 ## Training procedure
 
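
The model description in the diff above explains how an input image becomes a token sequence: 16x16 patches that are linearly embedded, a prepended [CLS] token, and absolute position embeddings. The following is a minimal PyTorch sketch of that step, not code from this repository; the variable names and the convolution-based patch embedding are illustrative assumptions (the 1024-dimensional hidden size is the standard ViT-Large width).

```python
import torch
import torch.nn as nn

# Minimal sketch of the patch-embedding step described in the model card
# (illustrative only, not the repository's actual code).
# A 224x224 image split into 16x16 patches gives (224 // 16) ** 2 = 196 patches;
# with the prepended [CLS] token the Transformer sees 197 embedded tokens.
image_size, patch_size, hidden_dim = 224, 16, 1024
num_patches = (image_size // patch_size) ** 2  # 196

# Linear patch embedding, implemented as a strided convolution as is common for ViT.
patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))  # absolute positions

pixels = torch.randn(1, 3, image_size, image_size)            # one dummy RGB image
tokens = patch_embed(pixels).flatten(2).transpose(1, 2)       # (1, 196, 1024)
tokens = torch.cat([cls_token.expand(1, -1, -1), tokens], 1)  # prepend [CLS] -> (1, 197, 1024)
tokens = tokens + pos_embed                                    # add absolute position embeddings
print(tokens.shape)  # torch.Size([1, 197, 1024])
```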
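
Since the card credits the Hugging Face team with converting the TensorFlow weights for use in PyTorch, JAX/Flax and TensorFlow, a hedged PyTorch usage sketch follows. The repository id is a placeholder because the actual Hub id is not visible on this commit page; `ViTImageProcessor` and `ViTForImageClassification` are the standard `transformers` classes for ViT classification checkpoints.

```python
import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Placeholder repository id -- substitute the actual Hub id of this model.
model_id = "<namespace>/vit-large-patch16-224-coyo-labeled-300m"

processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")  # resizes/normalizes to 224x224
with torch.no_grad():
    logits = model(**inputs).logits                     # one logit per class (21,841 per the card)
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])
```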