jun-untitled committed on
Commit 9b8b722
1 Parent(s): ecd6e6e

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -19,13 +19,13 @@ inference: false
 
 # Vision Transformer (large-sized model)
 
-Vision Transformer (ViT) model pre-trained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/coyo-labeled-300m/) (300 million images, 21,841 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. However, since the JFT-300M is a private dataset, we tried to reproduce it using the publicly available [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/coyo-labeled-300m/) dataset.
+Vision Transformer (ViT) model pre-trained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M) (300 million images, 21,841 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. However, since the JFT-300M is a private dataset, we tried to reproduce it using the publicly available [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M) dataset.
 
 Thanks to Hugging Face team for converting weights of ViT trained in Tensorflow to be used on Pytorch, JAX/Flax and Tensorflow in Hugging Face.
 
 ## Model description
 
-The Vision Transformer (ViT) is a transformer model pretrained on a large collection of images in a supervised fashion, namely [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/coyo-labeled-300m/), at a resolution of 224x224 pixels.
+The Vision Transformer (ViT) is a transformer model pretrained on a large collection of images in a supervised fashion, namely [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M), at a resolution of 224x224 pixels.
 
 Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer.
 
@@ -57,7 +57,7 @@ WIP
 
 ## Training data
 
-The ViT model was pretrained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/coyo-labeled-300m/), a dataset consisting of 300 million images and 21k classes.
+The ViT model was pretrained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M), a dataset consisting of 300 million images and 21k classes.
 
 ## Training procedure
 
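
The model description in the diff above explains how an input image becomes a token sequence: 16x16 patches that are linearly embedded, a prepended [CLS] token, and absolute position embeddings. The following is a minimal PyTorch sketch of that step, not code from this repository; the variable names and the convolution-based patch embedding are illustrative assumptions (the 1024-dimensional hidden size is the standard ViT-Large width).

```python
import torch
import torch.nn as nn

# Minimal sketch of the patch-embedding step described in the model card
# (illustrative only, not the repository's actual code).
# A 224x224 image split into 16x16 patches gives (224 // 16) ** 2 = 196 patches;
# with the prepended [CLS] token the Transformer sees 197 embedded tokens.
image_size, patch_size, hidden_dim = 224, 16, 1024
num_patches = (image_size // patch_size) ** 2  # 196

# Linear patch embedding, implemented as a strided convolution as is common for ViT.
patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))  # absolute positions

pixels = torch.randn(1, 3, image_size, image_size)            # one dummy RGB image
tokens = patch_embed(pixels).flatten(2).transpose(1, 2)       # (1, 196, 1024)
tokens = torch.cat([cls_token.expand(1, -1, -1), tokens], 1)  # prepend [CLS] -> (1, 197, 1024)
tokens = tokens + pos_embed                                    # add absolute position embeddings
print(tokens.shape)  # torch.Size([1, 197, 1024])
```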
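
Since the card credits the Hugging Face team with converting the TensorFlow weights for use in PyTorch, JAX/Flax and TensorFlow, a hedged PyTorch usage sketch follows. The repository id is a placeholder because the actual Hub id is not visible on this commit page; `ViTImageProcessor` and `ViTForImageClassification` are the standard `transformers` classes for ViT classification checkpoints.

```python
import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Placeholder repository id -- substitute the actual Hub id of this model.
model_id = "<namespace>/vit-large-patch16-224-coyo-labeled-300m"

processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")  # resizes/normalizes to 224x224
with torch.no_grad():
    logits = model(**inputs).logits                     # one logit per class (21,841 per the card)
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])
```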