nielsr HF staff committed on
Commit
9a8554a
1 Parent(s): f567a07

Add model card

Files changed (1)
  1. README.md +70 -0
README.md ADDED
@@ -0,0 +1,70 @@
+ ---
+ license: apache-2.0
+ tags:
+ datasets:
+ - imagenet-1k
+ ---
+
+ # Vision Transformer (base-sized model, patch size 16) trained using DINO
+
+ Vision Transformer (ViT) model trained using the DINO method. It was introduced in the paper [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.14294) by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski and Armand Joulin, and first released in [this repository](https://github.com/facebookresearch/dino).
+
+ Disclaimer: The team releasing DINO did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+ ## Model description
+
+ The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained in a self-supervised fashion on ImageNet-1k, a large collection of images, at a resolution of 224x224 pixels.
+
+ Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is prepended to the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
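
To make the patch arithmetic above concrete, here is a minimal sketch in plain Python. The function names are hypothetical, and the three-channel (RGB) input is an assumption not stated in this card; the 224x224 resolution and 16x16 patch size are taken from the description.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of tokens the encoder sees: one per patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1  # +1 for the prepended [CLS] token


def flattened_patch_dim(patch_size: int = 16, num_channels: int = 3) -> int:
    """Length of one flattened patch before the linear embedding (RGB assumed)."""
    return patch_size * patch_size * num_channels


print(vit_sequence_length())  # 197
print(flattened_patch_dim())  # 768
```

With these settings, a 224x224 image yields a 14x14 grid of patches, so the encoder processes 196 patch tokens plus one [CLS] token.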
+
+ Note that this model does not include any fine-tuned heads.
+
+ Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places the linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image.
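
The linear-layer setup described above might look roughly like the following PyTorch sketch. `LinearProbe` and `num_labels=10` are hypothetical names and values chosen for illustration, and a random tensor stands in for the encoder output so the example runs without downloading the checkpoint. The hidden size of 768 is the standard value for a base-sized ViT.

```python
import torch
from torch import nn


class LinearProbe(nn.Module):
    """Hypothetical linear classifier over the [CLS] token of an encoder."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 10):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Index 0 along the sequence axis is the [CLS] token.
        cls_embedding = last_hidden_state[:, 0]
        return self.classifier(cls_embedding)


# Stand-in for the encoder output: (batch_size, sequence_length, hidden_size)
dummy_hidden_states = torch.randn(2, 197, 768)
probe = LinearProbe(hidden_size=768, num_labels=10)
logits = probe(dummy_hidden_states)
print(logits.shape)  # torch.Size([2, 10])
```

In practice the dummy tensor would be replaced by the `last_hidden_state` of the pre-trained encoder, and only the linear layer would need to be trained if the encoder is kept frozen.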
+
+ ## Intended uses & limitations
+
+ You can use the raw model to extract image features, for example as the basis for image classification once a classification head has been added. See the [model hub](https://huggingface.co/models?search=google/vit) to look for
+ fine-tuned versions on a task that interests you.
+
+ ### How to use
+
+ Here is how to use this model:
+
+ ```python
+ from transformers import ViTImageProcessor, ViTModel
+ from PIL import Image
+ import requests
+
+ # Download an example image from the COCO validation set
+ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # Load the preprocessing configuration and the pretrained encoder (no head)
+ processor = ViTImageProcessor.from_pretrained('facebook/dino-vitb16')
+ model = ViTModel.from_pretrained('facebook/dino-vitb16')
+
+ # Preprocess the image and run it through the encoder
+ inputs = processor(images=image, return_tensors="pt")
+ outputs = model(**inputs)
+
+ # Shape: (batch_size, sequence_length, hidden_size)
+ last_hidden_states = outputs.last_hidden_state
+ ```
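
As a possible follow-up to the snippet above, the sketch below separates the [CLS] embedding from the per-patch embeddings in a `last_hidden_state` tensor. `split_tokens` is a hypothetical helper, not part of `transformers`, and a random tensor of the expected shape is used in place of real model output. It assumes the [CLS] token comes first and the patch tokens follow in row-major order over a 14x14 grid.

```python
import torch


def split_tokens(last_hidden_state: torch.Tensor, grid_size: int = 14):
    """Separate the [CLS] embedding from the patch embeddings.

    Returns (cls_embedding, patch_grid), where patch_grid has shape
    (batch_size, grid_size, grid_size, hidden_size).
    """
    batch_size, seq_len, hidden_size = last_hidden_state.shape
    if seq_len != grid_size * grid_size + 1:
        raise ValueError("unexpected sequence length for this grid size")
    cls_embedding = last_hidden_state[:, 0]   # first token is [CLS]
    patch_tokens = last_hidden_state[:, 1:]   # remaining tokens are patches
    patch_grid = patch_tokens.reshape(batch_size, grid_size, grid_size, hidden_size)
    return cls_embedding, patch_grid


# Stand-in for outputs.last_hidden_state of a 224x224 input
dummy = torch.randn(1, 197, 768)
cls_embedding, patch_grid = split_tokens(dummy)
print(cls_embedding.shape)  # torch.Size([1, 768])
print(patch_grid.shape)     # torch.Size([1, 14, 14, 768])
```

The [CLS] embedding is the natural input for an image-level classifier, while the patch grid keeps the spatial layout for tasks that need per-region features.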
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @article{DBLP:journals/corr/abs-2104-14294,
+   author        = {Mathilde Caron and
+                    Hugo Touvron and
+                    Ishan Misra and
+                    Herv{\'{e}} J{\'{e}}gou and
+                    Julien Mairal and
+                    Piotr Bojanowski and
+                    Armand Joulin},
+   title         = {Emerging Properties in Self-Supervised Vision Transformers},
+   journal       = {CoRR},
+   volume        = {abs/2104.14294},
+   year          = {2021},
+   url           = {https://arxiv.org/abs/2104.14294},
+   archivePrefix = {arXiv},
+   eprint        = {2104.14294},
+   timestamp     = {Tue, 04 May 2021 15:12:43 +0200},
+   biburl        = {https://dblp.org/rec/journals/corr/abs-2104-14294.bib},
+   bibsource     = {dblp computer science bibliography, https://dblp.org}
+ }
+ ```