Commit 0ac3561 (0 parents)
Committed by agiera and julien-c (HF staff)

Duplicate from facebook/dino-vitb16


Co-authored-by: Julien Chaumond <julien-c@users.noreply.huggingface.co>

Files changed (6)
  1. .gitattributes +27 -0
  2. README.md +73 -0
  3. config.json +20 -0
  4. preprocessor_config.json +17 -0
  5. pytorch_model.bin +3 -0
  6. tf_model.h5 +3 -0
.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,73 @@
+ ---
+ license: apache-2.0
+ tags:
+ - dino
+ - vision
+ datasets:
+ - imagenet-1k
+ ---
+
+ # Vision Transformer (base-sized model, patch size 16) trained using DINO
+
+ Vision Transformer (ViT) model trained using the DINO method. It was introduced in the paper [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.14294) by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, and first released in [this repository](https://github.com/facebookresearch/dino).
+
+ Disclaimer: The team releasing DINO did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+ ## Model description
+
+ The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained in a self-supervised fashion on a large collection of images, namely ImageNet-1k, at a resolution of 224x224 pixels.
+
+ Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
+
+ Note that this model does not include any fine-tuned heads.
+
+ Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places the linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image.
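+
+ For illustration, here is a minimal sketch of such a linear probe (frozen backbone plus one linear layer); the number of labels and the dummy input are assumptions for the example, not part of the released model:
+
+ ```python
+ import torch
+ from transformers import ViTModel
+
+ # Freeze the pre-trained DINO backbone; only the linear head would be trained.
+ backbone = ViTModel.from_pretrained('facebook/dino-vitb16')
+ for param in backbone.parameters():
+     param.requires_grad = False
+
+ num_labels = 10  # hypothetical number of classes in your dataset
+ classifier = torch.nn.Linear(backbone.config.hidden_size, num_labels)
+
+ pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
+ with torch.no_grad():
+     cls_embedding = backbone(pixel_values).last_hidden_state[:, 0]  # [CLS] token
+ logits = classifier(cls_embedding)
+ ```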
+
+ ## Intended uses & limitations
+
+ You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=facebook/dino) to look for fine-tuned versions on a task that interests you.
+
+ ### How to use
+
+ Here is how to use this model:
+
+ ```python
+ from transformers import ViTImageProcessor, ViTModel
+ from PIL import Image
+ import requests
+
+ # Load a sample image (two cats on a couch) from the COCO dataset
+ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # Load the preprocessor and the pre-trained DINO backbone
+ processor = ViTImageProcessor.from_pretrained('facebook/dino-vitb16')
+ model = ViTModel.from_pretrained('facebook/dino-vitb16')
+
+ # Preprocess the image and extract features
+ inputs = processor(images=image, return_tensors="pt")
+ outputs = model(**inputs)
+ last_hidden_states = outputs.last_hidden_state
+ ```
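+
+ For a 224x224 input and patch size 16, `last_hidden_states` has shape `(batch_size, 197, 768)`: one [CLS] token followed by 14 × 14 = 196 patch tokens.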
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @article{DBLP:journals/corr/abs-2104-14294,
+   author    = {Mathilde Caron and
+                Hugo Touvron and
+                Ishan Misra and
+                Herv{\'{e}} J{\'{e}}gou and
+                Julien Mairal and
+                Piotr Bojanowski and
+                Armand Joulin},
+   title     = {Emerging Properties in Self-Supervised Vision Transformers},
+   journal   = {CoRR},
+   volume    = {abs/2104.14294},
+   year      = {2021},
+   url       = {https://arxiv.org/abs/2104.14294},
+   archivePrefix = {arXiv},
+   eprint    = {2104.14294},
+   timestamp = {Tue, 04 May 2021 15:12:43 +0200},
+   biburl    = {https://dblp.org/rec/journals/corr/abs-2104-14294.bib},
+   bibsource = {dblp computer science bibliography, https://dblp.org}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "architectures": [
+     "ViTModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 768,
+   "image_size": 224,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "model_type": "vit",
+   "num_attention_heads": 12,
+   "num_channels": 3,
+   "num_hidden_layers": 12,
+   "patch_size": 16,
+   "torch_dtype": "float32",
+   "transformers_version": "4.10.0.dev0"
+ }
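
These hyperparameters fully determine the backbone: 224/16 = 14 patches per side gives 196 patch tokens plus the [CLS] token. As a small sketch (the printed fields are just examples), the configuration can be inspected without downloading the weights:

```python
from transformers import ViTConfig

# Load only the configuration (no weights) and inspect it.
config = ViTConfig.from_pretrained('facebook/dino-vitb16')
print(config.hidden_size, config.num_hidden_layers, config.patch_size)
# Sequence length seen by the encoder: 196 patch tokens + 1 [CLS] token
print((config.image_size // config.patch_size) ** 2 + 1)  # 197
```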
preprocessor_config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "do_normalize": true,
+   "do_resize": true,
+   "feature_extractor_type": "ViTFeatureExtractor",
+   "image_mean": [
+     0.485,
+     0.456,
+     0.406
+   ],
+   "image_std": [
+     0.229,
+     0.224,
+     0.225
+   ],
+   "resample": 2,
+   "size": 224
+ }
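
The `image_mean` and `image_std` values are the standard ImageNet normalization statistics, and `"resample": 2` corresponds to PIL bilinear interpolation. A rough equivalent in plain torchvision, as a sketch rather than the exact feature-extractor code path:

```python
from torchvision import transforms

# Approximate the preprocessor config above: resize to 224x224 with
# bilinear interpolation, convert to a tensor, normalize with ImageNet stats.
preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```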
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a064e36c67289caaa5c949c0b3f7f31a0fcbcba5721f5fa12419933ec1f4fe6e
+ size 343268597
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1765bdd93da60ef9f97f927cf10647a467f46f4149975a951ef24298ce3d6231
+ size 345823752
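
Note: pytorch_model.bin and tf_model.h5 are tracked by Git LFS (per the .gitattributes rules above), so the entries here are LFS pointer files recording only the spec version, SHA-256 oid, and byte size; the actual weight files (roughly 343 MB and 346 MB) are fetched on checkout.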