patrickvonplaten committed on
Commit
bb08ce2
1 Parent(s): cc3a0b5

Create README.md

Files changed (1)
  1. README.md +77 -0
README.md ADDED
@@ -0,0 +1,77 @@
+ ---
+ license: apache-2.0
+ tags:
+ - image-classification
+ - vision
+ datasets:
+ - imagenet
+ - imagenet-1k
+ ---
+
+ # Data2Vec-Vision (base-sized model, pre-trained only)
+
+ Data2Vec-Vision model, which uses the BEiT architecture, pre-trained in a self-supervised fashion on ImageNet-1k (1.2 million images, 1000 classes) at resolution 224x224. It was introduced in the paper [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli and first released in [this repository](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
+
+ Disclaimer: The Facebook team releasing this model did not write a model card for it, so this model card has been written by the Hugging Face team.
+
+ ## Pre-Training method
+
+ ![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/data2vec.png)
+
+ For more information, please take a look at the [official paper](https://arxiv.org/abs/2202.03555).
+
+ ## Abstract
+
+ *While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because
+ they were developed with a single modality in
+ mind. To get us closer to general self-supervised
+ learning, we present data2vec, a framework that
+ uses the same learning method for either speech,
+ NLP or computer vision. The core idea is to predict latent representations of the full input data
+ based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific
+ targets such as words, visual tokens or units of
+ human speech which are local in nature, data2vec
+ predicts contextualized latent representations that
+ contain information from the entire input. Experiments on the major benchmarks of speech
+ recognition, image classification, and natural language understanding demonstrate a new state of
+ the art or competitive performance to predominant approaches.*
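
The self-distillation idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the teacher weights are an exponential moving average of the student weights and uses a plain mean squared error on masked positions as a simplified stand-in for the training objective. The function names and the decay value `tau=0.999` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: the function names, the decay value and the
# squared-error loss are assumptions, not taken from the released code.

def ema_update(teacher, student, tau=0.999):
    """Move each teacher weight towards the student: t <- tau * t + (1 - tau) * s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def masked_regression_loss(student_pred, teacher_target, mask):
    """Mean squared error computed over the masked positions only.

    student_pred:   predictions made from the masked view of the input
    teacher_target: latent targets computed from the full (unmasked) input
    mask:           True where the student's input was masked
    """
    errors = [
        (p - t) ** 2
        for p, t, m in zip(student_pred, teacher_target, mask)
        if m
    ]
    return sum(errors) / len(errors)
```

In this sketch the student is trained to reduce `masked_regression_loss`, and the teacher is refreshed with `ema_update` after each student step, so the targets come from a slowly moving copy of the same network.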
+
+ ## Intended uses & limitations
+
+ You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?other=data2vec-vision) to look for
+ fine-tuned versions on a task that interests you.
+
+ ## Training data
+
+ The Data2Vec-Vision model was pretrained on [ImageNet-1k](http://www.image-net.org/), a dataset consisting of 1.2 million images and 1k classes.
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ The exact details of preprocessing of images during training/validation can be found [here](https://github.com/microsoft/unilm/blob/master/beit/datasets.py).
+
+ Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
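
As a rough illustration of the resizing and normalization described above, here is a minimal pure-Python sketch. It assumes 8-bit RGB input stored as nested lists and uses nearest-neighbour sampling as a simple stand-in for the resize step. The interpolation actually used is defined in the linked `datasets.py`, and the helper names here are hypothetical.

```python
# Illustrative sketch only: helper names and the nearest-neighbour resize
# are assumptions; the real pipeline is defined in the linked datasets.py.

MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)
SIZE = 224


def resize_nearest(image, size=SIZE):
    """Resize an H x W nested list of RGB tuples to size x size."""
    height, width = len(image), len(image[0])
    return [
        [image[(y * height) // size][(x * width) // size] for x in range(size)]
        for y in range(size)
    ]


def normalize_pixel(pixel, mean=MEAN, std=STD):
    """Rescale 8-bit values to [0, 1], then apply (value - mean) / std per channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(pixel, mean, std))


def preprocess(image):
    """Resize to 224x224 and normalize every pixel."""
    return [[normalize_pixel(p) for p in row] for row in resize_nearest(image)]
```

With mean and standard deviation both set to 0.5, a pixel value of 0 maps to -1.0 and a value of 255 maps to 1.0, so the normalized channels lie in the range [-1, 1].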
+
+ ### Pretraining
+
+ For all pre-training related hyperparameters, we refer to the [original paper](https://arxiv.org/abs/2106.08254) and the [original codebase](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
+
+ ## Evaluation results
+
+ For evaluation results on several image classification benchmarks, we refer to Table 1 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution. Increasing the model size also generally improves performance.
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{https://doi.org/10.48550/arxiv.2202.03555,
+ doi = {10.48550/ARXIV.2202.03555},
+ url = {https://arxiv.org/abs/2202.03555},
+ author = {Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
+ keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
+ publisher = {arXiv},
+ year = {2022},
+ copyright = {arXiv.org perpetual, non-exclusive license}
+ }
+ ```