patrickvonplaten commited on
Commit
38c2469
1 Parent(s): 46d3f9f

Create README.md

  1. README.md +99 -0
README.md ADDED
@@ -0,0 +1,99 @@
1
+ ---
2
+ license: apache-2.0
3
+ tags:
4
+ - image-classification
5
+ - vision
6
+ datasets:
7
+ - imagenet
8
+ - imagenet-1k
9
+ ---
10
+
11
+ # Data2Vec-Vision (base-sized model, fine-tuned on ImageNet-1k)
12
+
13
+ BEiT model pre-trained in a self-supervised fashion and fine-tuned on ImageNet-1k (1.2 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli, and first released in [this repository](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
14
+
15
+ Disclaimer: The Facebook team releasing this model did not write a model card for it, so this model card has been written by the Hugging Face team.
16
+
17
+ ## Pre-Training method
18
+
19
+ ![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/data2vec.png)
20
+
21
+ For more information, please take a look at the [official paper](https://arxiv.org/abs/2202.03555).
22
+
23
+ ## Abstract
24
+
25
+ *While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because
26
+ they were developed with a single modality in
27
+ mind. To get us closer to general self-supervised
28
+ learning, we present data2vec, a framework that
29
+ uses the same learning method for either speech,
30
+ NLP or computer vision. The core idea is to predict latent representations of the full input data
31
+ based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific
32
+ targets such as words, visual tokens or units of
33
+ human speech which are local in nature, data2vec
34
+ predicts contextualized latent representations that
35
+ contain information from the entire input. Experiments on the major benchmarks of speech
36
+ recognition, image classification, and natural language understanding demonstrate a new state of
37
+ the art or competitive performance to predominant approaches.*
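The self-distillation idea in the abstract can be illustrated with a toy sketch. It assumes, as the data2vec paper describes, that the teacher's weights track the student's through an exponential moving average and that the student regresses the teacher's targets at masked positions. The function names, the `decay` value, the use of plain squared error, and the scalar "representations" are all illustrative assumptions, not the released implementation.

```python
def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


def masked_regression_loss(student_preds, teacher_targets, mask):
    """Average squared error, counted only where the student's input was masked."""
    errors = [(p - t) ** 2
              for p, t, m in zip(student_preds, teacher_targets, mask) if m]
    return sum(errors) / len(errors)


# Toy values: the teacher sees the full input, the student sees a masked view.
teacher_targets = [0.2, 0.8, -0.1, 0.5]   # hypothetical latent targets
student_preds = [0.1, 0.6, -0.1, 0.9]     # hypothetical student outputs
mask = [True, False, False, True]         # positions hidden from the student

loss = masked_regression_loss(student_preds, teacher_targets, mask)
teacher = ema_update([1.0, 2.0], [0.0, 4.0], decay=0.5)
print(loss, teacher)
```

In a real training loop the loss would drive gradient updates of the student only, while the teacher is refreshed through `ema_update` after each step.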
38
+
39
+ ## Intended uses & limitations
40
+
41
+ You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=data2vec-vision) to look for
42
+ fine-tuned versions on a task that interests you.
43
+
44
+ ### How to use
45
+
46
+ Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
47
+
48
+ ```python
49
+ from transformers import BeitFeatureExtractor, Data2VecVisionForImageClassification
50
+ from PIL import Image
51
+ import requests
52
+ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
53
+ image = Image.open(requests.get(url, stream=True).raw)
54
+ feature_extractor = BeitFeatureExtractor.from_pretrained('facebook/data2vec-vision-base-ft1k')
55
+ model = Data2VecVisionForImageClassification.from_pretrained('facebook/data2vec-vision-base-ft1k')
56
+ inputs = feature_extractor(images=image, return_tensors="pt")
57
+ outputs = model(**inputs)
58
+ logits = outputs.logits
59
+ # model predicts one of the 1000 ImageNet classes
60
+ predicted_class_idx = logits.argmax(-1).item()
61
+ print("Predicted class:", model.config.id2label[predicted_class_idx])
62
+ ```
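The snippet above reports only the single most likely class. To inspect several candidates with their probabilities, the logits can be passed through a softmax and ranked. The following is a minimal sketch in plain Python; the helper names `softmax` and `top_k` and the four-class example values are hypothetical, and with the real model the list would come from something like `outputs.logits[0].tolist()` together with `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical 4-class example standing in for the 1,000 ImageNet logits.
example_logits = [2.0, 0.5, -1.0, 3.0]
example_labels = {0: "tabby", 1: "tiger cat", 2: "remote control", 3: "Egyptian cat"}

for label, prob in top_k(example_logits, example_labels, k=2):
    print(f"{label}: {prob:.3f}")
```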
63
+
64
+ Currently, both the feature extractor and model support PyTorch.
65
+
66
+ ## Training data
67
+
68
+ The BEiT model was pretrained and fine-tuned on [ImageNet-1k](http://www.image-net.org/), a dataset consisting of 1.2 million images and 1,000 classes.
69
+
70
+ ## Training procedure
71
+
72
+ ### Preprocessing
73
+
74
+ The exact details of preprocessing of images during training/validation can be found [here](https://github.com/microsoft/unilm/blob/master/beit/datasets.py).
75
+
76
+ Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
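As a rough illustration of the normalization step, the sketch below applies the stated mean and standard deviation to a single RGB pixel. It assumes pixel values are first rescaled from the 8-bit range [0, 255] to [0, 1], which is not spelled out above; the function name `normalize_pixel` is hypothetical, and resizing is omitted.

```python
MEAN = (0.5, 0.5, 0.5)  # per-channel mean from the model card
STD = (0.5, 0.5, 0.5)   # per-channel standard deviation from the model card


def normalize_pixel(rgb):
    """Rescale one 8-bit RGB pixel to [0, 1], then normalize each channel."""
    scaled = [channel / 255.0 for channel in rgb]  # assumed rescaling step
    return [(value - mean) / std for value, mean, std in zip(scaled, MEAN, STD)]


# Black maps to the lower end of the range and white to the upper end.
print(normalize_pixel((0, 0, 0)))
print(normalize_pixel((255, 255, 255)))
```

With these values, normalized channels fall in the range [-1, 1].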
77
+
78
+ ### Pretraining
79
+
80
+ For all pre-training related hyperparameters, we refer to the [original paper](https://arxiv.org/abs/2202.03555) and the [original codebase](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
81
+
82
+ ## Evaluation results
83
+
84
+ For evaluation results on several image classification benchmarks, we refer to Table 1 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution. Larger model sizes also generally yield better performance.
85
+
86
+ ### BibTeX entry and citation info
87
+
88
+ ```bibtex
89
+ @misc{https://doi.org/10.48550/arxiv.2202.03555,
90
+ doi = {10.48550/ARXIV.2202.03555},
91
+ url = {https://arxiv.org/abs/2202.03555},
92
+ author = {Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
93
+ keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences},
94
+ title = {data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
95
+ publisher = {arXiv},
96
+ year = {2022},
97
+ copyright = {arXiv.org perpetual, non-exclusive license}
98
+ }
99
+ ```