+---
+license: apache-2.0
+tags:
+- image-classification
+- vision
+datasets:
+- imagenet
+- imagenet-1k
+---
+# Data2Vec-Vision (base-sized model, fine-tuned on ImageNet-1k)
+Data2Vec-Vision model, based on the BEiT architecture, pre-trained in a self-supervised fashion and fine-tuned on ImageNet-1k (1.2 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli and first released in [this repository](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
+Disclaimer: The Facebook team releasing this model did not write a model card for it, so this model card has been written by the Hugging Face team.
+## Pre-Training method
+![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/data2vec.png)
+For more information, please take a look at the [official paper](https://arxiv.org/abs/2202.03555).
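As a rough illustration of the self-distillation idea in the paper, a student network sees a masked view of the input and regresses onto latent targets produced by a teacher whose weights are an exponential moving average (EMA) of the student's. The sketch below uses plain Python scalars in place of network weights and features. The function names, the decay value `tau`, and the use of a mean squared error are assumptions made for illustration; the paper's actual loss and target construction differ in detail.

```python
# Toy illustration only: scalars stand in for network weights and features.

def ema_update(teacher, student, tau):
    """Move each teacher weight toward the matching student weight."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def masked_regression_loss(student_pred, teacher_target, mask):
    """Mean squared error computed only at the masked positions."""
    errors = [
        (p - y) ** 2
        for p, y, m in zip(student_pred, teacher_target, mask)
        if m
    ]
    return sum(errors) / len(errors)


# Hypothetical weights; tau is an assumed decay value, not taken from the paper.
teacher_weights = [0.0, 1.0]
student_weights = [1.0, 3.0]
tau = 0.9
teacher_weights = ema_update(teacher_weights, student_weights, tau)

# The student only has to match the teacher where the input was masked.
mask = [True, False, True, False]
student_pred = [0.5, 0.0, 1.0, 0.0]
teacher_target = [1.0, 0.2, 1.0, 0.7]
loss = masked_regression_loss(student_pred, teacher_target, mask)
```

Because the teacher is never updated by gradients, it changes slowly and provides stable targets while the student learns.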
+## Abstract
+*While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.*
+## Intended uses & limitations
+You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=data2vec-vision) to look for
+fine-tuned versions on a task that interests you.
+### How to use
+Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
+```python
+from transformers import BeitFeatureExtractor, Data2VecVisionForImageClassification
+from PIL import Image
+import requests
+
+# Download a sample image from the COCO 2017 validation set
+url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+image = Image.open(requests.get(url, stream=True).raw)
+
+# Load the preprocessing pipeline and the fine-tuned classifier
+feature_extractor = BeitFeatureExtractor.from_pretrained('facebook/data2vec-vision-base-ft1k')
+model = Data2VecVisionForImageClassification.from_pretrained('facebook/data2vec-vision-base-ft1k')
+
+# Resize and normalize the image, returning PyTorch tensors
+inputs = feature_extractor(images=image, return_tensors="pt")
+outputs = model(**inputs)
+logits = outputs.logits
+
+# The model predicts one of the 1,000 ImageNet classes
+predicted_class_idx = logits.argmax(-1).item()
+print("Predicted class:", model.config.id2label[predicted_class_idx])
+```
+Currently, both the feature extractor and model support PyTorch.
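The snippet above reports only the single most likely class. If a ranked list with probabilities is more useful, the logits can be passed through a softmax and sorted. The sketch below shows that step in plain Python on a made-up logit vector; `toy_logits`, `toy_id2label`, and the helper names are hypothetical stand-ins for `logits` and `model.config.id2label`, not values produced by the model.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels, for illustration only.
toy_logits = [2.0, 0.5, -1.0, 3.0]
toy_id2label = {0: "tabby", 1: "remote control", 2: "sofa", 3: "Egyptian cat"}

for label, prob in top_k(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With real model output, the same ranking could be applied to `logits[0].tolist()` and `model.config.id2label`.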
+## Training data
+The model was pretrained and fine-tuned on [ImageNet-1k](http://www.image-net.org/), a dataset consisting of 1.2 million images and 1,000 classes.
+## Training procedure
+### Preprocessing
+The exact details of preprocessing of images during training/validation can be found [here](https://github.com/microsoft/unilm/blob/master/beit/datasets.py).
+Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
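To make the normalization concrete, here is a minimal sketch of the per-channel arithmetic, assuming 8-bit pixel values are first rescaled to the range [0, 1] by dividing by 255 (the rescale factor is an assumption; the mean and standard deviation are the values stated above). The helper name `normalize_pixel` is hypothetical.

```python
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb):
    """Rescale an 8-bit RGB pixel to [0, 1], then normalize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


# With mean 0.5 and std 0.5, the range [0, 255] maps onto [-1, 1].
black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
```

Under these assumptions, every channel of the model input lies between -1 and 1.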
+### Pretraining
+For all pre-training related hyperparameters, we refer to the [original paper](https://arxiv.org/abs/2202.03555) and the [original codebase](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
+## Evaluation results
+For evaluation results on several image classification benchmarks, we refer to Table 1 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution. Increasing the model size generally results in better performance.
+### BibTeX entry and citation info
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.2202.03555,
+  doi = {10.48550/ARXIV.2202.03555},
+  url = {https://arxiv.org/abs/2202.03555},
+  author = {Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
+  keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences},
+  title = {data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```