nielsr HF staff committed on
Commit
09c08e9
·
1 Parent(s): 20795b6

Create README.md

  1. README.md +106 -0
README.md ADDED
@@ -0,0 +1,106 @@
---
license: apache-2.0
tags:
datasets:
- imagenet
---

# Perceiver IO for vision (convolutional processing)

Perceiver IO model pre-trained on ImageNet-1k (about 1.28 million training images, 1,000 classes) at resolution 224x224. It was introduced in the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Jaegle et al. and first released in [this repository](https://github.com/deepmind/deepmind-research/tree/master/perceiver).

Disclaimer: The team releasing Perceiver IO did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

Perceiver IO is a transformer encoder model that can be applied to any modality (text, images, audio, video, ...). The core idea is to apply the self-attention mechanism to a modest number of latent vectors (e.g. 256 or 512) and to use the inputs only for cross-attention with those latents. As a result, the time and memory requirements of the self-attention mechanism do not depend on the size of the inputs.
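
To make the scaling argument concrete, the sketch below counts attention-score entries for a Perceiver-style encoder and for full self-attention over the inputs. The function names, the 512 latents, the 8 layers, and the 224x224 input are illustrative assumptions rather than values read from the released configuration, and the count ignores heads, channel widths, and constant factors.

```python
def perceiver_attention_scores(num_inputs: int, num_latents: int, num_self_attn_layers: int) -> int:
    """Count attention-score entries for one cross-attention (latents x inputs)
    followed by `num_self_attn_layers` self-attention layers (latents x latents)."""
    cross = num_latents * num_inputs
    latent_self = num_self_attn_layers * num_latents * num_latents
    return cross + latent_self


def full_self_attention_scores(num_inputs: int, num_layers: int) -> int:
    """Count attention-score entries when every layer attends inputs x inputs."""
    return num_layers * num_inputs * num_inputs


# Hypothetical sizes, chosen for illustration only.
num_inputs = 224 * 224   # one input element per raw pixel
num_latents = 512
num_layers = 8

perceiver = perceiver_attention_scores(num_inputs, num_latents, num_layers)
dense = full_self_attention_scores(num_inputs, num_layers)
print(f"Perceiver-style:     {perceiver:,} score entries")
print(f"Full self-attention: {dense:,} score entries")
```

Only the single cross-attention term grows with the input size, and it grows linearly; the latent self-attention term stays fixed no matter how large the input is.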

To decode, the authors employ so-called decoder queries, which make it possible to flexibly decode the final hidden states of the latents into outputs of arbitrary size and semantics. For image classification, the output is a tensor of logits with shape (batch_size, num_labels).
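
The following is a minimal sketch of why decoder queries make the output size flexible: in cross-attention, the number of output vectors equals the number of queries, not the number of latents. It is a toy, single-head attention in pure Python with no learned projections (which the real decoder would have), and the vectors and the name `cross_attention` are made up for illustration.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: list of Q vectors of width d
    keys:    list of N vectors of width d
    values:  list of N vectors of any common width
    Returns one output vector per query.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the keys.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors.
        width = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return outputs


# Toy stand-ins for the final latent hidden states: 4 latents of width 3.
latents = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5], [-0.2, 0.3, 0.1]]

# One decoder query yields one output vector (the classification-like case).
single = cross_attention([[1.0, 0.0, 0.0]], latents, latents)
# Six decoder queries yield six output vectors from the same latents.
many = cross_attention([[float(i), 1.0, 0.0] for i in range(6)], latents, latents)
print(len(single), len(many))
```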

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg" alt="drawing" width="600"/>

<small> Perceiver IO architecture.</small>

Because the time and memory requirements of the self-attention mechanism don't depend on the size of the inputs, the Perceiver IO authors can train the model directly on raw pixel values, rather than on patches as is done in ViT. This particular model applies a simple 2D conv+maxpool preprocessing network to the pixel values before using the result as inputs for cross-attention with the latents.

Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by replacing the classification decoder.

## Intended uses & limitations

You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=deepmind/perceiver) for other fine-tuned versions on a task that interests you.

### How to use

Here is how to use this model in PyTorch:

```python
from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationConvProcessing
import requests
import torch
from PIL import Image

# load the preprocessor and the pre-trained classification model
feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-conv")
model = PerceiverForImageClassificationConvProcessing.from_pretrained("deepmind/vision-perceiver-conv")

# download an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# prepare input
inputs = feature_extractor(image, return_tensors="pt").pixel_values
# forward pass (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(inputs)
logits = outputs.logits
print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
# should print: Predicted class: tabby, tabby cat
```
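
Beyond the single best class, it can be useful to look at the few most probable labels. The helper below is a small pure-Python sketch of that step; `top_k_predictions`, the toy logits, and the placeholder label names are hypothetical and are not outputs of this model.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs for one image."""
    # Numerically stable softmax over the class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Made-up logits and labels, for illustration only.
toy_logits = [2.0, 0.5, -1.0, 3.5]
toy_id2label = {0: "class_a", 1: "class_b", 2: "class_c", 3: "class_d"}
for label, prob in top_k_predictions(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the snippet above, one could pass `logits[0].tolist()` and `model.config.id2label` to this helper.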

## Training data

This model was pretrained on [ImageNet](http://www.image-net.org/), specifically ImageNet-1k, a dataset consisting of about 1.28 million training images and 1,000 classes.

## Training procedure

### Preprocessing

Images are center-cropped, resized to a resolution of 224x224, and normalized across the RGB channels. Note that data augmentation was used during pre-training, as explained in Appendix H of the [paper](https://arxiv.org/abs/2107.14795).
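
As a rough illustration of the crop and normalization steps, the sketch below computes a generic centered square crop box and standardizes a single RGB pixel. The helper names are hypothetical, the mean and standard deviation are the commonly used ImageNet statistics and are assumed here rather than taken from this card, and the exact cropping rule used by the feature extractor may differ.

```python
def center_crop_box(width: int, height: int) -> tuple:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def normalize_pixel(rgb, mean, std):
    """Scale 0-255 RGB values to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


# Assumed statistics: the commonly used ImageNet channel means and standard deviations.
ASSUMED_MEAN = (0.485, 0.456, 0.406)
ASSUMED_STD = (0.229, 0.224, 0.225)

box = center_crop_box(640, 480)  # a 640x480 landscape image
print(box)
print(normalize_pixel((255, 128, 0), ASSUMED_MEAN, ASSUMED_STD))
```

With Pillow, such a box could be applied as `image.crop(box).resize((224, 224))` before normalization.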

### Pretraining

Hyperparameter details can be found in Appendix H of the [paper](https://arxiv.org/abs/2107.14795).

## Evaluation results

This model achieves a top-1 accuracy of 82.1% on ImageNet-1k.

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2107-14795,
  author    = {Andrew Jaegle and
               Sebastian Borgeaud and
               Jean{-}Baptiste Alayrac and
               Carl Doersch and
               Catalin Ionescu and
               David Ding and
               Skanda Koppula and
               Daniel Zoran and
               Andrew Brock and
               Evan Shelhamer and
               Olivier J. H{\'{e}}naff and
               Matthew M. Botvinick and
               Andrew Zisserman and
               Oriol Vinyals and
               Jo{\~{a}}o Carreira},
  title     = {Perceiver {IO:} {A} General Architecture for Structured Inputs {\&}
               Outputs},
  journal   = {CoRR},
  volume    = {abs/2107.14795},
  year      = {2021},
  url       = {https://arxiv.org/abs/2107.14795},
  eprinttype = {arXiv},
  eprint    = {2107.14795},
  timestamp = {Tue, 03 Aug 2021 14:53:34 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2107-14795.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```