---
license: apache-2.0
tags:
datasets:
- kinetics-700-2020
---

# Perceiver IO for multimodal autoencoding

Perceiver IO model trained on [Kinetics-700-2020](https://arxiv.org/abs/2010.10864) for auto-encoding videos that consist of images, audio and a class label. It was introduced in the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Jaegle et al. and first released in [this repository](https://github.com/deepmind/deepmind-research/tree/master/perceiver).

The goal of multimodal autoencoding is to learn a model that can accurately reconstruct multimodal inputs in the presence of a bottleneck induced by the architecture.

Disclaimer: The team releasing Perceiver IO did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

Perceiver IO is a transformer encoder model that can be applied on any modality (text, images, audio, video, ...). The core idea is to employ the self-attention mechanism on a not-too-large set of latent vectors (e.g. 256 or 512), and only use the inputs to perform cross-attention with the latents. This way, the time and memory requirements of the self-attention mechanism don't depend on the size of the inputs.

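As a rough illustration of this latent bottleneck, here is a minimal, self-contained PyTorch sketch (dimensions are made up; this is not the DeepMind or Transformers implementation):

```python
import torch
import torch.nn as nn

class LatentBottleneckBlock(nn.Module):
    """Cross-attend a small set of latents to a (possibly huge) input array,
    then refine the latents with self-attention."""

    def __init__(self, num_latents=256, d_latents=512, d_inputs=704, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_latents))
        # latents are the queries, the inputs are the keys/values
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_latents, kdim=d_inputs, vdim=d_inputs,
            num_heads=num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_latents, num_heads=num_heads, batch_first=True)

    def forward(self, inputs):  # inputs: (batch, M, d_inputs), M can be very large
        latents = self.latents.expand(inputs.shape[0], -1, -1)       # (batch, N, d_latents), N small
        latents, _ = self.cross_attn(latents, inputs, inputs)        # cost scales with N * M
        latents, _ = self.self_attn(latents, latents, latents)       # cost scales with N^2, not with M
        return latents

block = LatentBottleneckBlock()
print(block(torch.randn(1, 10_000, 704)).shape)  # torch.Size([1, 256, 512])
```
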
To decode, the authors employ so-called decoder queries, which make it possible to flexibly decode the final hidden states of the latents into outputs of arbitrary size and semantics. For multimodal autoencoding, the output contains the reconstructions of the 3 modalities: images, audio and the class label.

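In the same spirit as the sketch above, decoding can be pictured as a cross-attention step in which an output query array of arbitrary length attends to the final latents, so the output size is determined entirely by how many queries you supply (again an illustrative sketch with made-up sizes):

```python
import torch
import torch.nn as nn

d_latents, d_query, num_queries = 512, 512, 1024
# In the real model there is one query per output element, i.e. roughly
# 50,176 video patches + 1,920 audio patches + 1 class label (see "Preprocessing" below).
decoder_attn = nn.MultiheadAttention(
    embed_dim=d_query, kdim=d_latents, vdim=d_latents,
    num_heads=8, batch_first=True)

latents = torch.randn(1, 256, d_latents)        # final hidden states of the latents
queries = torch.randn(1, num_queries, d_query)  # decoder queries (position + modality features)
outputs, _ = decoder_attn(queries, latents, latents)
print(outputs.shape)                            # torch.Size([1, 1024, 512])
```
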
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg" alt="drawing" width="600"/>

<small> Perceiver IO architecture.</small>

As the time and memory requirements of the self-attention mechanism don't depend on the size of the inputs, the Perceiver IO authors can train the model by padding the inputs with modality-specific embeddings and serializing all of them into a single 2D input array (i.e. concatenating along the time dimension). Decoding the final hidden states of the latents is done by using queries containing Fourier-based position embeddings (for video and audio) and modality embeddings.

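A rough sketch of this serialization step (channel sizes are made up, and the real model uses learned modality-specific padding embeddings rather than random ones):

```python
import torch

def pad_and_concat(modalities, d_model=704):
    """Pad each modality's channel dimension up to a common size, then
    concatenate all modalities along the index (time) dimension."""
    padded = []
    for tokens in modalities:  # each: (batch, num_elements, num_channels)
        pad = d_model - tokens.shape[-1]
        # stand-in for a learned, modality-specific padding embedding
        modality_embedding = torch.randn(1, 1, pad).expand(tokens.shape[0], tokens.shape[1], pad)
        padded.append(torch.cat([tokens, modality_embedding], dim=-1))
    return torch.cat(padded, dim=1)  # one 2D array per example: (num_elements, d_model)

video = torch.randn(1, 50176, 243)  # 16 frames of 56x56 4x4 patches, plus position features
audio = torch.randn(1, 1920, 16)    # 30,720 raw samples in patches of 16
label = torch.randn(1, 1, 700)      # 700-dimensional one-hot class label
print(pad_and_concat([video, audio, label]).shape)  # torch.Size([1, 52097, 704])
```
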
## Intended uses & limitations

You can use the raw model for multimodal autoencoding. Note that by masking the classification label during evaluation, the auto-encoding model becomes a video classifier.

See the [model hub](https://huggingface.co/models?search=deepmind/perceiver) to look for other versions on a task that may interest you.

### How to use

We refer to the [tutorial notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Perceiver/Perceiver_for_Multimodal_Autoencoding.ipynb) for an example of using the Perceiver for multimodal autoencoding.

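For a quick impression of what the inputs look like, the sketch below follows the pattern used in that notebook and in the documentation of the `PerceiverForMultimodalAutoencoding` class in Transformers; the checkpoint name is assumed, and random tensors stand in for a real preprocessed video. The label input is a tensor of zeros, i.e. the masked label mentioned above. Because the full output array is very large, outputs are decoded in chunks by subsampling the decoder queries:

```python
import numpy as np
import torch
from transformers import PerceiverForMultimodalAutoencoding

# checkpoint name assumed here; adjust to the repository you are loading
model = PerceiverForMultimodalAutoencoding.from_pretrained("deepmind/multimodal-perceiver")

# random tensors standing in for a preprocessed 16-frame video + audio clip
images = torch.randn((1, 16, 3, 224, 224))
audio = torch.randn((1, 30720, 1))
# zeros = masked class label, so the model has to predict it
inputs = dict(image=images, audio=audio, label=torch.zeros((1, 700)))

# the output array is huge, so it is auto-encoded in chunks by subsampling the
# image and audio decoder queries; here we only process the first chunk
nchunks = 128
image_chunk_size = int(np.prod((16, 224, 224)) // nchunks)
audio_chunk_size = audio.shape[1] // model.config.samples_per_patch // nchunks
chunk_idx = 0
subsampling = {
    "image": torch.arange(image_chunk_size * chunk_idx, image_chunk_size * (chunk_idx + 1)),
    "audio": torch.arange(audio_chunk_size * chunk_idx, audio_chunk_size * (chunk_idx + 1)),
    "label": None,
}

outputs = model(inputs=inputs, subsampled_output_points=subsampling)
print(outputs.logits["label"].shape)  # predictions for the (masked) class label
```
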
## Training data

This model was trained on [Kinetics-700-2020](https://arxiv.org/abs/2010.10864), a dataset consisting of videos that belong to one of 700 classes.

## Training procedure

### Preprocessing

The authors train on 16 frames at 224x224 resolution, preprocessed into roughly 50k 4x4 patches, as well as 30k raw audio samples, patched into a total of 1,920 16-dimensional vectors, and one 700-dimensional one-hot representation of the class label.

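The rounded numbers above follow directly from the patch sizes (a quick sanity check, assuming the 16-sample audio patching described in the paper):

```python
frames, height, width, patch_size = 16, 224, 224, 4
video_patches = frames * (height // patch_size) * (width // patch_size)
print(video_patches)  # 50176, i.e. ~50k 4x4 patches

audio_samples, samples_per_patch = 30720, 16
audio_vectors = audio_samples // samples_per_patch
print(audio_vectors)  # 1920 vectors of 16 raw samples each

num_classes = 700     # one 700-dimensional one-hot label per video
```
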
### Pretraining

Hyperparameter details can be found in Appendix F of the [paper](https://arxiv.org/abs/2107.14795).

## Evaluation results

For evaluation results, we refer to table 5 of the [paper](https://arxiv.org/abs/2107.14795).

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2107-14795,
  author     = {Andrew Jaegle and
                Sebastian Borgeaud and
                Jean{-}Baptiste Alayrac and
                Carl Doersch and
                Catalin Ionescu and
                David Ding and
                Skanda Koppula and
                Daniel Zoran and
                Andrew Brock and
                Evan Shelhamer and
                Olivier J. H{\'{e}}naff and
                Matthew M. Botvinick and
                Andrew Zisserman and
                Oriol Vinyals and
                Jo{\~{a}}o Carreira},
  title      = {Perceiver {IO:} {A} General Architecture for Structured Inputs {\&}
                Outputs},
  journal    = {CoRR},
  volume     = {abs/2107.14795},
  year       = {2021},
  url        = {https://arxiv.org/abs/2107.14795},
  eprinttype = {arXiv},
  eprint     = {2107.14795},
  timestamp  = {Tue, 03 Aug 2021 14:53:34 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2107-14795.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```