---
license: apache-2.0
tags:
datasets:
- autoflow
---

# Perceiver IO for optical flow

Perceiver IO model trained on [AutoFlow](https://autoflow-google.github.io/). It was introduced in the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Jaegle et al. and first released in [this repository](https://github.com/deepmind/deepmind-research/tree/master/perceiver).

Optical flow is a decades-old open problem in computer vision. Given two images of the same scene (e.g. two consecutive frames of a video), the task is to estimate the 2D displacement of each pixel in the first image. This has many broader applications, such as navigation and visual odometry in robots, estimation of 3D geometry, and even aiding the transfer of more complex, learned inference, such as 3D human pose estimation, from synthetic to real images.

Disclaimer: The team releasing Perceiver IO did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

Perceiver IO is a transformer encoder model that can be applied to any modality (text, images, audio, video, ...). The core idea is to employ the self-attention mechanism on a relatively small set of latent vectors (e.g. 256 or 512), and to use the inputs only for cross-attention with the latents. This way, the time and memory requirements of the self-attention mechanism do not depend on the size of the inputs.

To decode, the authors employ so-called decoder queries, which allow the model to flexibly decode the final hidden states of the latents into outputs of arbitrary size and semantics. For optical flow, the output is a tensor containing the predicted flow, of shape (batch_size, height, width, 2).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg" alt="drawing" width="600"/>

<small> Perceiver IO architecture.</small>

As the time and memory requirements of the self-attention mechanism don't depend on the size of the inputs, the Perceiver IO authors can train the model directly on raw pixel values, by concatenating a pair of images along the channel dimension.
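
To make the latent bottleneck concrete, the snippet below sketches the encode / process / decode pattern in PyTorch. All sizes and module choices are illustrative assumptions for readability, not the actual Perceiver IO implementation:

```python
import torch
import torch.nn as nn

# illustrative sizes (assumptions, not the real hyperparameters)
batch, num_inputs, num_latents, dim = 1, 10_000, 256, 64

inputs = torch.randn(batch, num_inputs, dim)    # e.g. flattened per-pixel features
latents = torch.randn(batch, num_latents, dim)  # a learned, fixed-size array in the real model

encode = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
process = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
decode = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

# encode: the latents cross-attend to the inputs, cost O(num_latents * num_inputs)
z, _ = encode(latents, inputs, inputs)
# process: self-attention over the latents only, cost O(num_latents^2),
# independent of the input size
z, _ = process(z, z, z)
# decode: one query per desired output element (e.g. one per pixel for optical flow)
queries = torch.randn(batch, num_inputs, dim)
outputs, _ = decode(queries, z, z)

print(outputs.shape)  # torch.Size([1, 10000, 64])
```

In the real model, the process stage consists of several full transformer blocks over the latents; the point of the sketch is only that the quadratic self-attention cost is paid over the latents, not over the inputs.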

The model obtains state-of-the-art results on important optical flow benchmarks, including [Sintel](http://sintel.is.tue.mpg.de/) and [KITTI](http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=flow).

## Intended uses & limitations

You can use the raw model to predict optical flow between a pair of images. See the [model hub](https://huggingface.co/models?search=deepmind/perceiver) to look for other versions on a task that may interest you.

### How to use

For an end-to-end example of using the Perceiver for optical flow, including the full preprocessing, we refer to the [tutorial notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Perceiver/Perceiver_for_Optical_Flow.ipynb). A minimal inference sketch is shown below.
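
The following snippet runs the model on randomly generated patches, just to illustrate the expected input and output shapes (a sketch, assuming this checkpoint is published as `deepmind/optical-flow-perceiver`; see the notebook above for real preprocessing):

```python
import torch
from transformers import PerceiverForOpticalFlow

# checkpoint name is an assumption; replace with this repository's id if it differs
model = PerceiverForOpticalFlow.from_pretrained("deepmind/optical-flow-perceiver")

# dummy input: a pair of frames at the 368x496 training resolution, where each pixel
# is represented by a 3x3 patch of RGB values (3 x 3 x 3 = 27 values per pixel per frame)
patches = torch.randn(1, 2, 27, 368, 496)

with torch.no_grad():
    outputs = model(inputs=patches)

# predicted flow: one 2D displacement vector per pixel of the first frame
print(outputs.logits.shape)  # torch.Size([1, 368, 496, 2])
```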

## Training data

This model was trained on [AutoFlow](https://autoflow-google.github.io/), a synthetic dataset consisting of 400,000 annotated image pairs.

## Training procedure

### Preprocessing

Frames are resized to a resolution of 368x496. The authors concatenate the frames along the channel dimension and extract a 3x3 patch around each pixel (leading to 3x3x3x2 = 54 values for each pixel).
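
A minimal sketch of this patch extraction in PyTorch (the function name and tensor layout are illustrative assumptions; the tutorial notebook linked above contains the actual preprocessing):

```python
import torch
import torch.nn.functional as F

def extract_patches(frames: torch.Tensor) -> torch.Tensor:
    """Extract a 3x3 patch around each pixel of each frame.

    frames: (batch_size, 2, 3, 368, 496), a pair of RGB frames, already resized.
    Returns: (batch_size, 2, 27, 368, 496), i.e. 3 x 3 x 3 = 27 values per pixel per frame.
    """
    b, t, c, h, w = frames.shape
    flat = frames.view(b * t, c, h, w)
    # unfold with padding=1 gives every pixel its full 3x3 neighborhood
    patches = F.unfold(flat, kernel_size=3, padding=1)  # (b*t, c*9, h*w)
    return patches.view(b, t, c * 9, h, w)

# the two frames together contribute 3x3x3x2 = 54 values per pixel
patches = extract_patches(torch.randn(1, 2, 3, 368, 496))
print(patches.shape)  # torch.Size([1, 2, 27, 368, 496])
```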

### Pretraining

Hyperparameter details can be found in Appendix E of the [paper](https://arxiv.org/abs/2107.14795).

## Evaluation results

The model achieves an average end-point error (EPE) of 1.81 on Sintel.clean, 2.42 on Sintel.final and 4.98 on KITTI. For full evaluation results, we refer to Table 4 of the [paper](https://arxiv.org/abs/2107.14795).

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2107-14795,
  author     = {Andrew Jaegle and
                Sebastian Borgeaud and
                Jean{-}Baptiste Alayrac and
                Carl Doersch and
                Catalin Ionescu and
                David Ding and
                Skanda Koppula and
                Daniel Zoran and
                Andrew Brock and
                Evan Shelhamer and
                Olivier J. H{\'{e}}naff and
                Matthew M. Botvinick and
                Andrew Zisserman and
                Oriol Vinyals and
                Jo{\~{a}}o Carreira},
  title      = {Perceiver {IO:} {A} General Architecture for Structured Inputs {\&}
                Outputs},
  journal    = {CoRR},
  volume     = {abs/2107.14795},
  year       = {2021},
  url        = {https://arxiv.org/abs/2107.14795},
  eprinttype = {arXiv},
  eprint     = {2107.14795},
  timestamp  = {Tue, 03 Aug 2021 14:53:34 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2107-14795.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```