ariG23498 (HF Staff) committed

Commit 5921386 · Parent(s): aa5a45a

Update README.md

Files changed (1):
  1. README.md +4 -78

README.md CHANGED
@@ -9,13 +9,11 @@ pinned: false

| DINO | Dog |
| :--: | :--: |
- | ![](https://i.imgur.com/kAShjbs.gif) | ![](https://i.imgur.com/xrMkCbg.gif) |
-
-
- [![TensorFlow 2.8](https://img.shields.io/badge/TensorFlow-2.8-FF6F00?logo=tensorflow)](https://github.com/tensorflow/tensorflow/releases/tag/v2.8.0)
+ | ![dino](https://i.imgur.com/kAShjbs.gif) | ![dog](https://i.imgur.com/xrMkCbg.gif) |

_By [Aritra Roy Gosthipaty](https://github.com/ariG23498) and [Sayak Paul](https://github.com/sayakpaul) (equal contribution)_

+ <br>
We probe into the representations learned by different families of Vision Transformers (supervised pre-training with ImageNet-21k, ImageNet-1k, distillation, self-supervised pre-training):

* Original ViT [1]
@@ -24,76 +22,7 @@ We probe into the representations learned by different families of Vision Transf

We hope these tools will prove to be useful for the community. Please follow along with [this post on keras.io](https://keras.io/examples/vision/probing_vits/) for better navigation through the repository.

-
- ## Self-attention visualization
-
- | Original Image | Attention Maps | Attention Maps Overlaid |
- | :--: | :--: | :--: |
- | ![original image](./assets/bird.png) | ![attention maps](./assets/dino_attention_heads_inferno.png) | ![attention maps overlay](./assets/dino_attention_heads.png) |
-
- https://user-images.githubusercontent.com/36856589/162609884-8e51156e-d461-421d-9f8a-4d4e48967bd6.mp4
-
- <small><a href=https://www.pexels.com/video/a-computer-generated-walking-dinosaur-4096297/>Original Video Source</a></small>
-
- https://user-images.githubusercontent.com/36856589/162609907-4e432dc4-a731-40f4-9a20-94e0c8f648bc.mp4
-
- <small><a href=https://www.pexels.com/video/a-dog-running-in-a-grass-field-4166343/>Original Video Source</a></small>
-
- ## Supervised salient representations
-
- In the [DINO](https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/) blog post, the authors show a video with the following caption:
-
- > The original video is shown on the left. In the middle is a segmentation example generated by a supervised model, and on the right is one generated by DINO.
-
- A screenshot of the video is as follows:
-
- <img width="764" alt="image" src="https://user-images.githubusercontent.com/36856589/162615199-b5133e51-460e-4864-a83e-5b8007339ff7.png"><br>
-
-
- We obtain the attention maps generated with the supervised pre-trained model and find that they are not as salient as those from the DINO model. We observe similar behaviour in our experiments as well. The figure below shows the attention heatmaps extracted with
- a ViT-B16 model pre-trained (supervised) using ImageNet-1k:
-
- | Dinosaur | Dog |
- | :--: | :--: |
- | ![](./assets/supervised-dino.gif) | ![](./assets/supervised-dog.gif) |
-
- We used this [Colab Notebook](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/vitb16-attention-maps-video.ipynb) to conduct this experiment.
-
- ## Hugging Face Spaces
-
- You can now probe into the ViTs with your own input images.
-
- | Attention Heat Maps | Attention Rollout |
- | :--: | :--: |
- | [![Generic badge](https://img.shields.io/badge/🤗%20Spaces-Attention%20Heat%20Maps-black.svg)](https://huggingface.co/spaces/probing-vits/attention-heat-maps) | [![Generic badge](https://img.shields.io/badge/🤗%20Spaces-Attention%20Rollout-black.svg)](https://huggingface.co/spaces/probing-vits/attention-rollout) |
-
- ## Visualizing mean attention distances
-
- <div align="center">
- <img src="./assets/vit_base_i21k_patch16_224.png" width=450/>
- </div>
-
- ## Methods
-
- **We don't propose any novel methods of probing the representations of neural networks. Instead, we take existing works and implement them in TensorFlow.**
-
- * Mean attention distance [1, 4]
- * Attention Rollout [5]
- * Visualization of the learned projection filters [1]
- * Visualization of the learned positional embeddings
- * Attention maps from individual attention heads [3]
- * Generation of attention heatmaps from videos [3]
-
- Another interesting repository that also visualizes ViTs in PyTorch: https://github.com/jacobgil/vit-explain.
-
-
- ## Notes
-
- We first implemented the above-mentioned architectures in TensorFlow and then populated the pre-trained parameters into them using the official codebases. To validate this, we evaluated the implementations on the ImageNet-1k validation set and ensured that the reported top-1 accuracies matched.
-
- We value the spirit of open-source. So, if you spot any bugs in the code or see scope for improvement, don't hesitate to open an issue or contribute a PR. We'd very much appreciate it.
-
-
+ <br>
## References

[1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: [https://arxiv.org/abs/2010.11929](https://arxiv.org/abs/2010.11929)
@@ -102,10 +31,7 @@ We value the spirit of open-source. So, if you spot any bugs in the code or see

[3] DINO: https://arxiv.org/abs/2104.14294

- [4] Do Vision Transformers See Like Convolutional Neural Networks?: [https://arxiv.org/abs/2108.08810](https://arxiv.org/abs/2108.08810)
-
- [5] Quantifying Attention Flow in Transformers: [https://arxiv.org/abs/2005.00928](https://arxiv.org/abs/2005.00928)
-
+ <br>
## Acknowledgements

- [PyImageSearch](https://pyimagesearch.com)
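The removed "Self-attention visualization" section overlays per-head attention maps of the CLS token on the input image, in the style of DINO [3]. A minimal NumPy sketch of that extraction, assuming the final transformer block's attention weights are available as an array of shape `(num_heads, seq_len, seq_len)` with the CLS token at index 0 (the function and argument names here are illustrative, not the repository's API):

```python
import numpy as np

def cls_attention_maps(last_attention, image_size=224, patch_size=16):
    """Per-head attention maps of the CLS token (assumed input, DINO-style).

    `last_attention` is assumed to hold the attention weights of the final
    transformer block with shape (num_heads, seq_len, seq_len), where the
    CLS token sits at index 0 of the sequence axis.
    """
    num_heads = last_attention.shape[0]
    grid = image_size // patch_size  # e.g. 224 // 16 = 14
    # Attention paid by the CLS token to every patch token, for each head.
    cls_to_patches = last_attention[:, 0, 1:]           # (num_heads, grid * grid)
    maps = cls_to_patches.reshape(num_heads, grid, grid)
    # Nearest-neighbour upsample to image resolution for overlaying.
    maps = maps.repeat(patch_size, axis=1).repeat(patch_size, axis=2)
    return maps  # (num_heads, image_size, image_size)
```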
 
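The removed "Visualizing mean attention distances" plot summarizes, per attention head, how far (in pixels) each query token attends on average [1, 4]. A small sketch of that computation for a single head, assuming a softmaxed attention matrix over patch tokens only (CLS removed) and a hypothetical `patch_size` argument; this is a sketch, not the repository's exact implementation:

```python
import numpy as np

def mean_attention_distance(attention, patch_size=16):
    """Attention-weighted mean distance (in pixels) for one head.

    `attention` is assumed to be the softmaxed attention matrix over the
    patch tokens only, shape (num_patches, num_patches), rows summing to 1.
    """
    num_patches = attention.shape[0]
    grid = int(np.sqrt(num_patches))
    # (row, col) position of every patch token on the 2D patch grid.
    coords = np.stack(
        np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"), axis=-1
    ).reshape(-1, 2)
    # Pairwise Euclidean distances between patch positions, converted to pixels.
    distances = np.linalg.norm(
        coords[:, None, :] - coords[None, :, :], axis=-1
    ) * patch_size
    # Average distance each query attends to, then averaged over all queries.
    return float((attention * distances).sum(axis=-1).mean())
```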
 
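Attention rollout [5], listed under the removed "Methods" section and exposed in the Attention Rollout Space, propagates attention through the network by averaging over heads, mixing in the identity to account for residual connections, and multiplying the per-layer maps. A minimal sketch under those assumptions (per-layer attention weights taken as given; not the repository's exact implementation):

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout over a list of per-layer attention weights.

    `attentions` is assumed to be a list of arrays, each of shape
    (num_heads, seq_len, seq_len), with the CLS token at index 0.
    """
    rollout = None
    for layer_attention in attentions:
        # Fuse heads by averaging.
        attention = layer_attention.mean(axis=0)
        # Account for residual connections, then re-normalize the rows.
        attention = 0.5 * attention + 0.5 * np.eye(attention.shape[0])
        attention = attention / attention.sum(axis=-1, keepdims=True)
        # Multiply per-layer maps to propagate attention through the network.
        rollout = attention if rollout is None else attention @ rollout
    # Row 0 is the CLS token; drop its self-attention entry and reshape the
    # remaining patch scores into a square mask.
    cls_attention = rollout[0, 1:]
    grid = int(np.sqrt(cls_attention.shape[0]))
    return cls_attention.reshape(grid, grid)
```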