| DINO | Dog |
| :--: | :--: |
|  |  |

[](https://github.com/tensorflow/tensorflow/releases/tag/v2.8.0)

_By [Aritra Roy Gosthipaty](https://github.com/ariG23498) and [Sayak Paul](https://github.com/sayakpaul) (equal contribution)_

We probe into the representations learned by different families of Vision Transformers (supervised pre-training with ImageNet-21k, ImageNet-1k, distillation, self-supervised pre-training):

* Original ViT [1]

We hope these tools will prove to be useful for the community. Please follow along with [this post on keras.io](https://keras.io/examples/vision/probing_vits/) for better navigation of the repository.

## Self-attention visualization

| Original Image | Attention Maps | Attention Maps Overlaid |
| :--: | :--: | :--: |
|  |  |  |

https://user-images.githubusercontent.com/36856589/162609884-8e51156e-d461-421d-9f8a-4d4e48967bd6.mp4

<small><a href=https://www.pexels.com/video/a-computer-generated-walking-dinosaur-4096297/>Original Video Source</a></small>

https://user-images.githubusercontent.com/36856589/162609907-4e432dc4-a731-40f4-9a20-94e0c8f648bc.mp4

<small><a href=https://www.pexels.com/video/a-dog-running-in-a-grass-field-4166343/>Original Video Source</a></small>
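
As a rough sketch of how these visualizations can be produced, the snippet below pulls the CLS token's attention out of the last Transformer block and upsamples it to the input resolution. It assumes a Keras ViT that also returns its per-block attention scores; the model path and the `attention_scores` dictionary key are illustrative, not this repository's exact API.

```python
import numpy as np
import tensorflow as tf

# Hypothetical handle: a ViT whose forward pass also returns a dict of per-block
# attention scores, each of shape (batch, num_heads, num_tokens, num_tokens).
# The loading utility and output names in this repository may differ.
model = tf.keras.models.load_model("vit_dino_base16")  # illustrative path

image = tf.random.uniform((1, 224, 224, 3))  # stand-in for a preprocessed frame
_, attention_scores = model(image, training=False)

# Attention paid by the CLS token to every patch token in the last block.
att = attention_scores["transformer_block_11_att"]  # (1, heads, tokens, tokens)
cls_to_patches = att[0, :, 0, 1:]                    # drop the CLS -> CLS entry

# Fold each head back into the patch grid and upsample to the image size.
num_heads = cls_to_patches.shape[0]
grid = int(np.sqrt(cls_to_patches.shape[-1]))        # 14 for 224x224 images, patch 16
maps = tf.reshape(cls_to_patches, (num_heads, grid, grid))
maps = tf.image.resize(maps[..., None], (224, 224))  # one heatmap per head
```

Overlaying `maps` on the original frame (for example with `matplotlib` alpha blending) produces visualizations like the ones above.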
## Supervised salient representations

In the [DINO](https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/) blog post, the authors show a video with the following caption:

> The original video is shown on the left. In the middle is a segmentation example generated by a supervised model, and on the right is one generated by DINO.

A screenshot of the video is as follows:

<img width="764" alt="image" src="https://user-images.githubusercontent.com/36856589/162615199-b5133e51-460e-4864-a83e-5b8007339ff7.png"><br>

The attention maps generated with the supervised pre-trained model are noticeably less salient than those produced by the DINO model, and we observe similar behaviour in our own experiments. The figure below shows the attention heatmaps extracted with a ViT-B16 model pre-trained (supervised) on ImageNet-1k:

| Dinosaur | Dog |
| :--: | :--: |
|  |  |

We used this [Colab Notebook](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/vitb16-attention-maps-video.ipynb) to conduct this experiment.

## Hugging Face Spaces

You can now probe into the ViTs with your own input images.

| Attention Heat Maps | Attention Rollout |
| :--: | :--: |
| [](https://huggingface.co/spaces/probing-vits/attention-heat-maps) | [](https://huggingface.co/spaces/probing-vits/attention-rollout) |

## Visualizing mean attention distances

<div align="center">
<img src="./assets/vit_base_i21k_patch16_224.png" width=450/>
</div>
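
For reference, here is a minimal sketch of the mean attention distance computation [1, 4] that underlies a plot like this: for every head, the geometric distance between each query patch and every key patch is averaged, weighted by the attention scores. The helper below is illustrative and assumes the CLS token has already been stripped from the attention matrix.

```python
import numpy as np

def mean_attention_distance(attention, patch_size=16):
    """Mean attention distance (in pixels) per head, for one Transformer block.

    attention: array of shape (num_heads, num_patches, num_patches) holding the
    softmaxed attention scores over patch tokens (CLS token already removed).
    """
    num_heads, num_patches, _ = attention.shape
    grid = int(np.sqrt(num_patches))

    # Pixel-space coordinates of every patch, flattened to (num_patches, 2).
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"), axis=-1)
    coords = coords.reshape(-1, 2) * patch_size

    # Pairwise Euclidean distances between patches: (num_patches, num_patches).
    distances = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

    # Attention-weighted distance, averaged over all query patches.
    weighted = attention * distances[None, ...]
    return weighted.sum(axis=-1).mean(axis=-1)  # shape: (num_heads,)
```

Plotting these per-head values for every block against network depth yields figures like the one shown above.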
## Methods

**We don't propose any novel methods for probing the representations of neural networks. Instead, we take existing works and implement them in TensorFlow.**

* Mean attention distance [1, 4]
* Attention Rollout [5] (see the sketch below)
* Visualization of the learned projection filters [1]
* Visualization of the learned positional embeddings
* Attention maps from individual attention heads [3]
* Generation of attention heatmaps from videos [3]
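
As an illustration, attention rollout [5] can be sketched as follows: the head-averaged attention matrix of every block is combined with the identity matrix (to account for residual connections), re-normalized, and recursively multiplied from the first block to the last. This is a minimal NumPy sketch, not the exact implementation used in the notebooks.

```python
import numpy as np

def attention_rollout(per_block_attention):
    """Attention rollout [5].

    per_block_attention: list of arrays, one per Transformer block, each of shape
    (num_heads, num_tokens, num_tokens). Returns a (num_tokens, num_tokens) matrix;
    row 0 gives the rollout of the CLS token over all tokens.
    """
    num_tokens = per_block_attention[0].shape[-1]
    result = np.eye(num_tokens)
    for att in per_block_attention:
        att = att.mean(axis=0)                       # average over heads
        att = att + np.eye(num_tokens)               # account for residual connections
        att = att / att.sum(axis=-1, keepdims=True)  # re-normalize rows
        result = att @ result                        # accumulate across blocks
    return result
```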
Another interesting repository that also visualizes ViTs, in PyTorch: https://github.com/jacobgil/vit-explain.

## Notes

We first implemented the above-mentioned architectures in TensorFlow and then populated them with the pre-trained parameters from the official codebases. To validate the ports, we evaluated the implementations on the ImageNet-1k validation set and ensured that the reported top-1 accuracies matched.
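
A minimal sketch of that sanity check is shown below; the model path, preprocessing, and batch size are placeholders, and the real preprocessing has to match the recipe of the corresponding official codebase.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

model = tf.keras.models.load_model("vit_b16_in1k")  # illustrative path to a ported model

def preprocess(example):
    image = tf.image.resize(example["image"], (224, 224))
    image = tf.cast(image, tf.float32) / 255.0       # placeholder normalization
    return image, example["label"]

val_ds = (
    tfds.load("imagenet2012", split="validation")
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

# Accumulate top-1 accuracy over the whole validation split.
top1 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)
for images, labels in val_ds:
    top1.update_state(labels, model(images, training=False))
print("Top-1 accuracy:", top1.result().numpy())
```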
We value the spirit of open source. So, if you spot any bugs in the code or see scope for improvement, don't hesitate to open an issue or contribute a PR. We'd very much appreciate it.

## References

[1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: [https://arxiv.org/abs/2010.11929](https://arxiv.org/abs/2010.11929)

[3] Emerging Properties in Self-Supervised Vision Transformers (DINO): [https://arxiv.org/abs/2104.14294](https://arxiv.org/abs/2104.14294)

[5] Quantifying Attention Flow in Transformers: [https://arxiv.org/abs/2005.00928](https://arxiv.org/abs/2005.00928)

## Acknowledgements

- [PyImageSearch](https://pyimagesearch.com)