---
title: README
emoji: 📉
colorFrom: gray
colorTo: blue
sdk: static
pinned: false
---

# All Things ViTs: Understanding and Interpreting Attention in Vision (CVPR'23 tutorial)

*By: [Hila Chefer](https://hila-chefer.github.io) and [Sayak Paul](https://sayak.dev)*

*Website: [atv.github.io](https://atv.github.io)*

*Abstract: In this tutorial, we explore different ways to leverage attention in vision: (i) attention can be used to explain a model's predictions (e.g., CLIP for an image-text pair); (ii) by manipulating attention-based explainability maps, one can enforce that predictions are made for the right reasons (e.g., foreground vs. background); (iii) the cross-attention maps of multi-modal models can be used to guide generative models (e.g., mitigating neglect in Stable Diffusion).*

This organization hosts the interactive demos presented at the tutorial. Below, you can find some of them.
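As a minimal sketch of the first idea above (using attention to inspect what a model looks at), the snippet below pulls raw self-attention maps from a pre-trained ViT with 🤗 Transformers. The checkpoint name and image path are assumptions for illustration; this is not the tutorial's own explainability method, just a starting point for exploring attention maps.

```python
# Minimal illustration (not the tutorial's method): extract raw self-attention
# maps from a pre-trained ViT with Hugging Face Transformers.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

model_name = "google/vit-base-patch16-224"  # assumed checkpoint for illustration
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name, output_attentions=True)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# `attentions` is a tuple with one tensor per layer,
# each of shape (batch, num_heads, num_tokens, num_tokens).
last_layer = outputs.attentions[-1]

# Attention from the [CLS] token to the image patches, averaged over heads,
# gives a coarse map of which patches the model attends to.
cls_to_patches = last_layer[0, :, 0, 1:].mean(dim=0)  # (num_patches,)
side = int(cls_to_patches.numel() ** 0.5)
attention_map = cls_to_patches.reshape(side, side)
print(attention_map.shape)  # torch.Size([14, 14]) for a 224x224 image with 16x16 patches
```

More faithful explanations (e.g., attention rollout or relevance-based methods) aggregate information across layers rather than reading a single layer's attention; the demos linked below cover those.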