Explanation of the method

#1
by giacomov - opened

Cool space! @EduardoPacheco do you have any pointers that explain the methodology used?

I looked into app.py expecting to see the extraction of the attention maps from the last layer, but instead I found this rather obscure piece of code:

with torch.no_grad():
    out = dino.forward_features(img_tensor)

features = out["x_prenorm"][:, 1:, :]

What does this last line do? What is "x_prenorm", and why are we skipping the first element of the second dimension? Is that the CLS token?

Thanks for your work!

Hey @giacomov, in this Space I'm using the original implementation that the authors provided, loaded through torch.hub. You can take a look at forward_features here

TL;DR

  • forward_features passes the input tensor through the ViT model
  • x_prenorm is the last hidden state from the ViT, before it is passed through the final LayerNorm
  • We skip the first token because it is the CLS token; only the image (patch) token embeddings are needed to make the visualizations (see the first sketch below)
  • The double PCA method I've used was mentioned in the paper, and people have also discussed it in the repo issues; here is a good discussion (a rough sketch follows below as well)
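
To make the shapes concrete, here is a minimal sketch of what the quoted snippet does, assuming the small DINOv2 checkpoint loaded through torch.hub (the model variant, image size, and preprocessing below are my assumptions, not necessarily what the Space uses):

import torch
from PIL import Image
from torchvision import transforms

dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dino.eval()

# 448x448 input with patch size 14 -> a 32x32 grid of patch tokens
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
# "example.jpg" is just a placeholder path
img_tensor = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    out = dino.forward_features(img_tensor)

# out["x_prenorm"] has shape (batch, 1 + num_patches, dim):
# index 0 is the CLS token, the rest are the patch (image) tokens.
features = out["x_prenorm"][:, 1:, :]   # drop the CLS token
print(features.shape)                    # torch.Size([1, 1024, 384]) for vits14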
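
And a rough sketch of the two-stage ("double") PCA visualization as I understand it from the paper and the linked discussion; the thresholding and sign handling are assumptions and may differ from what the Space actually does:

import numpy as np
from sklearn.decomposition import PCA

patch_feats = features[0].cpu().numpy()      # (num_patches, dim)
grid = int(np.sqrt(patch_feats.shape[0]))    # e.g. 32 for a 32x32 patch grid

# Stage 1: project to a single component and threshold it to separate
# foreground patches from the background.
pc1 = PCA(n_components=1).fit_transform(patch_feats)[:, 0]
foreground = pc1 > 0   # the sign may need flipping depending on the image

# Stage 2: run a second PCA (3 components) on the foreground patches only,
# then min-max scale each component to [0, 1] and use it as an RGB color.
rgb = np.zeros((patch_feats.shape[0], 3))
fg = PCA(n_components=3).fit_transform(patch_feats[foreground])
fg = (fg - fg.min(axis=0)) / (fg.max(axis=0) - fg.min(axis=0) + 1e-8)
rgb[foreground] = fg

vis = rgb.reshape(grid, grid, 3)   # upsample this grid to overlay on the image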
