How can I use only the text and visual embeddings?

#2
by Labmem009 - opened

Interesting work! I want to use the image-text alignment learned by this model's encoder for downstream tasks. How should I extract and use the text and image embeddings?
