Reason for using PerceiverResampler/cross-attention/IDEFICS-related modality layers?

#3
by besiktas - opened

I was wondering why you seem to be using a few ideas/layers from the IDEFICS model, like the PerceiverResampler, rather than just a linear projection/modality_projection before the input_merging? Have you found that it improves results/training, or is it mostly sticking to what IDEFICS had?
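To make the contrast concrete, here is a rough, hypothetical sketch of the two options being compared (a PerceiverResampler-style cross-attention block vs. a plain per-position linear map). The dimensions, latent count, and module names are my own assumptions for illustration, not the actual model code:

```python
import torch
import torch.nn as nn

class TinyPerceiverResampler(nn.Module):
    """Compresses a long vision sequence into a fixed set of learned
    latents via cross-attention (the IDEFICS-style option)."""

    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, seq_len, dim) -> (batch, num_latents, dim)
        queries = self.latents.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.cross_attn(queries, vision_feats, vision_feats)
        return out

# the simpler alternative: one linear map per position, sequence length unchanged
modality_projection = nn.Linear(1152, 4096)  # vision dim -> text dim, both assumed
```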

Also, it seems like the vision model being used (SiglipVisionModel) is not pretrained; is there a reason for that?

HuggingFaceM4 org • edited Jan 25

that's a great question!

the vision model (extracted from SigLIP) is pretrained, we are not starting from scratch.

as to the resampler, we found little to no loss from pooling the vision hidden states into a shorter sequence that is then fed to the language model. the modality projection is mainly there to transform the image hidden size to the text hidden size at each position.
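concretely, something like this minimal sketch (the hidden sizes and pooling factor here are illustrative assumptions, not the actual configuration):

```python
import torch
import torch.nn as nn

# illustrative sketch, not the actual model code: shorten the vision
# sequence by average-pooling groups of adjacent positions, then map
# each remaining position from the vision width to the text width.

vision_hidden_size = 1152  # assumed vision width, for illustration
text_hidden_size = 4096    # assumed text width, for illustration
pool_factor = 4            # positions merged into one

modality_projection = nn.Linear(vision_hidden_size, text_hidden_size)

def pool_and_project(vision_hidden_states: torch.Tensor) -> torch.Tensor:
    # vision_hidden_states: (batch, seq_len, vision_hidden_size)
    b, s, d = vision_hidden_states.shape
    pooled = vision_hidden_states.view(b, s // pool_factor, pool_factor, d).mean(dim=2)
    return modality_projection(pooled)  # (batch, s // pool_factor, text_hidden_size)

x = torch.randn(2, 1024, vision_hidden_size)
print(pool_and_project(x).shape)  # torch.Size([2, 256, 4096])
```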

It would be great if it could also take into account images in the website screenshot and all the small details in it.

HuggingFaceM4 org

> It would be great if it could also take into account images in the website screenshot

yes, that's part of WebSight v0.2, which we are working on!

VictorSanh changed discussion status to closed
