more details about local and global features

#1
by VictorSanh - opened

Cool release!

The blogpost mentions "H-Former integrates a dual-network design to learn both local and global features for vision-language alignment". Can you say more about the local and global features and how they are computed and combined?
I could not work that out from reading the code alone. It looks like the H-Former is essentially a Perceiver module, but I could be wrong.

Hi @VictorSanh, thank you for your interest in our work! We will provide more details in our technical report, which may take some time to prepare. We will let you know once the report is ready.
