more details about local and global features

by VictorSanh - opened

Cool release!

The blog post mentions that "H-Former integrates a dual-network design to learn both local and global features for vision-language alignment". Can you say more about the local and global features and how they are computed and combined?
I could not work that out from reading the code. It looks like the H-Former is essentially a Perceiver module? But I could be wrong.

HyperGAI org
edited Mar 20

Hi @VictorSanh, thank you for your interest in our work! We will provide more details in our technical report, which may take some time to prepare. We will let you know once the report is ready.
