by VictorSanh

Cool release!

The blog post mentions "H-Former integrates a dual-network design to learn both local and global features for vision-language alignment". Can you say more about the local and global features and how they are computed and combined?
I could not work that out just from reading the code. It looks like the H-Former is essentially a Perceiver module? But I could be wrong.

Hi @VictorSanh, thank you for your interest in our work! We will provide more details in our technical report, which may take some time to prepare. We will let you know once the report is ready.

