Merve Noyan PRO

merve

AI & ML interests

VLMs, vision & co

Posts 31

Post
Do we fully leverage ViT encoders in vision language models?

A new paper (by @HuanjinYao et al.) builds a dense connector that does it better!
Model: HuanjinYao/DenseConnector-v1.5-8B
Collection: HuanjinYao/denseconnector-66500e173fc8c9f05dc98dea

VLMs consist of an image encoder, a projection layer that maps image embeddings into the text embedding space, and a text decoder, connected sequentially 📖
This paper explores using the intermediate states of the image encoder rather than only its final output 🤩
The authors explore three different ways of instantiating the dense connector: sparse token integration, sparse channel integration and dense channel integration (see the paper for how they do it: Dense Connector for MLLMs (2405.13800)).
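
To make the three variants a bit more concrete, here is a rough NumPy sketch of how I read them. It is not the authors' code: the function names, the layer indices, the 1-D token pooling and the tiny tensor sizes are all my own assumptions, and the learned projector that follows the connector is left out. Check the paper for the exact recipe.

```python
import numpy as np

# Hypothetical stand-in for ViT hidden states: 24 layers, 16 visual tokens, dim 8.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(16, 8)) for _ in range(24)]


def sparse_token_integration(states, layer_ids, pool=2):
    """Pick a few intermediate layers, shrink their token count by average
    pooling, and append them to the final layer along the token axis.
    Assumption: simple 1-D pooling over adjacent tokens."""
    pieces = []
    for i in layer_ids:
        h = states[i]
        n, d = h.shape
        usable = n - n % pool  # drop a ragged tail if there is one
        pieces.append(h[:usable].reshape(-1, pool, d).mean(axis=1))
    pieces.append(states[-1])
    return np.concatenate(pieces, axis=0)


def sparse_channel_integration(states, layer_ids):
    """Pick a few intermediate layers and concatenate them with the final
    layer along the channel axis, so the token count stays the same."""
    feats = [states[i] for i in layer_ids] + [states[-1]]
    return np.concatenate(feats, axis=-1)


def dense_channel_integration(states, num_groups):
    """Use every layer: split the layers into groups, average each group,
    then concatenate the group averages with the final layer along channels."""
    groups = np.array_split(np.stack(states), num_groups, axis=0)
    pooled = [g.mean(axis=0) for g in groups]
    return np.concatenate(pooled + [states[-1]], axis=-1)


sti = sparse_token_integration(hidden_states, layer_ids=[7, 15])
sci = sparse_channel_integration(hidden_states, layer_ids=[7, 15])
dci = dense_channel_integration(hidden_states, num_groups=2)

print(sti.shape)  # (32, 8): more tokens, same width
print(sci.shape)  # (16, 24): same tokens, wider features
print(dci.shape)  # (16, 24): same tokens, wider features
```

In each case the result would then go through the projection layer into the text embedding space, so the decoder itself does not need to change.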

They integrate all three into LLaVA 1.5 and find that each of the new models is superior to the original LLaVA 1.5 🥹 I tried the model and it seems to work very well. As part of the release, the authors have published various checkpoints based on different decoders (Vicuna 7/13B and Llama 3-8B) that you can find in the collection 🤗

Post
We will be providing ZeroGPU grants (for Spaces inference) to those who want to fine-tune PaliGemma and build a Space 🔥

You can pick any dataset of your choice!

Example code: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing (you can use a smaller GPU with QLoRA)
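
As a back-of-the-envelope illustration of why QLoRA helps here: it keeps the frozen base weights in 4-bit, so the weights alone take about a quarter of the memory they need in 16-bit. The sketch below assumes a model of roughly 3B parameters (my assumption for this example) and ignores activations, the LoRA adapters, optimizer state and quantization overhead, so treat it as a lower bound, not a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 3e9  # assumption: a roughly 3B-parameter model

bf16_gib = weight_memory_gib(num_params, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(num_params, 4)    # 4-bit weights, as in QLoRA

print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f"4-bit weights:  {nf4_gib:.2f} GiB")
```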

Datasets:
https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=trending
https://huggingface.co/datasets?task_categories=task_categories:image-to-text&sort=trending