Merve Noyan (merve)
AI & ML interests: VLMs, vision & co

Chameleon 🦎 by Meta is now available in Hugging Face transformers 😍
A vision language model that comes in 7B and 34B sizes 🤩
But what makes this model so special?

Demo: merve/chameleon-7b
Models: facebook/chameleon-668da9663f80d483b4c61f58
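
Loading it looks roughly like this (a minimal sketch assuming a recent transformers version with Chameleon support; the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained(
    "facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("page.png")  # placeholder path, use your own image
prompt = "What do you see in this image?<image>"  # <image> marks where the image goes

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```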


Chameleon is a unique model: it attempts to scale early fusion 🀨
But what is early fusion?
Most modern vision language models pair a vision encoder with a projection layer that maps image embeddings into the text decoder's (LLM's) input space, so the LLM can be prompted with images

Early fusion, on the other hand, fuses all features (image patches and text) from the start: an image tokenizer turns images into discrete tokens, and all tokens are projected into a shared space, which enables seamless generation 😏
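
To make the contrast concrete, here is a toy sketch of the two designs (hypothetical modules for illustration, not Chameleon's actual code; vision_encoder, llm, and image_tokenizer are stand-ins):

```python
import torch
import torch.nn as nn

# --- Late fusion (typical VLM): continuous image features projected into the LLM ---
class LateFusionVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a ViT returning patch features
        self.projection = nn.Linear(vision_dim, text_dim)
        self.llm = llm

    def forward(self, image, text_embeds):
        patch_feats = self.vision_encoder(image)     # (batch, patches, vision_dim)
        image_embeds = self.projection(patch_feats)  # map into the LLM's space
        return self.llm(torch.cat([image_embeds, text_embeds], dim=1))

# --- Early fusion (Chameleon-style): images become discrete tokens up front ---
class EarlyFusionVLM(nn.Module):
    def __init__(self, image_tokenizer, llm, vocab_size, dim):
        super().__init__()
        self.image_tokenizer = image_tokenizer       # e.g. a VQ-style codebook lookup
        self.embed = nn.Embedding(vocab_size, dim)   # one shared table for image + text tokens
        self.llm = llm

    def forward(self, image, text_token_ids):
        image_token_ids = self.image_tokenizer(image)  # (batch, n) discrete token ids
        token_ids = torch.cat([image_token_ids, text_token_ids], dim=1)
        return self.llm(self.embed(token_ids))         # one stream, one shared space
```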

The authors also introduced architectural improvements (QK-norm and revised placement of layer norms) for scalable and stable training, and they were able to increase the token count (5x the tokens compared to Llama 3, which is a must with early fusion IMO)
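
For intuition, QK-norm normalizes queries and keys right before the attention dot product so the logits stay bounded; a minimal sketch of the idea (an illustrative layer-norm variant, not Chameleon's exact code):

```python
import torch
import torch.nn.functional as F

def attention_with_qk_norm(q, k, v):
    # Normalize queries and keys before computing attention scores; this keeps
    # the logits' magnitude in check and stabilizes training at scale.
    q = F.layer_norm(q, q.shape[-1:])
    k = F.layer_norm(k, k.shape[-1:])
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```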

This model is an any-to-any model thanks to early fusion: it can take image and text input and output image and text, but image generation is disabled to prevent malicious use.

One can also do text-only prompting: the authors note the model catches up with larger LLMs like Mixtral 8x7B or the larger Llama-2 70B. The same holds for image-text pair prompting against larger VLMs like IDEFICS-80B (see the paper for the benchmarks: Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818))
Thanks for reading!
Forget any document retrievers, use ColPali 💥💥

Document retrieval is usually done through OCR + layout detection, but you lose a lot of information along the way. Stop doing that! 🤓

ColPali uses a vision language model instead, which is better at doc understanding 📑
ColPali: vidore/colpali (mit license!)
Blog post: https://huggingface.co/blog/manu/colpali
The authors also released a new benchmark for document retrieval:
ViDoRe Benchmark: vidore/vidore-benchmark-667173f98e70a1c0fa4db00d
ViDoRe Leaderboard: vidore/vidore-leaderboard
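
In code, retrieval roughly looks like this (a sketch based on the authors' colpali_engine package; the class names and the score_multi_vector helper are assumptions from the package README, so check the model card for the canonical snippet):

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor  # assumed API

model = ColPali.from_pretrained(
    "vidore/colpali", torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali")

images = [Image.open("page_1.png"), Image.open("page_2.png")]  # placeholder pages
queries = ["What was the revenue in 2023?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # multi-vector page embeddings
    query_embeddings = model(**batch_queries)  # multi-vector query embeddings

# Late-interaction (MaxSim) scores between every query and every page
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```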

ColPali marries the idea of modern vision language models with retrieval 🤝

The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call this BiSigLIP). They then feed the patch embedding outputs to PaliGemma, creating BiPali 🖇️
BiPali natively feeds image patch embeddings into an LLM, which enables ColBERT-like late interaction computations between text tokens and image patches (hence the name ColPali!) 🤩
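
The late-interaction score itself is simple; a minimal sketch, assuming L2-normalized embeddings (my own illustration, not the repo's code):

```python
import torch

def maxsim(query_embeds, patch_embeds):
    # ColBERT-style late interaction: each query token picks its best-matching
    # document patch, and the per-token maxima are summed into one score.
    # query_embeds: (num_query_tokens, dim); patch_embeds: (num_patches, dim)
    sim = query_embeds @ patch_embeds.T  # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()
```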

The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet.
ColPali seems to be the most performant model on ViDoRe. Not only that, it is way faster than traditional PDF parsers too!