merve
posted an update Jul 19
Chameleon 🦎 by Meta is now available in Hugging Face transformers 😍
A vision language model that comes in 7B and 34B sizes 🤩
But what makes this model so special?

Demo: merve/chameleon-7b
Models: facebook/chameleon-668da9663f80d483b4c61f58

keep reading ⥥

Chameleon is a unique model: it attempts to scale early fusion 🤨
But what is early fusion?
Most modern vision language models use a vision encoder with a projection layer that maps image embeddings into the input space of the text decoder (LLM), so the decoder can be prompted with them.

Early fusion, on the other hand, attempts to fuse all features together (image patches and text): an image tokenizer turns images into tokens, and all tokens are projected into a shared space, which enables seamless generation 😏
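To make the idea concrete, here is a minimal toy sketch of early fusion in plain Python. It is not Chameleon's code: the vocabulary, the codebook size, and the "image tokenizer" (which just quantizes one float per patch) are all hypothetical stand-ins. The only point is that image codes and text ids end up in one shared id space and one interleaved sequence.

```python
# Toy sketch of early fusion. Every name and size here is an assumption
# made for illustration, not taken from the Chameleon implementation.

TEXT_VOCAB = {"<bos>": 0, "a": 1, "photo": 2, "of": 3, "cat": 4}
IMAGE_CODEBOOK_SIZE = 8          # hypothetical toy codebook
IMAGE_OFFSET = len(TEXT_VOCAB)   # image codes are placed after the text ids


def tokenize_text(words):
    """Look up each word in the toy text vocabulary."""
    return [TEXT_VOCAB[w] for w in words]


def tokenize_image(patches):
    """Quantize each patch (a float in [0, 1)) to a discrete code,
    then shift the code into the shared id space."""
    codes = [min(int(p * IMAGE_CODEBOOK_SIZE), IMAGE_CODEBOOK_SIZE - 1)
             for p in patches]
    return [IMAGE_OFFSET + c for c in codes]


def build_sequence(segments):
    """Interleave text and image segments into one flat token sequence."""
    seq = [TEXT_VOCAB["<bos>"]]
    for kind, payload in segments:
        if kind == "text":
            seq += tokenize_text(payload)
        elif kind == "image":
            seq += tokenize_image(payload)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return seq


sequence = build_sequence([
    ("text", ["a", "photo", "of"]),
    ("image", [0.1, 0.5, 0.9, 0.3]),
    ("text", ["cat"]),
])
print(sequence)  # [0, 1, 2, 3, 5, 9, 12, 7, 4]
```

Because everything is a token in the same sequence, a single decoder can in principle both read and emit either modality, which is what the encoder-plus-projection setup does not give you.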

The authors also introduced architectural changes (QK-norm and revised placement of layer norms) for scalable, stable training, and they were able to increase the training token count (5x the tokens of Llama 2, which is a must with early fusion IMO)
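For intuition on QK-norm, here is a small sketch, assuming the common formulation where layer norm is applied to the query and key vectors before the dot product. It leaves out the learnable scale and bias a real layer norm would have, and the vectors are made-up numbers, so treat it as an illustration rather than the paper's implementation.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned params)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def attention_logit(q, k, use_qk_norm):
    """Scaled dot-product logit for one query/key pair."""
    if use_qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Hypothetical query/key vectors with large norms, as can happen when
# activations drift during training.
q = [100.0, -200.0, 300.0, 50.0]
k = [400.0, -100.0, 250.0, -50.0]

raw_logit = attention_logit(q, k, use_qk_norm=False)
normed_logit = attention_logit(q, k, use_qk_norm=True)

print(raw_logit)     # 66250.0
print(normed_logit)  # magnitude is at most sqrt(head_dim) = 2.0 here
```

With normalization, the logit going into the softmax is bounded regardless of how large the raw activations get, which is the kind of stability this trick is after.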

This model is an any-to-any model thanks to early fusion: it can take image and text input and output image and text, but image generation is disabled to prevent malicious use.

One can also do text-only prompting: the authors noted the model catches up with larger LLMs (like Mixtral 8x7B or the larger Llama-2 70B) and, on image-text prompting, with larger VLMs like IDEFICS-80B (see the paper for the benchmarks: Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818))
Thanks for reading!

Thanks for the post and your efforts to share the knowledge.

https://huggingface.co/spaces/merve/chameleon-7b -- Space does not seem to work. When I ask for a summary of an image, the result is a summary of some random table and not of the one I uploaded. Please check when you can

·

Can you send your inputs for reproducibility? @prasiyer

I'm sure it's another Llama clone? Did they change the code or not?
Is it the same code copied again?

·

It is a vision language model; these models use a text decoder (here it's built on Llama-2, since it's another model from Meta) as one component. VLMs differ substantially from LLMs; if you read the post above, you can see the difference.