CoLLaVO: Crayon Large Language and Vision mOdel

Published on Feb 17
· Submitted by akhaliq on Feb 20


The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards versatile general-purpose models. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities, as determined by questions such as 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on Vision Language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with crayon prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy, Dual QLoRA, that preserves object-level image understanding so that it is not forgotten during visual instruction tuning, thereby achieving a significant leap on numerous zero-shot VL benchmarks.


Paper author
edited Mar 1

You can access the code of CoLLaVO-7B by

@BK-Lee would you like to host the model and the demo on Hugging Face?

Paper author

Yes! I am preparing the code first and will then host the model on a Hugging Face Space. We are also preparing a follow-up large language and vision model with stronger performance, so we plan to upload them simultaneously. Thanks for your interest!

Paper author

CoLLaVO-7B model has been released in!

@BK-Lee great initiative with model card 🤩 looking forward to the demo!
