visheratin posted an update Feb 24
VLMs have a resolution problem that prevents them from finding small details in large images. In my community blog post, I discuss ways to solve it and describe the details of the MC-LLaVA architecture - https://huggingface.co/blog/visheratin/vlm-resolution-curse

Check it out, and let me know what you think!
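
To get a feel for the scale of the problem, here is a back-of-the-envelope sketch (the photo dimensions and detail size are made-up examples; 384 px is a typical SigLIP input size):

```python
# How much detail survives when a vision encoder resizes the whole image?
# All numbers here are illustrative examples.
image_width, image_height = 4032, 3024  # a typical phone photo
encoder_size = 384                      # square input a SigLIP-style encoder expects

scale = encoder_size / max(image_width, image_height)
detail_px = 120                         # e.g., a small street sign in the photo
print(f"A {detail_px}px detail shrinks to ~{detail_px * scale:.0f}px")
# A 120px detail ends up ~11px wide - far too small for the encoder to read.
```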

Seems fun to explore! Can you link some reference papers too, if possible?


There are links to existing papers in the blog post if you want to dive into the field.

so good!

Hi @visheratin, do you have any guides on how to train a similar model? Phi-2 + SigLIP vision encoder?


I mainly used the LLaVA training codebase with some changes to support multi-crop. I'll be working on the next post about fine-tuning MC-LLaVA on a task-specific dataset and will open-source all the code.
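
For anyone curious what the multi-crop part might look like in practice, here is a rough sketch (this is not the MC-LLaVA code; `split_into_crops`, `max_crops`, and the grid strategy are my own illustrative assumptions):

```python
import math

from PIL import Image


def split_into_crops(image: Image.Image, max_crops: int = 6, crop_size: int = 384):
    """Split an image into at most max_crops square crops plus a global view."""
    cols = math.ceil(image.width / crop_size)
    rows = math.ceil(image.height / crop_size)
    # Shrink the grid until it fits the crop budget.
    while cols * rows > max_crops:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = image.resize((cols * crop_size, rows * crop_size))
    crops = [
        resized.crop((c * crop_size, r * crop_size,
                      (c + 1) * crop_size, (r + 1) * crop_size))
        for r in range(rows)
        for c in range(cols)
    ]
    # Add a downscaled view of the whole image so global context is preserved.
    crops.append(image.resize((crop_size, crop_size)))
    return crops
```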

I found your blog post really interesting.
I have a question about training: in your method, you mention that images are divided into max_crop patches and then fed into the image encoder. Does this mean that, compared to the original LLaVA, the model's forward pass requires max_crop times more time or memory?
Or is there a more efficient way to implement this?


You are right. The method requires multiple passes through the vision encoder, which increases memory usage. This is not such a big problem during inference, but it makes training harder because of the stored gradients. At the moment, I don't have a solution to make it more efficient. But this is an ongoing project, so maybe I will find one =)
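
To make the memory asymmetry concrete, here is a minimal PyTorch sketch (`vision_encoder` is a placeholder for any callable encoder, not the actual MC-LLaVA implementation):

```python
import torch


def encode_crops(vision_encoder, crops: torch.Tensor, training: bool) -> torch.Tensor:
    """Encode a stack of crops of shape (num_crops, 3, H, W)."""
    if training:
        # One batched pass: activations for every crop are kept for backprop,
        # so memory grows roughly linearly with num_crops.
        return vision_encoder(crops)
    # At inference, crops can be encoded one at a time under no_grad,
    # keeping peak memory close to the single-crop case.
    with torch.no_grad():
        return torch.cat([vision_encoder(crop.unsqueeze(0)) for crop in crops])
```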