Getting a list of all detected items on a picture?

#10
by Marcophono - opened

Hello!
The standard answer to my question would probably be: use an object detection model! Sure, I tried that. But they are all very weak compared with BLIP 2. For example, a pillow in the shape of SpongeBob is detected as "pillow", or at best as "yellow pillow". BLIP 2 knows exactly what it is: "A pillow in form of Spongebob." BUT: BLIP 2 is a bit lazy and only describes the main features of an image. If I asked "What is lying on the bed?" it would give the desired answer. But I would prefer to get more information about all detected items. Is that possible? Increasing min_length doesn't help: the model begins to hallucinate long before it gets to the other detected items.
I use blip2-flan-t5-xxl, loaded in 8-bit.

Best regards - and thank you for this wonderful piece of code to connect image and text models!
Marc

Model-wise, BLIP2 can't be used on the fly for this purpose. The use case you describe would typically require a procedure that generates regions of interest, which BLIP2 does not provide out of the box.

Yet a bit of engineering may help. For example, you can generate bounding boxes and labels using off-the-shelf detectors, and then ask BLIP2 to enrich the description, e.g. by prompting, or by feeding it the region crop for captioning.
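A minimal sketch of that detect-then-describe idea, in Python. It assumes the detector returns a list of dicts shaped like `{"label", "score", "box": {"xmin", "ymin", "xmax", "ymax"}}` and that the image is PIL-like (has `.size` and `.crop`). The names `detect`, `caption`, `expand_box` and `describe_regions`, the margin, the score threshold and the prompt wording are all illustrative assumptions, not part of BLIP2 or of any library.

```python
def expand_box(box, image_size, margin=0.1):
    """Grow an (xmin, ymin, xmax, ymax) box by `margin` of its width/height
    on each side, clamped to the image bounds."""
    xmin, ymin, xmax, ymax = box
    width, height = image_size
    dx = (xmax - xmin) * margin
    dy = (ymax - ymin) * margin
    return (
        max(0, int(xmin - dx)),
        max(0, int(ymin - dy)),
        min(width, int(xmax + dx)),
        min(height, int(ymax + dy)),
    )


def describe_regions(image, detect, caption, margin=0.1, min_score=0.5):
    """Run a detector, crop each confident detection, and ask a captioner
    to enrich the coarse label.

    detect(image)         -> list of detections (assumed format, see above)
    caption(crop, prompt) -> str (hypothetical wrapper around BLIP2)
    """
    results = []
    for det in detect(image):
        # Skip low-confidence boxes; a missing score is treated as confident.
        if det.get("score", 1.0) < min_score:
            continue
        b = det["box"]
        box = expand_box(
            (b["xmin"], b["ymin"], b["xmax"], b["ymax"]), image.size, margin
        )
        crop = image.crop(box)
        # The prompt wording is an assumption; adjust it to your model.
        prompt = f"Question: What kind of {det['label']} is this? Answer:"
        results.append(
            {"label": det["label"], "box": box, "description": caption(crop, prompt)}
        )
    return results
```

Here `detect` would wrap whichever off-the-shelf detector you use, and `caption` would wrap a BLIP2 generation call on the crop plus prompt. The small margin around each box is there to keep a little context in the crop.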

That's a good idea, thank you very much, @dxli !
What I had already tried was asking "What do you see in the top right corner?" This sometimes helps, but your suggestion is much better! I will try that out!
Best regards
Marc

@dxli : Follow-up question: What is the recommended minimum size of such tiles (or of images for BLIP 2 in general)? I can't remember reading this anywhere.

We have random augmentation during training. So I'd expect it to work fine for reasonably sized patches. Yet there is no definite answer. You may want to experiment and see.
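One way to run that experiment, sketched in Python: crop the same object at several tile sizes and compare the answers by eye. This again assumes a PIL-like image (`.size`, `.crop`) and a hypothetical `caption(crop, prompt)` wrapper around BLIP2. The helper names and the candidate side lengths are arbitrary choices for illustration, not recommended values.

```python
def centered_tile(box, image_size, tile_side):
    """Return a tile of up to tile_side x tile_side pixels centred on `box`,
    shifted (and shrunk if necessary) so it stays inside the image."""
    xmin, ymin, xmax, ymax = box
    width, height = image_size
    side_w = min(tile_side, width)
    side_h = min(tile_side, height)
    cx = (xmin + xmax) // 2
    cy = (ymin + ymax) // 2
    # Clamp the top-left corner so the tile never leaves the image.
    left = min(max(0, cx - side_w // 2), width - side_w)
    top = min(max(0, cy - side_h // 2), height - side_h)
    return (left, top, left + side_w, top + side_h)


def sweep_tile_sizes(image, box, caption, prompt, sides=(64, 128, 224, 448)):
    """Caption the same object at several tile sizes.

    Returns {tile_side: caption_text} so the outputs can be compared manually.
    """
    return {
        side: caption(image.crop(centered_tile(box, image.size, side)), prompt)
        for side in sides
    }
```

If the captions only become stable above a certain side length, that gives a practical lower bound for your own images.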

dxli changed discussion status to closed
