@merve on Hugging Face: "Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉…"

merve

posted an update Jan 19, 2024


Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉
OWLv2 is a scaled-up version of a model called OWL-ViT, so let's take a look at that first. 📝
OWL-ViT is an open-vocabulary object detector, meaning it can detect objects it didn't explicitly see during training. 👀
What's cool is that it can take both image and text queries! This works because the image and text features aren't fused together.
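
To make the "not fused" point concrete, here is a toy sketch (not the actual OWL-ViT code): each detected object has its own embedding, and any query embedding, whether it came from the text encoder or the image encoder, is simply scored against it by similarity. The vectors and the `score_queries` helper are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def score_queries(object_embeddings, query_embeddings):
    """For each object, return (index of best query, similarity to it)."""
    results = []
    for obj in object_embeddings:
        sims = [cosine(obj, q) for q in query_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((best, sims[best]))
    return results

# Hypothetical per-object embeddings coming out of the vision encoder.
object_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]

# Hypothetical query embeddings: two from text, one from an example image.
text_queries = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
image_queries = [[0.1, 0.1, 1.0]]

# The same scoring function serves both query types.
text_matches = score_queries(object_embeddings, text_queries)
image_matches = score_queries(object_embeddings, image_queries)
```

Because the two encoders only meet at this similarity step, swapping a text query for an image query needs no change to the detector itself.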

Taking a look at the architecture, the authors first do contrastive pre-training of a vision encoder and a text encoder (just like CLIP). They then take that model, remove the final pooling layer, attach lightweight classification and box prediction heads, and fine-tune it.
During fine-tuning for object detection, they calculate the loss over bipartite matches. Simply put, the predicted objects are matched one-to-one with the ground-truth objects so that the total matching cost is as low as possible, and the loss is then calculated over those matched pairs.
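
Here is a minimal sketch of what such a one-to-one matching looks like, assuming a plain L1 distance between boxes as the cost. This is a simplification for illustration: the real matching cost also involves classification terms, and practical implementations use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) rather than the brute-force search below. The helper names and boxes are hypothetical.

```python
from itertools import permutations

def l1_box_cost(pred, gt):
    """L1 distance between two boxes given as [x_min, y_min, x_max, y_max]."""
    return sum(abs(p - g) for p, g in zip(pred, gt))

def match_predictions(pred_boxes, gt_boxes):
    """Brute-force minimum-cost one-to-one matching.

    Assumes there are at least as many predictions as ground-truth boxes.
    Returns a list of (prediction index, ground-truth index) pairs and the
    total cost. Only practical for tiny toy inputs.
    """
    best_cost, best_assignment = float("inf"), ()
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            l1_box_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    pairs = [(p, g) for g, p in enumerate(best_assignment)]
    return pairs, best_cost

# Toy data: three predictions, two ground-truth objects.
pred_boxes = [
    [0.5, 0.5, 0.9, 0.9],
    [0.1, 0.1, 0.4, 0.4],
    [0.0, 0.0, 1.0, 1.0],
]
gt_boxes = [
    [0.1, 0.1, 0.5, 0.5],
    [0.5, 0.5, 1.0, 1.0],
]

pairs, total_cost = match_predictions(pred_boxes, gt_boxes)
```

Predictions left unmatched (the third box here) are the ones a detection loss would treat as "no object".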

OWL-ViT is very scalable. You can easily scale most language models or vision-language models because they require no human-annotated supervision, but this isn't the case for object detection: you still need at least weak supervision. Moreover, scaling only the encoders runs into a bottleneck after a while.

The authors wanted to scale OWL-ViT with more data, so they used OWL-ViT itself to label images, "self-trained" a new detector on those labels, and then fine-tuned the model on human-annotated data.
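
The labelling step can be pictured roughly like this: run the existing detector, then keep only its confident predictions as pseudo-labels for the next detector. This is a sketch under assumptions, not the authors' pipeline; the `make_pseudo_labels` helper, the dict layout, and the threshold value are all hypothetical.

```python
def make_pseudo_labels(detections, score_threshold=0.3):
    """Keep only confident detections as pseudo-labels for self-training.

    `score_threshold` is an illustrative value, not one taken from the paper.
    """
    return [
        {"box": d["box"], "label": d["label"]}
        for d in detections
        if d["score"] >= score_threshold
    ]

# Hypothetical detector output for one unlabeled image.
detections = [
    {"box": [0.1, 0.1, 0.5, 0.5], "label": "cat", "score": 0.82},
    {"box": [0.6, 0.6, 0.7, 0.7], "label": "remote", "score": 0.12},
    {"box": [0.0, 0.4, 1.0, 1.0], "label": "sofa", "score": 0.45},
]

# These filtered boxes would serve as training targets for the new detector.
pseudo_labels = make_pseudo_labels(detections)
```

The trade-off in a filter like this is between label noise (threshold too low) and too few training boxes (threshold too high).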

Thanks to this, OWLv2 scaled very well and topped leaderboards on open-vocabulary object detection 👑
If you'd like to try it out, I will leave a couple of links with apps, notebooks and more in the comments! 🤗

merve

Jan 19, 2024

I've created a notebook for you to see how to use it with 🤗 transformers: https://colab.research.google.com/drive/10dAutMbC1ewRqgS3hqDyjOWOsJAv08E6?usp=sharing
If you want to play with it directly, you can use this Space: https://huggingface.co/spaces/merve/owlv2
All the models and applications of the OWL series are in this collection: https://huggingface.co/collections/merve/owl-series-65aaac3114e6582c300544df

yte1008

Dec 31, 2024

Hi Merve,

First of all, thanks for this insightful post.

I wonder whether it is possible to fine-tune OWLv2 for a specific task using our custom dataset on our local machines.

Could you please give some information about this topic if you have time?

Thanks in advance...
