I love vision language models! My favorite is KOSMOS-2, because it's a grounded model: it ties the phrases it generates to regions of the image, which helps cut down on hallucination.
In this demo you can:
- ask a question about the image
- get detailed or brief captions
- localize the objects 🤯
It's just amazing for a VLM to return bounding boxes 🤩
Try it here: merve/kosmos2
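If you want to draw those bounding boxes yourself, here is a minimal sketch of the last step. It assumes the grounded entities arrive as `(phrase, (start, end), [boxes])` tuples with box coordinates normalized to 0..1, which is the shape I'd expect from the Hugging Face `transformers` KOSMOS-2 processor's `post_process_generation`. The helper name `to_pixel_boxes` and the example entity are made up for illustration, not taken from the demo.

```python
from typing import List, Tuple

# Assumed entity shape: (phrase, (start_char, end_char), [normalized boxes]).
# Each box is (x1, y1, x2, y2) with values in the range 0..1.
Box = Tuple[float, float, float, float]
Entity = Tuple[str, Tuple[int, int], List[Box]]


def to_pixel_boxes(
    entities: List[Entity], width: int, height: int
) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Scale normalized boxes to pixel coordinates for an image of the given size."""
    results = []
    for phrase, _span, boxes in entities:
        for x1, y1, x2, y2 in boxes:
            pixel_box = (
                round(x1 * width),
                round(y1 * height),
                round(x2 * width),
                round(y2 * height),
            )
            results.append((phrase, pixel_box))
    return results


# Hypothetical entity, not real model output.
example = [("a snowman", (12, 21), [(0.25, 0.5, 0.75, 1.0)])]
print(to_pixel_boxes(example, width=640, height=480))
# → [('a snowman', (160, 240, 480, 480))]
```

The pixel boxes can then be handed to any drawing library to overlay rectangles and labels on the image.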