arxiv:2404.01911
VLRM: Vision-Language Models act as Reward Models for Image Captioning
Published on Apr 2
Abstract
In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models such as CLIP and BLIP2-ITM as reward models. The RL-tuned model is able to generate longer and more comprehensive descriptions. Our model reaches an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy test split. Weights are available at https://huggingface.co/sashakunitsyn/vlrm-blip2-opt-2.7b.
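The released checkpoint can be tried directly for caption generation. The snippet below is a minimal usage sketch, assuming the published VLRM weights load into the standard BLIP-2 classes from the transformers library (the processor path, example image, and generation settings are illustrative choices, not taken from the paper):

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumption: the released VLRM weights are a drop-in replacement for the
# base BLIP2-OPT-2.7B checkpoint, so the standard BLIP-2 classes apply.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "sashakunitsyn/vlrm-blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Any RGB image works; this COCO validation image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=60)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```

The abstract describes using vision-language models as reward models for RL tuning. As an illustration only, and not the authors' exact reward, a CLIP-based reward for a generated caption can be computed as the cosine similarity between the image and text embeddings (this reuses `torch`, `image`, and `caption` from the snippet above):

```python
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch, not the paper's exact reward: score a generated
# caption against its image with CLIP image-text cosine similarity.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_reward(image, caption: str) -> float:
    inputs = clip_processor(text=[caption], images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # Normalize the projected embeddings and take their dot product.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()

print(clip_reward(image, caption))  # higher means the caption matches the image better
```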