Probing Visual Language Priors in VLMs

ImageDPO Finetuned Model

This page provides the ImageDPO-finetuned checkpoint of LLaVA-v1.5-13B used in Probing Visual Language Priors in VLMs. ImageDPO is a self-improving approach that enhances VLM visual reasoning by increasing the model's reliance on visual inputs, as illustrated in the image below. We provide the merged model weights for direct use.

[Image: ImageDPO method overview]

Usage

First, install the LLaVA-v1.5 codebase.

Then run the following command to try the model:

python -m llava.eval.run_llava \
    --model-path ViLP/LLaVA-v1.5-13b-ImageDPO \
    --image-file 'images/llava_logo.png' \
    --query 'Please caption this image.' \
    --conv-mode llava_v1

Citation Information

If you find this resource helpful, please consider citing the ViLP paper:

@article{luo2024probing,
    title={Probing Visual Language Priors in VLMs},
    author={Luo, Tiange and Cao, Ang and Lee, Gunhee and Johnson, Justin and Lee, Honglak},
    journal={arXiv preprint arXiv:2501.00569},
    year={2024}
}