Model,Large Language Model,Validation Split,Test Split
BLIP-2,Flan-T5-XL,26.71,27.90
InstructBLIP,Flan-T5-XL,28.09,25.19
InstructBLIP Vicuna,Vicuna-7B,26.53,26.64
LLaVA,LLaMA-7B,27.0,28.16
MiniGPT-4,Vicuna-7B,28.11,30.93
VPGTrans,LLaMA-7B,27.38,24.12
MultiModal-GPT,Vicuna-7B,27.81,30.43
Otter,LLaMA-7B,28.08,30.87
OpenFlamingo,LLaMA-7B,27.67,30.18
LLaMA-Adapter V2,LLaMA-7B,27.81,30.43
GVT,Vicuna-7B,27.87,29.67
mPLUG-Owl,LLaMA-7B,27.63,31.31
mPLUG-Owl-2,LLaMA2-7B,27.84,30.37
Kosmos-2,Decoder only 1.3B,26.97,""
Qwen-VL-Chat,Qwen-7B,27.69,31.06
LLaVA-1.5,Vicuna-7B,27.81,29.80
VideoChat,Vicuna-7B,27.51,28.72
Video-ChatGPT,LLaMA-7B,27.33,29.17
Valley,LLaMA-13B,27.27,30.11
Video-LLaMA,LLaMA2-Chat-7B,28.58,30.30
SEED-LLaMA,LLaMA2-Chat-13B,29.93,""
SEED-X,LLaMA2-Chat-13B,31.07,29.92
DeepSeek-VL-Chat,DeepSeek-LLM-7B,27.57,26.01
CogVLM,Vicuna-7B,27.48,31.06
Yi-VL,Yi-6B,28.67,30.56
Xcomposer,InternLM-7B,37.17,36.36
Gemini-Pro-Vision,-,30.46,32.39
GPT-4V,-,37.98,37.25