Model,Language Model,Model Size,Evaluation Method,Avg. P1,Avg. P2,Avg. P3,Scene Understanding,Instance Identity,Instance Attribute,Instance Location,Instance Counting,Spatial Relation,Instance Interaction,Visual Reasoning,Text Recognition,Celebrity Recognition,Landmark Recognition,Chart Understanding,Visual Referring Expression,Science Knowledge,Emotion Recognition,Visual Mathematics,Difference Spotting,Meme Comprehension,Global Video Understanding,Action Recognition,Action Predicion,Procedure Understanding,In-Context Captioning,Interleaved Image-Text Analysis,Text-to-Image Generation,Next Image Prediction,Text-Image Creation [BLIP-2](https://github.com/salesforce/LAVIS),Flan-T5-XL,3B,PPL,41.0,35.3,0.0,58.5,48.6,49.0,39.1,43.4,36.2,48.5,52.9,60.7,51.8,51.4,19.2,43.2,52.4,29.3,22.0,17.8,38.6,42.5,37.7,36.2,22.9,40.0,30.6,0.0,0.0,0.0 [InstructBLIP](https://github.com/salesforce/LAVIS),Flan-T5-XL,3B,PPL,42.2,35.7,0.0,58.9,49.7,61.7,35.1,58.1,34.9,47.4,55.9,61.4,48.5,45.4,26.4,41.7,47.7,34.5,21.2,22.8,35.2,41.5,36.1,40.5,24.5,36.7,34.7,0.0,0.0,0.0 [InstructBLIP-Vicuna](https://github.com/salesforce/LAVIS),Vicuna-7B,7B,PPL,41.4,29.7,0.0,53.6,43.9,49.0,37.8,56.5,35.8,43.3,56.2,57.2,60.3,44.4,27.9,39.2,39.4,23.0,26.5,36.5,55.4,40.4,38.6,31.2,15.6,26.7,32.7,0.0,0.0,0.0 [LLaVA](https://github.com/haotian-liu/LLaVA),LLaMA-7B,7B,PPL,38.7,30.2,0.0,53.8,47.5,38.3,34.2,42.0,34.7,40.2,52.9,46.4,51.8,45.6,30.3,40.2,37.6,34.3,20.5,27.0,50.0,44.1,36.2,25.1,18.6,40.0,20.4,0.0,0.0,0.0 [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4),Vicuna-7B,7B,PPL,39.4,34.1,0.0,56.3,49.2,45.8,37.9,45.3,32.6,47.4,57.1,41.8,55.2,45.2,20.2,41.2,43.3,24.2,25.0,19.0,46.7,39.0,38.7,27.4,28.6,45.8,22.5,0.0,0.0,0.0 [VPGTrans](https://github.com/VPGTrans/VPGTrans),LLaMA-7B,7B,PPL,36.2,23.9,0.0,46.9,38.6,33.6,35.6,27.5,34.4,33.0,50.8,47.6,52.4,38.2,30.1,34.7,36.1,31.5,27.3,24.6,44.0,37.8,38.2,20.9,33.5,19.2,28.6,0.0,0.0,0.0 [MultiModal-GPT](https://github.com/open-mmlab/Multimodal-GPT),LLaMA-7B,7B,PPL,37.4,34.9,0.0,46.9,42.5,32.0,32.3,27.7,29.7,29.9,48.3,35.2,60.9,50.4,24.2,42.2,37.6,32.1,27.3,40.1,56.5,37.6,38.7,25.3,24.4,39.2,30.6,0.0,0.0,0.0 [Otter](https://github.com/Luodian/Otter),LLaMA-7B,7B,PPL,36.4,36.6,0.0,45.9,39.7,31.9,31.6,26.4,32.0,33.0,49.2,39.3,59.7,53.0,23.6,41.2,36.1,37.3,22.0,27.4,46.7,36.6,37.9,26.0,24.8,42.5,30.6,0.0,0.0,0.0 [OpenFlamingo](https://github.com/mlfoundations/open_flamingo),LLaMA-7B,7B,PPL,37.3,35.5,0.0,46.7,42.3,31.7,33.4,27.4,29.8,29.9,47.7,35.6,60.3,49.8,24.2,42.2,39.0,32.1,27.3,39.9,54.9,37.6,38.4,25.2,24.1,38.3,32.7,0.0,0.0,0.0 [LLaMA-AdapterV2](https://github.com/OpenGVLab/LLaMA-Adapter),LLaMA-7B,7B,PPL,37.5,0.0,0.0,45.2,38.5,29.3,33.0,29.7,35.5,39.2,52.0,48.7,58.5,46.4,24.2,41.2,40.1,39.7,23.5,29.1,52.2,41.9,38.2,18.8,20.3,0.0,0.0,0.0,0.0,0.0 [GVT](https://github.com/TencentARC/GVT),Vicuna-7B,7B,PPL,34.4,38.6,0.0,41.7,35.5,31.8,29.5,36.2,32.0,32.0,51.1,35.2,39.4,36.4,25.0,36.2,31.1,20.6,22.7,41.5,59.2,40.4,29.7,26.3,24.1,42.5,34.7,0.0,0.0,0.0 [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl),LLaMA-7B,7B,PPL,39.4,28.9,0.0,49.7,45.3,32.5,36.7,27.3,32.7,44.3,54.7,49.2,70.9,49.6,23.2,44.2,44.0,32.5,23.5,33.5,54.9,42.0,37.8,18.3,19.3,29.2,28.6,0.0,0.0,0.0 [Kosmos-2](https://github.com/microsoft/unilm/tree/master/kosmos-2),Decoder only 1.3B,1.3B,PPL,46.3,23.3,0.0,63.4,57.1,58.5,44.0,41.4,37.9,55.7,60.7,68.1,82.1,51.4,21.2,48.2,43.7,30.7,28.0,25.2,42.8,48.5,40.8,39.5,30.0,24.2,22.5,0.0,0.0,0.0 [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat),Qwen-7B,7B,PPL,43.1,35.5,0.0,56.5,47.6,54.8,46.9,54.2,40.3,55.7,55.0,47.4,62.4,55.6,25.2,43.7,41.2,20.6,28.8,34.3,47.2,39.7,42.8,29.6,19.1,42.5,28.6,0.0,0.0,0.0 [LLaVA-1.5](https://github.com/haotian-liu/LLaVA),vicuna-7B,7B,PPL,47.3,30.8,0.0,63.7,62.4,66.7,51.3,60.2,38.5,47.4,59.8,69.0,60.6,49.8,25.0,45.7,56.7,31.1,24.2,35.7,50.3,46.1,39.4,29.4,28.1,39.2,22.5,0.0,0.0,0.0 [IDEFICS-9b-instruct](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct),LLaMA-7B,7B,PPL,38.0,40.3,0.0,48.2,38.2,37.8,32.9,29.0,32.4,37.1,54.1,45.5,52.4,52.8,22.6,42.7,33.2,26.6,21.2,56.5,48.4,42.7,38.6,23.6,20.5,45.8,34.7,0.0,0.0,0.0 [InternLM-XComposer-VL](https://github.com/InternLM/InternLM-XComposer),InternLM-7B,7B,PPL,59.2,32.1,0.0,74.8,70.5,67.6,60.5,55.3,53.4,76.3,76.1,61.4,86.1,78.0,27.2,60.3,84.8,68.9,25.8,47.7,56.6,58.6,49.9,37.6,24.9,27.5,36.7,0.0,0.0,0.0 [Emu](https://github.com/baaivision/Emu),LLaMA-13B,13B,PPL,42.5,41.1,41.4,59.0,50.0,43.7,37.1,44.3,33.6,49.5,58.3,61.4,68.8,61.6,19.0,45.7,41.5,24.2,26.4,29.3,37.1,41.9,42.7,37.9,21.8,51.7,30.6,46.8,43.2,34.2 [Next-GPT](https://github.com/NExT-GPT/NExT-GPT),vicuna-7B,7B,PPL,30.7,35.6,33.9,36.4,35.1,25.6,29.9,36.1,30.9,39.2,41.7,31.0,30.9,27.4,21.2,34.2,31.8,24.4,17.4,24.2,39.0,35.5,33.8,25.6,24.5,46.7,24.5,45.1,19.8,36.7 [SEED-LLaMA](https://github.com/AILab-CVC/SEED),LLaMA2-Chat-13B,13B,PPL,43.9,43.4,52.3,64.0,55.0,51.3,45.4,43.3,37.9,56.7,59.2,57.0,55.5,52.8,18.8,49.3,44.8,28.8,24.4,29.5,41.5,46.7,39.4,43.9,20.3,54.2,32.7,50.2,40.7,65.8 [GPT-4V](https://openai.com/research/gpt-4v-system-card),\-,-,Generate,68.1,44.2,0.0,77.5,73.9,70.6,61.8,56.8,56.9,74.2,78.5,57.6,91.8,97.4,45.1,71.9,66.1,71.1,43.9,67.9,89.3,64.5,65.7,51.7,63.4,29.2,59.2,0.0,0.0,0.0 [VideoChat](https://github.com/OpenGVLab/Ask-Anything),Vicuna-7B,7B,PPL,37.0,35.3,0.0,44.3,40.7,32.2,36.9,32.9,32.6,42.3,51.1,45.7,35.2,46.8,20.6,43.2,39.4,34.3,19.7,30.3,51.6,41.5,34.0,30.6,27.4,40.0,30.6,0.0,0.0,0.0 [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT),LLaMA-7B,7B,PPL,36.4,31.0,0.0,44.1,37.0,35.8,30.7,44.2,31.1,29.9,49.9,39.8,49.7,40.6,22.0,33.2,37.2,22.4,25.0,46.1,61.4,42.6,32.2,27.0,19.0,37.5,24.5,0.0,0.0,0.0 [Valley](https://github.com/RupertLuo/Valley),LLaMA-13B,13B,PPL,34.5,32.2,0.0,45.3,36.4,33.7,30.6,27.1,31.5,35.1,52.0,35.2,44.9,43.4,23.8,33.2,37.2,26.0,22.7,37.1,52.2,31.5,32.1,21.9,26.5,35.8,28.6,0.0,0.0,0.0