Improved Baselines with Visual Instruction Tuning

Published on Oct 5, 2023
· Featured in Daily Papers on Oct 6, 2023


Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available training samples, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
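The key architectural change described in the abstract is replacing LLaVA's original linear projection with a small MLP that maps frozen CLIP-ViT-L-336px patch features into the language model's embedding space. The sketch below is an illustration of that idea, not the authors' released code; the dimensions (1024 for CLIP ViT-L features, 5120 for a 13B LLM) and the class name `MLPProjector` are assumptions chosen for concreteness.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Hypothetical two-layer MLP vision-language connector (LLaVA-1.5 style sketch)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # Linear -> GELU -> Linear, projecting visual features to the LLM hidden size.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the CLIP vision tower
        # returns:        (batch, num_patches, llm_dim) visual tokens fed to the LLM
        return self.proj(image_features)

# Usage sketch: a 336px input to ViT-L/14 yields a 24x24 grid, i.e. 576 patch tokens.
projector = MLPProjector()
visual_tokens = projector(torch.randn(1, 576, 1024))
print(visual_tokens.shape)  # torch.Size([1, 576, 5120])
```

These projected visual tokens would then be concatenated with the text token embeddings before being passed to the language model.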



Paper author

Check out our LLaVA-1.6 blog post as well: LLaVA-1.6: Improved reasoning, OCR, and world knowledge.

There is also an updated technical report: Improved Baselines with Visual Instruction Tuning.


Models citing this paper 13


Datasets citing this paper 1

Spaces citing this paper 23

Collections including this paper 16