arxiv:2310.03744

Improved Baselines with Visual Instruction Tuning

Published on Oct 5, 2023 · Featured in Daily Papers on Oct 6, 2023
Authors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Abstract

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data samples and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
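
The main architectural change named in the abstract is swapping LLaVA's single linear cross-modal projection for an MLP on top of CLIP-ViT-L-336px patch features. Below is a minimal sketch of such a connector, not the authors' released code: the two-layer Linear-GELU-Linear shape follows the paper's description, while the exact dimensions (1024 for CLIP-ViT-L, 5120 for a 13B LLaMA-style LLM) and the class name MLPProjector are illustrative assumptions.

```python
# Minimal sketch of an MLP vision-language connector in the style of LLaVA-1.5.
# Assumptions: vision_dim=1024 (CLIP-ViT-L), llm_dim=5120 (13B LLaMA-style LLM).
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # Two-layer MLP (Linear -> GELU -> Linear) in place of the
        # single linear projection used by the original LLaVA.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the CLIP
        # vision encoder; the output tokens live in the LLM embedding space.
        return self.proj(patch_features)

# CLIP-ViT-L-336px with 14x14 patches yields 24*24 = 576 patch tokens per image.
tokens = MLPProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 5120])
```

The projected tokens are then concatenated with the text token embeddings before being fed to the language model, which is the standard LLaVA-style setup the abstract builds on.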

Community


Paper author

Check out our LLaVA-1.6 blog post as well!

LLaVA-1.6: Improved reasoning, OCR, and world knowledge
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/

Demo: https://llava.hliu.cc/

There is also an updated technical report here: Improved Baselines with Visual Instruction Tuning


Models citing this paper 13

Datasets citing this paper 1

Spaces citing this paper 20

Collections including this paper 16