arxiv:2310.03744

Improved Baselines with Visual Instruction Tuning

Published on Oct 5, 2023 · Submitted by akhaliq on Oct 6, 2023
#2 Paper of the day

Abstract

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response-formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data samples and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be made publicly available.
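
To make the "MLP projection" in the abstract concrete, here is a minimal pure-Python sketch of a connector that maps vision-encoder patch tokens into a language model's embedding space. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the patch size, the MLP depth, or the activation, so the patch size of 14, the two-layer Linear -> GELU -> Linear shape, the toy dimensions, and all function names below are assumptions made for this example.

```python
import math
import random


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT sees for a square image."""
    side = image_size // patch_size
    return side * side


def gelu(x: float) -> float:
    """Exact GELU, written with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(in_dim: int, out_dim: int, rng: random.Random):
    """Create a (weight, bias) pair; weight is a list of out_dim rows."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_project(tokens, layer1, layer2):
    """Apply Linear -> GELU -> Linear to each token independently."""
    projected = []
    for tok in tokens:
        hidden = [gelu(h) for h in linear(tok, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)

# Toy widths (hypothetical): real encoders and language models are far wider.
vision_dim, llm_dim = 8, 16

# Assumption: a 336px input with patch size 14 gives a 24 x 24 grid of patches.
n_tokens = num_patch_tokens(336, 14)

# Random stand-ins for the vision encoder's per-patch output features.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(vision_dim)] for _ in range(n_tokens)]

layer1 = make_linear(vision_dim, llm_dim, rng)
layer2 = make_linear(llm_dim, llm_dim, rng)
projected = mlp_project(tokens, layer1, layer2)

print(n_tokens, len(projected), len(projected[0]))  # 576 576 16
```

The connector only changes the width of each token, not the number of tokens: every patch feature is projected independently, so the language model receives one embedding per image patch.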

Community


Paper author

Check out our LLaVA-1.6 blog post as well!

LLaVA-1.6: Improved reasoning, OCR, and world knowledge
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/

Demo: https://llava.hliu.cc/

There is also an updated technical report here: Improved Baselines with Visual Instruction Tuning

Unlocking the Power of Simple Modifications in Multimodal Learning

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 20


Datasets citing this paper 2

Spaces citing this paper 31

Collections including this paper 17