arxiv:2402.03766

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

Published on Feb 6 · Featured in Daily Papers on Feb 7
Abstract

We introduce MobileVLM V2, a family of vision language models significantly improved upon MobileVLM, demonstrating that a careful orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich, high-quality dataset curation can substantially benefit VLM performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM.
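As a practical note on the released checkpoints, the sketch below shows one way to fetch a MobileVLM V2 checkpoint from the Hugging Face Hub for local use. The repository id mtgv/MobileVLM_V2-1.7B is an assumption based on the project's naming, not confirmed by the abstract; consult the GitHub page above for the official checkpoint names and the project's own inference scripts.

from huggingface_hub import snapshot_download

# Download the full checkpoint snapshot into the local Hugging Face cache
# and return the path to the downloaded files.
# NOTE: the repo id below is an assumed name for illustration; verify it
# against the official release at https://github.com/Meituan-AutoML/MobileVLM.
local_dir = snapshot_download(repo_id="mtgv/MobileVLM_V2-1.7B")
print(f"Checkpoint files downloaded to: {local_dir}")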


Models citing this paper 4

Datasets citing this paper 0


Spaces citing this paper 2

Collections including this paper 5