arxiv:2406.11832

Unveiling Encoder-Free Vision-Language Models

Published on Jun 17 · Submitted by akhaliq on Jul 8
#1 Paper of the day

Abstract

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B with mysterious training procedures and undisclosed training data. We believe that EVE provides a transparent and efficient route for developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.
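To make "without vision encoders" concrete, here is a minimal sketch of an encoder-free input path: raw image patches are flattened, mapped into the decoder's embedding space, and placed in the same sequence as the text tokens. This is a generic illustration, not EVE's actual patch embedding layer. The function names, the single linear projection, and the toy dimensions are all assumptions made for the example.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values of one pixel
            patches.append(vec)
    return patches


def linear_project(vectors, weight):
    """Multiply each length-d_in vector by a d_in x d_model weight matrix."""
    d_model = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_model)]
        for v in vectors
    ]


def build_decoder_input(image, text_embeddings, weight, patch):
    """Hypothetical helper: visual tokens followed by text tokens in one sequence."""
    visual_tokens = linear_project(patchify(image, patch), weight)
    return visual_tokens + text_embeddings


# Toy example: a 4 x 6 RGB image (non-square on purpose), 2 x 2 patches, d_model = 8.
random.seed(0)
PATCH, CHANNELS, D_MODEL = 2, 3, 8
image = [[[random.random() for _ in range(CHANNELS)] for _ in range(6)] for _ in range(4)]
weight = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH * CHANNELS)]
text_embeddings = [[0.0] * D_MODEL for _ in range(5)]  # stand-in for 5 embedded text tokens

sequence = build_decoder_input(image, text_embeddings, weight, PATCH)
print(len(sequence), len(sequence[0]))
```

Because the number of visual tokens is just (H / patch) × (W / patch), nothing in this path fixes the resolution or aspect ratio in advance, which is the flexibility the abstract attributes to dropping the encoder.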

Community

Highlights: (arxiv: https://arxiv.org/abs/2406.11832)

🔥 Superior Capability: An original encoder-free LVLM that supports arbitrary image aspect ratios, outperforming its counterpart Fuyu-8B and approaching existing modular encoder-based LVLMs;

🔥 Data Efficiency: Pre-trained on only 33M publicly available data filtered from OpenImages, SAM, and LAION; fine-tuned with 665K LLaVA SFT data for EVE-7B, plus 1.2M extra SFT data for EVE-7B (HD);

🔥 Training Efficiency: Trained with two 8-A100 (40G) nodes in ~9 days or four 8-A100 nodes in ~5 days;

🔥 Pioneering Route: We attempt to provide an efficient, transparent, and practical training strategy and procedure for developing a pure decoder-only architecture across modalities.
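The two training configurations in the Training Efficiency highlight can be compared in GPU-hours with a quick back-of-the-envelope calculation. The day counts are the approximate figures quoted above, so the results are rough estimates. The `gpu_hours` helper is a name introduced here for illustration.

```python
HOURS_PER_DAY = 24


def gpu_hours(nodes, gpus_per_node, days):
    """Total GPU-hours for a run, assuming every GPU is busy for the whole duration."""
    return nodes * gpus_per_node * days * HOURS_PER_DAY


two_node = gpu_hours(nodes=2, gpus_per_node=8, days=9)   # two 8-A100 nodes, ~9 days
four_node = gpu_hours(nodes=4, gpus_per_node=8, days=5)  # four 8-A100 nodes, ~5 days

print(two_node, four_node)  # 3456 3840
```

Under these assumptions, doubling the node count costs roughly 11% more GPU-hours (about 3,840 versus 3,456) in exchange for cutting wall-clock time from about 9 days to about 5.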

why do you use vicuna, it is so old...
