arxiv:2405.15738

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Published on May 24 · Submitted by akhaliq on May 27
#2 Paper of the day
Abstract

High-resolution Large Multimodal Models (LMMs) face the twin challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity but still generate excessive visual tokens. The redundancy in visual tokens, however, is the key problem, as it drives up compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of the LMM in place of the Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. First, since the low-resolution pretrained ConvNeXt underperforms when applied directly to high-resolution inputs, we update it to bridge the gap. Second, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support 1536x1536 inputs while generating only 576 visual tokens and handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series is publicly available at https://github.com/alibaba/conv-llava.
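
The token budget in the abstract follows from the downsampling ratios: a standard four-stage hierarchical backbone reduces spatial resolution by 32x (1536 / 32 = 48, i.e. 48x48 = 2304 tokens), and the added compression stage halves it once more (48 / 2 = 24, i.e. 24x24 = 576 tokens). Below is a minimal, illustrative PyTorch sketch of that shape flow only; the plain strided convolutions, the class name HierarchicalEncoderSketch, and the dimensions dims, extra_dim, and llm_dim are placeholders assumed for this example, not the actual ConvLLaVA implementation (which uses a pretrained ConvNeXt and is available in the linked repository).

```python
import torch
import torch.nn as nn

class HierarchicalEncoderSketch(nn.Module):
    """Illustrative stand-in for a hierarchical visual encoder (not the real ConvNeXt)."""

    def __init__(self, dims=(96, 192, 384, 768), extra_dim=1536, llm_dim=4096):
        super().__init__()
        # Four stages: a 4x patchify stem plus three 2x transitions -> 32x total reduction,
        # so a 1536x1536 image becomes a 48x48 feature map (2304 positions).
        layers, in_ch = [], 3
        for ch, stride in zip(dims, (4, 2, 2, 2)):
            layers += [nn.Conv2d(in_ch, ch, kernel_size=stride, stride=stride), nn.GELU()]
            in_ch = ch
        self.backbone = nn.Sequential(*layers)
        # Successive compression stage (the paper's added stage): one more 2x downsample,
        # 48x48 -> 24x24 = 576 visual tokens.
        self.stage5 = nn.Conv2d(in_ch, extra_dim, kernel_size=2, stride=2)
        # Projector mapping visual features into the LLM embedding space.
        self.projector = nn.Linear(extra_dim, llm_dim)

    def forward(self, images):                       # images: (B, 3, 1536, 1536)
        feats = self.stage5(self.backbone(images))   # (B, extra_dim, 24, 24)
        tokens = feats.flatten(2).transpose(1, 2)    # (B, 576, extra_dim)
        return self.projector(tokens)                # (B, 576, llm_dim)

x = torch.randn(1, 3, 1536, 1536)
print(HierarchicalEncoderSketch()(x).shape)  # torch.Size([1, 576, 4096])
```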

Community

Nice paper🔥

Paper author

It's very nice of you. Thank you very much!

@ChunjiangGe would you like to build a Space? We can sponsor it with ZeroGPU!

Paper author

It's very nice of you. We are testing our model for deployment. We would be glad to build a Space after that.

There's a simple-English summary of the paper here - feedback is welcome! https://www.aimodels.fyi/papers/arxiv/convllava-hierarchical-backbones-as-visual-encoder-large


Models citing this paper 12


Datasets citing this paper 0

No datasets link to this paper.

Cite arxiv.org/abs/2405.15738 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Spaces link to this paper.

Cite arxiv.org/abs/2405.15738 in a Space README.md to link it from this page.

Collections including this paper 12