Papers
arxiv:2403.07508

MoAI: Mixture of All Intelligence for Large Language and Vision Models

Published on Mar 12
· Featured in Daily Papers on Mar 13

Abstract

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence (1) visual features, (2) auxiliary features from the external CV models, and (3) language features by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR without enlarging the model size or curating extra visual instruction tuning datasets.

Community

Interesting architectural choices:

  1. Have you tried just feeding in the auxiliary feature into the LLM without compression? Is the compression of all auxiliary features down to just 64 tokens mainly for efficiency?

  2. Have you tried just concatenating all of the features together (marked by begin/end boundary tokens for e.g.) instead of using cross attention in each of the "experts"?

·
Paper author
edited Mar 14

Thanks for interests in our work and for insightful questions!

  1. According to the paper "Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study", we gained the insight that just feeding auxiliary information without compression leads to peformance degradation. We think that it is becuase wrong information get from external computer vision models seems to damage outputs of MoAI directly. As you know, not only external CV models but also other models are not perfect models to predict completely due to many reasons. Therefore, the referred paper assigns learnable parameters to the auxiliary information. The main purpose of compression is for efficiency, of course, but moreover, we expect that MoAI-Compressor corrects wrong information or eliminates non-relevant information for vision language tasks.

  2. We can answer the question by the two types. In case of not using the compressed 64 tokens, auxiliary tokens are, on average, 1500 tokens. Therefore, training MoAI with image tokens + auxiliary tokens + language tokens brings in heavy burden, and wrong information can damage the output of MoAI. On the other hand, in case of using the compressed ones, just concatenating all features you mentioned can be an appropriate infusion strategy only when learning transformer decoder block with supervised fine tuning, LoRA, and QLoRA. However, we want to connect the original transformer decoder block with MoAI-Mixer and train only MoAI-Mixer, instead of directly tuning the transformer decoder. This is because we believe this plug-in-play infusion strategy can boost the utilization of LLMs without editing the original ones. Furthermore, the purpose of employing MoE is based on its effectiveness on the paper "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models" that provided a key in how to effectively harmonize auxiliary features with visual and language features, where the self-attended and cross-attended features are expected to independently capture various aspects compared with the jointly concatenated features.

Your helpful questions improved the contributions of the current paper and the future.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

deleted
This comment has been hidden

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2403.07508 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2403.07508 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2403.07508 in a Space README.md to link it from this page.

Collections including this paper 28