MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Published on Mar 14
· Featured in Daily Papers on Mar 15


In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.


Here's my summary:

This paper from Apple presents MM1, a family of multimodal AI models that combine vision and language understanding. The researchers conducted extensive experiments to identify the key factors driving performance in these models, testing different architectural choices and pre-training data mixtures.

My highlights from the paper:

Big one of course: The largest MM1 model (30B dense) achieves state-of-the-art few-shot learning on multimodal benchmarks

Key points:

  • MM1 includes both dense models up to 30B parameters and mixture-of-experts (MoE) variants
  • Image resolution has the biggest impact on performance, more than model size
  • Specific vision-language connector design has little effect
  • Mixing interleaved image+text, caption, and text-only data in pre-training is crucial
  • 5:5:1 ratio of caption, interleaved, and text data works best
  • Synthetic caption data helps for few-shot learning
  • The 30B dense model beats prior SOTA on VQA and captioning tasks
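To make the 5:5:1 mix above concrete, here is a minimal sketch of how such a ratio could translate into sampling probabilities. The source names (`caption`, `interleaved`, `text_only`) and both helper functions are hypothetical and only illustrate the arithmetic; this is not the paper's data pipeline.

```python
import random
from collections import Counter

# Ratio quoted in the summary: caption : interleaved : text-only = 5 : 5 : 1.
# The dictionary keys are illustrative labels, not names from the paper.
MIX_RATIO = {"caption": 5, "interleaved": 5, "text_only": 1}


def mixture_weights(ratio):
    """Normalize integer ratio parts into probabilities that sum to 1."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n data-source labels in proportion to the given ratio."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(ratio)
    return rng.choices(names, weights=[ratio[k] for k in names], k=n)


weights = mixture_weights(MIX_RATIO)
counts = Counter(sample_sources(MIX_RATIO, 11_000))

for name in MIX_RATIO:
    print(f"{name}: weight={weights[name]:.3f}, sampled={counts[name]}")
```

Under this reading, caption and interleaved data each make up 5/11 (about 45%) of the draws and text-only data makes up 1/11 (about 9%).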

The core insight is that deliberate data and architecture choices, not just scale, are key to building performant multimodal models. The MM1 models also exhibit impressive emergent abilities like multi-image reasoning and in-context few-shot learning.


Amazing report. Thanks.

Wen models?

Thanks for providing such a vast collection of cooking recipes for building vision-language models.

I have one question regarding this paper.

Do you have experiments with (a) a simple linear connector without compression of the token number, (b) a linear connector with a compressed token number, (c) a C-Abstractor that compresses the number of image tokens, and (d) a C-Abstractor without compression of the token number?

I would like to know of any additional recipes for compressing image tokens.

Paper author

Good questions. I'm assuming that by "compression token number" you are referring to the connector emitting fewer output image tokens than it received as input. In this work, we only considered connectors that support a reduction in the total number of image tokens, because we train with 16 images in each sequence at a resolution of 378x378 pixels per image. With patch size 14, this results in (378/14)^2 = 729 output patches for every image. Multiplied by 16 images, this gives 11,664 image patches ("tokens") for each sequence (and we use a batch of 512 sequences per pre-training step).
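The arithmetic in the paragraph above can be checked with a short sketch. The constants are the values quoted in the comment; the function and variable names are illustrative only, and the per-batch figure is simply the per-sequence count multiplied by the stated batch size.

```python
# Values quoted in the comment above.
IMAGE_SIZE = 378          # pixels per side
PATCH_SIZE = 14           # pixels per patch side
IMAGES_PER_SEQUENCE = 16
SEQUENCES_PER_BATCH = 512


def patches_per_image(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side * side


per_image = patches_per_image(IMAGE_SIZE, PATCH_SIZE)
per_sequence = per_image * IMAGES_PER_SEQUENCE
per_batch = per_sequence * SEQUENCES_PER_BATCH

print(per_image)     # 729
print(per_sequence)  # 11664
print(per_batch)     # 5971968
```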

This is a lot of image tokens! Instead, we explored using at most 144 tokens per image (roughly a 5x reduction from 729). This number is partially motivated by the results from the HoneyBee paper, which provides some ablations you may be interested in:

How did you choose the empirical setup before you conducted the ablations on the image encoder, resolution, VL connector, and data composition choices?
It confuses me a bit if a different set of fixed (invariant) settings is chosen for each ablation. [The whole work is very impressive, because the number of possible combinations is very large.]

@bmckinz How many tokens are used in pre-training? The paper says 100B tokens are used for pre-training, but going by the paper's settings, 200k (steps) * 4096 (seq) * 512 (bsz) ≈ 400B tokens seem to be used for training.

Paper author

@Tae whoops, you are right! That's a typo. Thanks for pointing this out, it should say 400B.
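For reference, the token count discussed in this exchange can be reproduced with a few lines. This is only a sketch of the multiplication from the question above (steps, sequence length, batch size); the variable names are illustrative.

```python
# Settings quoted in the question above.
TRAINING_STEPS = 200_000
SEQUENCE_LENGTH = 4096   # tokens per sequence
BATCH_SIZE = 512         # sequences per step

total_tokens = TRAINING_STEPS * SEQUENCE_LENGTH * BATCH_SIZE

print(f"{total_tokens:,}")                   # 419,430,400,000
print(f"{total_tokens / 1e9:.0f}B tokens")   # 419B tokens
```

The exact product is about 419B, which matches the rounded 400B figure in the reply and is roughly four times the 100B originally stated.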
