Papers
arxiv:2409.17146

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Published on Sep 25
Β· Submitted by akhaliq on Sep 26

Abstract

Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at https://molmo.allenai.org.

Community

Paper submitter

Really interesting work! but its great performance seems biased for academic dataset because AI2D and Document dataset has good performances while mathvista and mmmu for more reasoning tasks are too low compared with Qwen2-VL. I would like to see the performances of more challenging benchmarks such as MM-Vet, MM-Vet-v2, MathVerse, LLaVA-wilder, and so on.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

This comment has been hidden

Sign up or log in to comment

Models citing this paper 4

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2409.17146 in a dataset README.md to link it from this page.

Spaces citing this paper 20

Collections including this paper 26