---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
---

# Model description

`xGen-MM` is the latest series of foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. The series builds on the successful designs of the `BLIP` series, incorporating fundamental enhancements that provide a more robust and capable foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.

In the v1.5 (08/2024) release, we present a series of xGen-MM models, including:

- [🤗 xGen-MM-instruct-interleave (our main instruct model)](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-multi-r-v1.5): `xgen-mm-phi3-mini-instruct-interleave-r-v1.5`
  - This model has higher overall scores than [xGen-MM-instruct](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-singleimg-r-v1.5) on both single-image and multi-image benchmarks.
- [🤗 xGen-MM-base](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1.5): `xgen-mm-phi3-mini-base-r-v1.5`
- [🤗 xGen-MM-instruct](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-singleimg-r-v1.5): `xgen-mm-phi3-mini-instruct-singleimg-r-v1.5`
- [🤗 xGen-MM-instruct-dpo](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5): `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`

For more details, check out our [tech report](https://arxiv.org/pdf/2408.08872), [fine-tuning code](https://github.com/salesforce/LAVIS/tree/xgen-mm), and project page (coming soon).

# Results

### Single-image benchmarks

| Model (Size) | SEED-IMG | SEED-v2 | MMB (dev) | MMStar | MME (norm) | CVB-2D | CVB-3D | RealWorldQA | MMMU (val) | MathVista | SciQA | POPE | TextVQA | Avg. all | Avg. perc. |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Closed-source models | | | | | | | | | | | | | | | |
| GPT-4V<sup>*</sup> | 72.0 | - | 80.8 | 49.7 | 63.3 | 64.3 | 73.8 | 56.5 | 53.8 | 48.2 | 82.1 | 75.4 | - | - | - |
| MM1-3B-Chat (3B) | 68.8 | - | 67.8 | - | 62.9 | - | - | - | 33.9 | - | - | 87.4 | - | - | - |
| Open-source models | | | | | | | | | | | | | | | |
| HPT-1.5-edge (4B) | **72.3** | - | 74.6 | 45.8 | - | - | - | - | 42.6 | **45.1** | 85.4 | **91.0** | - | - | - |
| VILA-1.5-3B (3B) | 67.9 | - | 63.4 | - | - | - | - | - | 33.3 | - | 69.0 | 85.9 | - | - | - |
| VILA-1.5-3B<sup>**</sup> (3B) | 67.9 | 51.9 | 62.4 | 40.3 | 58.5 | 50.1 | 60.3 | 53.3 | 34.1 | 30.6 | 68.9 | 86.9 | 58.1 | 55.6 | 59.1 |
| phi-3-vision (4B) | - | - | 80.5 | - | - | - | - | - | - | 44.5 | 90.8 | 85.8 | 70.9 | - | - |
| phi-3-vision<sup>**</sup> (4B) | 71.0 | 52.7 | 74.2 | <u>47.9</u> | 55.3 | 60.7 | 68.2 | 59.1 | **46.1** | **45.1** | **90.2** | 83.5 | **73.3** | 63.6 | 63.6 |
| **<u>xGen-MM-inst. (4B)</u>** | 71.8 | <u>53.9</u> | <u>76</u> | 46.7 | <u>63.8</u> | <u>66.2</u> | **75.4** | **61.6** | <u>42.8</u> | 39.2 | 85.6 | 87.0 | <u>72.0</u> | <u>64.8</u> | <u>66.9</u> |
| xGen-MM-inst.-interleave (4B) | <u>72.2</u> | **55.5** | **76.8** | **48.1** | **64.4** | **69.3** | <u>72.3</u> | <u>60.5</u> | 41.1 | <u>39.6</u> | <u>88.3</u> | 87.0 | 71.0 | **65.1** | **67.3** |

<sup>*</sup> GPT-4V (gpt-4-1106-preview) results are taken from the third-party [leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard).

<sup>**</sup> These model results were obtained with our evaluation code for a fair comparison.

# How to use

Please check out our [inference notebook](demo.ipynb) for example code to use our model. We also provide an example script for [batch inference](batch_inference.ipynb).
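
For quick reference, the sketch below shows one way a single-image query might be run with `transformers`. It is a minimal sketch only: it assumes the checkpoint loads through `trust_remote_code`, and the image placeholder token, prompt format, and the extra keyword arguments passed to `generate` are illustrative assumptions rather than a documented API. Please treat the [inference notebook](demo.ipynb) as the authoritative reference.

```python
# Minimal single-image inference sketch (illustrative only; see demo.ipynb for
# the tested version). The "<image>" placeholder, the prompt format, and the
# keyword arguments forwarded to generate() are assumptions about this
# checkpoint's remote code, not a documented interface.
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

# Any of the checkpoint repos listed above should work here.
model_id = "Salesforce/xgen-mm-phi3-mini-instruct-multi-r-v1.5"

model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any RGB image works; this URL is only an example.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = image_processor([image], return_tensors="pt")["pixel_values"]

# Assumed prompt with an image placeholder; adjust to the template used in demo.ipynb.
prompt = "<image> Describe this image in one sentence."
language_inputs = tokenizer([prompt], return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **language_inputs,
        pixel_values=pixel_values,
        max_new_tokens=128,
        do_sample=False,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```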

# Reproducibility

Our evaluation is implemented based on [open-compass/VLMEvalKit](https://github.com/open-compass/VLMEvalKit). We will create a PR to that repo to add support for xGen-MM evaluation.
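
As a rough illustration of what such an evaluation run could look like once that support is merged, the sketch below shells out to VLMEvalKit's standard `run.py` entry point from the toolkit's repository root. The model identifier passed to `--model` is a hypothetical placeholder, not a name currently registered in VLMEvalKit.

```python
# Hypothetical sketch: launch VLMEvalKit benchmarks from its repository root.
# The --model value below is an assumed placeholder until xGen-MM support lands.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--data", "SEEDBench_IMG", "MMBench_DEV_EN",  # benchmark names as registered in VLMEvalKit
        "--model", "xgen-mm-phi3-mini-instruct-interleave-r-v1.5",  # hypothetical model name
    ],
    check=True,
)
```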

# Bias, Risks, Limitations, and Ethical Considerations

The main data sources are from the internet, including webpages, image stock sites, and curated datasets released by the research community. We have excluded certain data, such as LAION, due to known CSAM concerns. The model may be subject to bias from the original data sources, as well as bias from LLMs and commercial APIs. We strongly recommend that users assess safety and fairness before deploying the model in downstream applications.

# License

Our code and weights are released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) license.

# Code acknowledgment

Our training code is based on [OpenFlamingo](https://github.com/mlfoundations/open_flamingo), an open-source framework for training large multimodal models, and part of our data preprocessing code is adapted from [LLaVA](https://github.com/haotian-liu/LLaVA). The evaluation code for the instruct models is based on [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), an open-source evaluation toolkit for large vision-language models (LVLMs).

We thank the authors for their open-source implementations.

# Citation

```
@misc{blip3-xgenmm,
  author = {Le Xue and Manli Shu and Anas Awadalla and Jun Wang and An Yan and Senthil Purushwalkam and Honglu Zhou and Viraj Prabhu and Yutong Dai and Michael S Ryoo and Shrikant Kendre and Jieyu Zhang and Can Qin and Shu Zhang and Chia-Chih Chen and Ning Yu and Juntao Tan and Tulika Manoj Awalgaonkar and Shelby Heinecke and Huan Wang and Yejin Choi and Ludwig Schmidt and Zeyuan Chen and Silvio Savarese and Juan Carlos Niebles and Caiming Xiong and Ran Xu},
  title = {xGen-MM (BLIP-3): A Family of Open Large Multimodal Models},
  year = {2024},
  eprint = {2408.08872},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2408.08872},
}
```

# Troubleshooting

1. If you are missing any packages, consider installing the following:

```
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install open_clip_torch==2.24.0
pip install einops
pip install einops-exts
pip install transformers==4.41.1
```