xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Abstract
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
Community
The link gives a 404. I assume xgen-mm hasn't been merged yet?
Hi, we plan to make the links public today. Since yesterday was the weekend, we need the infrastructure team to turn things public on Monday.
Hi,
https://huggingface.co/datasets/Salesforce/blip3-ocr-200m
https://huggingface.co/datasets/Salesforce/blip3-grounding-50m
They're both still giving me a 404 error.
Can you please look into this? Thanks in advance. :)
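Once the 404s are resolved, the two datasets linked above should be accessible like any other Hugging Face Hub dataset. The sketch below only assumes the standard Hub URL convention and the usual `datasets.load_dataset` entry point; the repo ids are taken from the links in this thread, and the split/streaming arguments are illustrative guesses, not confirmed details of these datasets.

```python
# Repo ids from the links posted in this thread.
BLIP3_DATASETS = [
    "Salesforce/blip3-ocr-200m",
    "Salesforce/blip3-grounding-50m",
]

def hub_url(repo_id: str) -> str:
    """Build the canonical dataset page URL on the Hugging Face Hub."""
    return f"https://huggingface.co/datasets/{repo_id}"

for repo_id in BLIP3_DATASETS:
    print(hub_url(repo_id))

# Once public, loading should follow the standard `datasets` API, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("Salesforce/blip3-ocr-200m", split="train", streaming=True)
```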
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models (2024)
- CROME: Cross-Modal Adapters for Efficient Multimodal LLM (2024)
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity (2024)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (2024)
- mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (2024)