---
license: mit
language:
- en
base_model:
- openai/clip-vit-base-patch16
tags:
- multimodal-retrieval
- embedding-model
---
<h1 align="center">MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval</h1>
<p align="center">
<a href="https://arxiv.org/abs/2412.14475">
<img alt="Build" src="https://img.shields.io/badge/cs.CV-arXiv%3A2412.14475-B31B1B.svg">
</a>
<a href="https://github.com/VectorSpaceLab/MegaPairs">
<img alt="Build" src="https://img.shields.io/badge/Github-Code-blue">
</a>
<a href="https://huggingface.co/datasets/JUNJIE99/MegaPairs">
<img alt="Build" src="https://img.shields.io/badge/🤗 Datasets-MegaPairs-yellow">
</a>
</p>
<p align="center">
<a href="https://huggingface.co/JUNJIE99/MMRet-base">
<img alt="Build" src="https://img.shields.io/badge/🤗 Model-MMRet_base-yellow">
</a>
<a href="https://huggingface.co/JUNJIE99/MMRet-large">
<img alt="Build" src="https://img.shields.io/badge/🤗 Model-MMRet_large-yellow">
</a>
<a href="https://huggingface.co/JUNJIE99/MMRet-MLLM">
<img alt="Build" src="https://img.shields.io/badge/🤗 Model-MMRet_MLLM-yellow">
</a>
</p>
## News
```2024-12-27``` MMRet-CLIP models are released on Hugging Face: [MMRet-base](https://huggingface.co/JUNJIE99/MMRet-base) and [MMRet-large](https://huggingface.co/JUNJIE99/MMRet-large).

```2024-12-19``` Release of our paper: [MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval](https://arxiv.org/pdf/2412.14475).
## Release Plan
- [x] Paper
- [x] MMRet-base and MMRet-large models
- [ ] MMRet-MLLM model
- [ ] MegaPairs Dataset
- [ ] Evaluation code
- [ ] Fine-tuning code
## Introduction
In this project, we introduce **MegaPairs**, a novel data synthesis method that leverages open-domain images to create *heterogeneous KNN triplets* for universal multimodal retrieval. Our MegaPairs dataset contains over 26 million triplets, and we have trained a series of multimodal retrieval models, **MMRets**, including MMRet-CLIP (base and large) and MMRet-MLLM.
MMRet models achieve state-of-the-art performance on four popular zero-shot composed image retrieval benchmarks and on the Massive Multimodal Embedding Benchmark (MMEB). Extensive experiments demonstrate the ***efficiency, scalability, and generalization*** of MegaPairs. Please refer to our [paper](https://arxiv.org/abs/2412.14475) for more details.
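
For intuition, each MegaPairs triplet pairs a multimodal query (a query image plus a textual instruction describing the relation) with a target image. The sketch below is purely illustrative; the field names and file paths are hypothetical and do not reflect the actual dataset schema, which is defined in the paper and the upcoming dataset release.

```python
# Hypothetical illustration of one "heterogeneous KNN triplet".
# Field names and paths are made up for clarity; see the paper /
# the MegaPairs dataset release for the real schema.
triplet = {
    "query_image": "images/000001.jpg",                 # source image
    "instruction": "show the same landmark at night",   # relation text
    "target_image": "images/000002.jpg",                # positive (target) image
}
```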
## Model Usage
### 1. MMRet-CLIP Models
You can easily use the MMRet-CLIP models with the ```transformers``` library:
```python
import torch
from transformers import AutoModel

MODEL_NAME = "JUNJIE99/MMRet-base"  # or "JUNJIE99/MMRet-large"

model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)  # You must set trust_remote_code=True
model.set_processor(MODEL_NAME)
model.eval()

with torch.no_grad():
    # Composed query: an image plus a textual modification instruction
    query = model.encode(
        images="./assets/cir_query.png",
        text="Make the background dark, as if the camera has taken the photo at night",
    )
    # Candidate images to retrieve from
    candidates = model.encode(
        images=["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"],
    )
    # Similarity between the query embedding and each candidate embedding
    scores = query @ candidates.T
print(scores)
```
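
The resulting `scores` tensor holds one similarity value per candidate, so retrieval reduces to sorting. The snippet below is a minimal follow-up sketch, assuming `scores` has shape `(1, num_candidates)` as produced by the call above; it is not part of the official example.

```python
# Rank the candidate images by similarity to the composed query (highest first).
candidate_paths = ["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"]
ranking = torch.argsort(scores.squeeze(0), descending=True)
for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. {candidate_paths[idx]} (score={scores[0, idx].item():.4f})")
```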
### 2. MMRet-MLLM Models
```Will be released soon.```
## Model Performance
### Zero-Shot Composed Image Retrieval
MMRet sets a new performance benchmark in zero-shot composed image retrieval tasks. On the CIRCO benchmark, our MMRet-base model, with only 149 million parameters, surpasses all previous models, including those with 50 times more parameters. Additionally, MMRet-MLLM achieves an 8.1% improvement over the previous state-of-the-art model.
<img src="./assets/res-zs-cir.png" width="800">
### Zero-Shot Performance on MMEB
MMRet-MLLM achieves state-of-the-art zero-shot performance on the Massive Multimodal Embedding Benchmark (MMEB), despite being trained only on the ImageText-to-Image paradigm. This demonstrates the excellent generalization capability of MegaPairs for multimodal embedding.
<img src="./assets/res-zs-mmeb.png" width="800">
### Fine-Tuning Performance on MMEB
After fine-tuning on downstream tasks, MMRet-MLLM maintains its leading performance. Notably, it surpasses the previous state-of-the-art by 7.1% on the MMEB out-of-distribution (OOD) set. These results demonstrate the robust generalization capability of MMRet-MLLM and highlight the potential of MegaPairs as foundational training data for universal multimodal embedding.
<img src="./assets/res-ft-mmeb.png" width="800">
### Performance Scaling
MegaPairs showcases **scalability**: MMRet-base improves as training data increases. It also demonstrates **efficiency**: with just 0.5M training samples, MMRet-base significantly outperforms MagicLens, which uses the same CLIP-base backbone and was trained on 36.7M samples.
<img src="./assets/res-scaling.png" width="800">
## License
The annotations for MegaPairs and the MMRet models are released under the [MIT License](LICENSE). The images in MegaPairs originate from the [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) dataset, which is released under the CC BY 4.0 license.
## Citation
If you find this repository useful, please consider giving it a star ⭐ and citing our work:
```bibtex
@article{zhou2024megapairs,
  title={MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval},
  author={Zhou, Junjie and Liu, Zheng and Liu, Ze and Xiao, Shitao and Wang, Yueze and Zhao, Bo and Zhang, Chen Jason and Lian, Defu and Xiong, Yongping},
  journal={arXiv preprint arXiv:2412.14475},
  year={2024}
}
```