---
license: apache-2.0
datasets:
- TIGER-Lab/MMEB-train
language:
- en
metrics:
- accuracy
base_model:
- microsoft/Phi-3.5-vision-instruct
library_name: transformers
tags:
- Embedding
---

# VLM2Vec

This repo contains the code and data for [VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks](https://arxiv.org/abs/2410.05160). In this paper, we aim to build a unified multimodal embedding model for any task. Our model is built by converting an existing, well-trained VLM (Phi-3.5-V) into an embedding model. The basic idea is to append an [EOS] token to the end of the sequence and use its representation as the embedding of the multimodal input.
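
The last-token idea described above can be illustrated with a small NumPy sketch. This is not the repository's implementation: the function name `last_token_pool` is hypothetical, and it assumes right-padded sequences with a 0/1 attention mask. It only shows how the hidden state at the final (EOS) position could be selected and L2-normalized, which corresponds roughly to the `pooling='last'` and `normalize=True` options used later in this card.

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask, normalize=True):
    """Pick the hidden state of the last real token in each sequence.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens and
        0 for padding. Right padding is ASSUMED in this sketch.
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float32)
    attention_mask = np.asarray(attention_mask, dtype=int)
    # Index of the last non-padding position in every sequence.
    last_idx = attention_mask.sum(axis=1) - 1
    batch_idx = np.arange(hidden_states.shape[0])
    reps = hidden_states[batch_idx, last_idx]
    if normalize:
        # Unit-length embeddings make the dot product a cosine similarity.
        reps = reps / np.linalg.norm(reps, axis=-1, keepdims=True)
    return reps

# Toy input: 2 sequences, 3 positions, 2 hidden dimensions.
hidden = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
mask = np.array([[1, 1, 0], [1, 1, 1]])
embeddings = last_token_pool(hidden, mask)
```

With left-padded batches the last position would be `seq_len - 1` for every row instead, so the indexing step would need to change accordingly.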

<img width="1432" alt="abs" src="https://raw.githubusercontent.com/TIGER-AI-Lab/VLM2Vec/refs/heads/main/figures//train_vlm.png">

## Release
Our model is trained on MMEB-train with contrastive learning and evaluated on MMEB-eval. We use only in-batch negatives for training. Our best results come from LoRA training with a batch size of 1024. We also provide a checkpoint from full-model training with a batch size of 2048. Results on the 36 evaluation datasets are shown in the Experimental Results section.
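
As a rough illustration of contrastive training with in-batch negatives, the sketch below computes an InfoNCE-style loss in NumPy. It is a minimal sketch under stated assumptions, not the training code from the repository: the function name `in_batch_contrastive_loss` and the temperature value `0.02` are hypothetical choices for this example. Row `i` of the query matrix is paired with row `i` of the target matrix, and every other target in the batch acts as a negative.

```python
import numpy as np

def in_batch_contrastive_loss(qry_reps, tgt_reps, temperature=0.02):
    """InfoNCE-style loss where other targets in the batch are negatives.

    qry_reps, tgt_reps: arrays of shape (batch, dim); row i of each forms a
        positive pair. The temperature value is an ASSUMPTION of this sketch.
    """
    qry_reps = np.asarray(qry_reps, dtype=np.float64)
    tgt_reps = np.asarray(tgt_reps, dtype=np.float64)
    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = qry_reps @ tgt_reps.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy against the diagonal (matching query/target pairs).
    return float(-np.mean(np.diag(log_probs)))

# Perfectly aligned pairs give a loss close to zero.
aligned = in_batch_contrastive_loss(np.eye(4), np.eye(4))
# Mismatched pairs are penalized heavily.
shuffled = in_batch_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Because every other example in the batch serves as a negative, a larger batch exposes each query to more negatives per step, which is one reason batch size matters for this style of training.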
### Train/Eval Data
 - Train data: https://huggingface.co/datasets/TIGER-Lab/MMEB-train
 - Eval data: https://huggingface.co/datasets/TIGER-Lab/MMEB-eval

### VLM2Vec Checkpoints
 - [MMEB.lora8.bs1024](https://huggingface.co/TIGER-Lab/MMEB.lora8.bs1024/)
 - [MMEB.fullmodel.bs2048](https://huggingface.co/TIGER-Lab/MMEB.fullmodel.bs2048/)

### GitHub
 - [GitHub](https://github.com/TIGER-AI-Lab/VLM2Vec)

### Experimental Results
VLM2Vec outperforms the existing baselines by a large margin on MMEB-eval. The figure below summarizes the comparison.
<img width="900" alt="abs" src="https://raw.githubusercontent.com/TIGER-AI-Lab/VLM2Vec/refs/heads/main/figures//vlm2vec_results.png">

## How to use VLM2Vec

First, clone our GitHub repository and install the dependencies:
```bash
git clone https://github.com/TIGER-AI-Lab/VLM2Vec.git
cd VLM2Vec
pip install -r requirements.txt
```

Then, from the repository root, run the following Python code:
```python
from src.model import MMEBModel
from src.arguments import ModelArguments
import torch
from transformers import AutoProcessor
from PIL import Image

model_args = ModelArguments(
    model_name='TIGER-Lab/VLM2Vec-Full',
    pooling='last',
    normalize=True)

model = MMEBModel.load(model_args)
model.eval()
model = model.to('cuda', dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained(
    model_args.model_name,
    trust_remote_code=True,
    num_crops=4,
)

# Image + Text -> Text
inputs = processor('<|image_1|> Represent the given image with the following question: What is in the image', [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

string = 'A cat and a dog'
inputs = processor(string)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## A cat and a dog = tensor([[0.2969]], device='cuda:0', dtype=torch.bfloat16)

string = 'A cat and a tiger'
inputs = processor(string)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## A cat and a tiger = tensor([[0.2080]], device='cuda:0', dtype=torch.bfloat16)

# Text -> Image
inputs = processor('Find me an everyday image that matches the given caption: A cat and a dog.',)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

string = '<|image_1|> Represent the given image.'
inputs = processor(string, [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## <|image_1|> Represent the given image. = tensor([[0.3105]], device='cuda:0', dtype=torch.bfloat16)

inputs = processor('Find me an everyday image that matches the given caption: A cat and a tiger.',)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

string = '<|image_1|> Represent the given image.'
inputs = processor(string, [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## <|image_1|> Represent the given image. = tensor([[0.2158]], device='cuda:0', dtype=torch.bfloat16)
```
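
The example above scores one candidate at a time. When several candidates need to be compared against a single query, the embeddings can be ranked in one pass. The sketch below is a hypothetical helper (`rank_candidates` is not part of the repository) that works on plain NumPy arrays; it assumes the model outputs have been converted beforehand, for instance with `tgt_output.float().cpu().numpy()`, and it uses cosine similarity as the score.

```python
import numpy as np

def rank_candidates(qry_rep, cand_reps, labels):
    """Rank candidate embeddings by cosine similarity to one query.

    qry_rep: array of shape (dim,) or (1, dim)
    cand_reps: array of shape (num_candidates, dim)
    labels: one human-readable label per candidate
    Returns a list of (label, score) pairs, best match first.
    """
    qry = np.asarray(qry_rep, dtype=np.float32).reshape(-1)
    cands = np.asarray(cand_reps, dtype=np.float32)
    # Re-normalize so the dot product is a cosine similarity even if the
    # inputs were not unit length.
    qry = qry / np.linalg.norm(qry)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = cands @ qry
    order = np.argsort(-scores)  # descending by score
    return [(labels[i], float(scores[i])) for i in order]

# Toy 2-D embeddings standing in for real model outputs.
query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
ranking = rank_candidates(query, candidates, ["a", "b", "c"])
```

In the toy input, candidate `c` points in the same direction as the query and is therefore ranked first, followed by `b` and then `a`.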

## Citation
```
@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}
```