---
license: apache-2.0
datasets:
- TIGER-Lab/MMEB-train
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
tags:
- Retrieval
- Multimodal
- Embedding
pipeline_tag: image-text-to-text
---

<div align="center">

<h1>UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning</h1>

<a href="https://scholar.google.com/citations?hl=zh-CN&user=9etrpbYAAAAJ">Tiancheng Gu*</a>,</span>
<a href="https://kaicheng-yang0828.github.io">Kaicheng Yang*</a>,</span>
<a href="https://kcz358.github.io/">Kaichen Zhang</a>,</span>
<a href="https://scholar.google.com/citations?hl=zh-CN&user=1ckaPgwAAAAJ">Xiang An</a>,</span>
Ziyong Feng,</span> \
<a href="https://scholar.google.com/citations?hl=en&user=LatWlFAAAAAJ">Yueyi Zhang</a>,</span>
<a href="https://weidong-tom-cai.github.io">Weidong Cai</a>,</span>
<a href="https://jiankangdeng.github.io">Jiankang Deng</a>,</span>
<a href="https://lidongbing.github.io">Lidong Bing</a></span>

[![Project Website](https://img.shields.io/badge/🏑-Project%20Website-deepgray)](https://garygutc.github.io/UniME-v2/)
[![Paper](https://img.shields.io/badge/πŸ“„-Paper-b31b1b.svg)]()
[![GitHub](https://img.shields.io/badge/⭐-GitHub-black?logo=github)](https://github.com/GaryGuTC/UniME-v2)
</div>

## πŸ’‘ Highlights
- We introduce an MLLM-as-a-Judge pipeline for hard negative mining that leverages the advanced understanding capabilities of MLLMs to assess the semantic alignment of each query-candidate pair within a globally retrieved set of potential hard negatives.

<div align="center">
<img src="Figures/method1.jpg" width="95%">
</div>

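The mining pipeline itself ships with the training code rather than this card, but the idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `judge_match_score` is a hypothetical stand-in for prompting the judge MLLM to rate query-candidate alignment, and the retrieval size and false-negative threshold are placeholder values, not the paper's settings.

```python
import torch

def mine_hard_negatives(query_emb, cand_embs, positive_idx, judge_match_score,
                        top_k=50, match_threshold=0.9):
    """Illustrative sketch of MLLM-as-a-Judge hard negative mining.

    query_emb:         (d,) L2-normalized query embedding
    cand_embs:         (N, d) L2-normalized candidate embeddings
    positive_idx:      index of the annotated positive candidate
    judge_match_score: callable(cand_idx) -> float in [0, 1]; stands in for
                       prompting the judge MLLM on the query-candidate pair
    """
    # 1) Global retrieval: the most similar candidates form the potential hard negative set.
    sims = cand_embs @ query_emb
    retrieved = torch.topk(sims, k=min(top_k, sims.numel())).indices.tolist()

    hard_negatives, judge_scores = [], {}
    for idx in retrieved:
        if idx == positive_idx:
            continue
        score = judge_match_score(idx)   # judge MLLM assesses semantic alignment
        judge_scores[idx] = score
        if score < match_threshold:      # very high scores likely mark false negatives
            hard_negatives.append(idx)

    # The judge scores of the kept candidates can later serve as soft labels
    # for the distribution alignment objective (see the next sketch).
    return hard_negatives, judge_scores
```
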
- We present UniME-V2, a universal multimodal embedding model trained with an MLLM-judgment-based distribution alignment framework. By using the judge's semantic matching scores as soft labels, the model captures fine-grained semantic differences between candidates, which substantially improves its discriminative capability. We also propose UniME-V2-Reranker, a reranking model trained on the high-quality, diverse hard negatives through joint pairwise and listwise optimization.

<div align="center">
<img src="Figures/method2.jpg" width="60%">
</div>

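As a rough sketch of the distribution alignment idea (not the released training code; the temperatures and exact loss formulation here are assumptions), the judge's matching scores define a soft target distribution over each query's candidate set, and the embedding model's similarity distribution is pulled toward it:

```python
import torch
import torch.nn.functional as F

def judge_alignment_loss(student_sims, judge_scores, tau_student=0.05, tau_judge=1.0):
    """Illustrative sketch: align the embedding model's similarity distribution
    with the judge MLLM's semantic matching scores (used as soft labels).

    student_sims: (B, K) cosine similarities between each query and its K candidates
    judge_scores: (B, K) judge MLLM matching scores for the same pairs
    """
    teacher = F.softmax(judge_scores / tau_judge, dim=-1)              # soft labels
    student_log_probs = F.log_softmax(student_sims / tau_student, dim=-1)
    return F.kl_div(student_log_probs, teacher, reduction="batchmean")

# Toy usage with random tensors
loss = judge_alignment_loss(torch.randn(4, 8), torch.rand(4, 8))
```
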
## πŸ› οΈ Implementation

### πŸš€ Quick Start
```bash
git clone https://github.com/deepglint/UniME-v2.git
cd UniME-v2
```

```bash
conda create -n uniMEv2 python=3.10 -y
conda activate uniMEv2
pip install -r requirements.txt

# Optional: install Flash Attention for acceleration
# wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

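The demo below loads checkpoints from local `models/` directories. If you have not downloaded them yet, one option is `huggingface_hub`; the repository ID below is a placeholder, so substitute the released UniME-V2 checkpoint you actually want to use:

```python
from huggingface_hub import snapshot_download

# Placeholder repo_id -- replace with the actual UniME-V2 checkpoint repository.
snapshot_download(repo_id="<org>/UniME-V2_qwen2VL_2B",
                  local_dir="models/UniME-V2_qwen2VL_2B")
```
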
### πŸ” Embedding model & Rerank model
```python
import torch
from torch.nn import functional as F
from utils.utils import init_model_and_processor, prepare_stage_data, parse_answer_index

device = "cuda"
embedding = False  # True: run the embedding-model demo; False: run the rerank-model demo
if embedding:
    model_name = "models/UniME-V2_qwen2VL_2B"
    # model_name = "models/UniME-V2_qwen2VL_7B"
    # model_name = "models/UniME-V2_LLaVA_onevision_8B"
    text = "A man is crossing the street with a red car parked nearby."
    image_path = "Figures/demo.png"
else:
    model_name = "models/UniME-v2-rerank_qwen25VL_7B"
    text = ["A man is crossing the street with a red car parked nearby.",  #! Target text
            "A woman is walking her dog with a blue bicycle leaning nearby.",
            "A child is riding a scooter past a green truck stopped nearby.",
            "A couple is waiting for the bus beside a yellow taxi parked nearby.",
            "A jogger is running along the path with a black motorcycle parked nearby."]
    image_path = "Figures/demo.png"

model, processor = init_model_and_processor(model_name, device, embedding=embedding)

if embedding:
    inputs_image, inputs_txt = prepare_stage_data(model_name, processor, text, image_path, embedding=embedding)
    inputs_image = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs_image.items()}
    inputs_txt = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs_txt.items()}
    with torch.no_grad():
        # Use the final token's last hidden state as the embedding, then L2-normalize.
        emb_text = model(**inputs_txt, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
        emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
        emb_text = F.normalize(emb_text, dim=-1)
        emb_image = F.normalize(emb_image, dim=-1)
        score = emb_image @ emb_text.T
    print("Score: ", score.item())  # qwen2VL 2B: Score: 0.62109375
else:
    inputs = prepare_stage_data(model_name, processor, text, image_path, embedding=embedding)
    inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=128, output_scores=True, return_dict_in_generate=True, do_sample=False).sequences
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs['input_ids'], generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print("Rerank Answer: ", parse_answer_index(output_text[0]))  # qwen25VL 7B: Rerank Answer: 0
```

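In practice the two models are intended to be complementary: the embedding model handles fast large-scale retrieval via cosine similarity between normalized embeddings, while the reranker is applied to the retrieved top candidates. In the rerank demo above, `parse_answer_index` returns the position of the selected candidate in the `text` list, so `0` corresponds to the target caption.
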
## πŸ“Š Results

### 🌈 Diversity Retrieval
<div align="center">
<img src="Figures/UniME_v2_diversity_retrieval.png" width="90%">
</div>

### πŸ† MMEB
<div align="center">
<img src="Figures/UniME_v2_MMEB.png" width="90%">
</div>

## πŸ’¬ Support
| Team Member | Email |
|-------------|-------|
| **Tiancheng Gu** | [![Email](https://img.shields.io/badge/πŸ“§-gtcivy01@outlook.com-red?logo=gmail)](mailto:gtcivy01@outlook.com) |
| **Kaicheng Yang** | [![Email](https://img.shields.io/badge/πŸ“§-kaichengyang@deepglint.com-red?logo=gmail)](mailto:kaichengyang@deepglint.com) |

## πŸ–ŠοΈ Citation
If you find this repository useful, please use the following BibTeX entry for citation.
```latex
Coming Soon !!!

@misc{gu2025unime,
      title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs},
      author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
      year={2025},
      eprint={2504.17432},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.17432},
}
```

<div align="center">
⭐ Don't forget to star this repository if you find it helpful!
</div>