Sungyeon committed (verified) · Commit 90a2632 · Parent: 0eb4cc9

Upload README.md with huggingface_hub

Files changed (1): README.md (+229 −3)

README.md CHANGED
@@ -1,3 +1,229 @@
- ---
- license: mit
- ---

# GENIUS

This repo contains the codebase for the CVPR 2025 paper "[GENIUS: A Generative Framework for Universal Multimodal Search](https://arxiv.org/pdf/2503.19868)".

<div align="center" style="line-height: 1;">
<a href="https://arxiv.org/pdf/2503.19868" target="_blank" style="margin: 2px;"><img alt="arXiv" src="https://img.shields.io/badge/📄%20arXiv-2503.19868-b31b1b?color=b31b1b&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> <a href="https://sung-yeon-kim.github.io/project_pages/GENIUS/index.html" target="_blank" style="margin: 2px;"><img alt="Project Page" src="https://img.shields.io/badge/🌐%20Project%20Page-GENIUS-ff6b6b?color=ff6b6b&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> <a href="https://github.com/sung-yeon-kim/GENIUS" target="_blank" style="margin: 2px;"><img alt="GitHub" src="https://img.shields.io/badge/💻%20GitHub-GENIUS-2ea44f?color=2ea44f&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> <a href="https://huggingface.co/Sungyeon/GENIUS" target="_blank" style="margin: 2px;"><br><img alt="HuggingFace" src="https://img.shields.io/badge/🤗%20Checkpoint-GENIUS-ffd700?color=ffd700&logoColor=black" style="display: inline-block; vertical-align: middle;"/></a> <a href="LICENSE" target="_blank" style="margin: 2px;"><img alt="License" src="https://img.shields.io/badge/📜%20License-MIT-4b0082?color=4b0082&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>
</div>

## Introduction

We propose **GENIUS**, a **universal generative retrieval framework** that supports diverse tasks across multiple modalities. By learning discrete, modality‐decoupled IDs via **semantic quantization**, GENIUS encodes multimodal data into compact identifiers and performs constant‐time retrieval with competitive accuracy.

<p align="center">
  <img src="misc/cvpr_25_genius.png" alt="GENIUS Overview" width="55%">
</p>

### ✨ Key Advantages

- **Universal Retrieval**
  A single model handles various retrieval tasks, including image‐to‐image, text‐to‐text, image‐to‐text, text‐to‐image, and their combinations.

- **Fast Retrieval**
  Constant‐time lookup via discrete ID matching, independent of candidate pool size.

- **Competitive Accuracy**
  Comparable to—and sometimes better than—embedding‐based methods, while significantly reducing inference cost.

## Overview

GENIUS consists of three key components that work together in a three-stage training pipeline:

1. **Multimodal Encoder (CLIP-SF)**
   Extracts joint image/text features using a shared backbone. We leverage **UniIR's score‐fusion CLIP model** to learn cross‐modal relations without extra pretraining, with pretrained checkpoints available on [Hugging Face](https://huggingface.co/TIGER-Lab/UniIR/blob/main/checkpoint/CLIP_SF/clip_sf_large.pth).

2. **Modality-Decoupled Quantizer**
   Compresses continuous embeddings into discrete, layered ID codes that capture both modality and semantic information. Through **residual quantization training**, it learns to encode both images and text into layered, discrete IDs (a toy sketch follows this list):
   - First code: **modality indicator** (0 = image, 1 = text, 2 = image‐text)
   - Subsequent codes: **semantic features** (objects → attributes → context)

   <p align="center">
     <img src="misc/semantic_quantization.png" alt="Semantic Quantization" width="95%">
   </p>

3. **ID Generator**
   A **sequence‐to‐sequence decoder** that generates target IDs from query embeddings. During training, it learns to predict discrete IDs from various input queries (images, text, or pairs with an instruction). We employ **query augmentation** (query-target mixing) to improve generalization and apply **constrained beam search** during decoding to enforce valid ID sequences.
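
To make the layered ID structure concrete, here is a minimal PyTorch sketch of residual quantization with a modality prefix. The function name, codebook count, and sizes are illustrative placeholders, not the repository's API:

```python
# Toy sketch of modality-decoupled residual quantization (illustrative only;
# the codebook depth/size and function name are hypothetical, not the repo's API).
import torch

def quantize_to_id(embedding, codebooks, modality_code):
    """Map a continuous embedding to a layered discrete ID sequence."""
    ids = [modality_code]            # first code: 0 = image, 1 = text, 2 = image-text
    residual = embedding.clone()
    for codebook in codebooks:       # coarse-to-fine semantic levels
        # pick the nearest code at this level
        idx = ((codebook - residual) ** 2).sum(dim=1).argmin().item()
        ids.append(idx)
        residual = residual - codebook[idx]   # quantize only what remains
    return ids

# Example: 4 semantic levels, 256 codes each, 512-dim embeddings.
torch.manual_seed(0)
codebooks = [torch.randn(256, 512) for _ in range(4)]
image_embedding = torch.randn(512)
print(quantize_to_id(image_embedding, codebooks, modality_code=0))
# -> a 5-token ID (modality code + 4 semantic codes) the generator learns to produce
```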

---

## Installation

Clone the repository and create the Conda environment:

```bash
git clone https://github.com/sung-yeon-kim/GENIUS.git
cd GENIUS
conda env create -f genius_env.yml
```

## Usage

### Download Pretrained CLIP-SF (Stage 0)

We use UniIR's score-fusion CLIP model in place of a separate encoder pretraining stage.

```bash
mkdir -p checkpoint/CLIP_SF
wget https://huggingface.co/TIGER-Lab/UniIR/resolve/main/checkpoint/CLIP_SF/clip_sf_large.pth -O checkpoint/CLIP_SF/clip_sf_large.pth
```
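
If you prefer Python, the same checkpoint can be fetched with `huggingface_hub`; with `local_dir` the file keeps the repo's subdirectory layout, so it lands in `checkpoint/CLIP_SF/`:

```python
# Alternative to wget: download the CLIP-SF checkpoint via huggingface_hub.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TIGER-Lab/UniIR",
    filename="checkpoint/CLIP_SF/clip_sf_large.pth",
    local_dir=".",  # preserves the repo's subdirectory structure
)
print(path)  # e.g. ./checkpoint/CLIP_SF/clip_sf_large.pth
```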

### Feature Extraction (Preprocessing)

#### Training Data

Extracts CLIP features for the training set → `extracted_embed/CLIP_SF/train`.

```bash
# Navigate to the feature extraction directory
cd feature_extraction

# Run feature extraction for the training data
bash run_feature_extraction_train.sh
```

#### Candidate Pool

Extracts CLIP features for the retrieval candidate pool → `extracted_embed/CLIP_SF/cand`.

```bash
# Run feature extraction for the candidate pool
bash run_feature_extraction_cand.sh
```

### Residual Quantization (Stage 1)

```bash
cd models/residual_quantization
vim configs_scripts/large/train/inbatch/inbatch.yaml   # edit the config (data paths, batch size, etc.)
bash configs_scripts/large/train/inbatch/run_inbatch.sh
```

### Generator Training (Stage 2)

```bash
cd models/generative_retriever
vim configs_scripts/large/train/inbatch/inbatch.yaml   # edit the config (data paths, batch size, etc.)
bash configs_scripts/large/train/inbatch/run_inbatch.sh
```

### Inference

1. Extract CLIP features for the candidate pool (if not already done):
   ```bash
   cd feature_extraction
   bash run_feature_extraction_cand.sh
   ```

2. Compile `trie_cpp` (recommended for faster inference):
   ```bash
   cd models/generative_retriever/trie_cpp
   c++ -O3 -Wall -shared -std=c++17 -fPIC \
       $(python3 -m pybind11 --includes) \
       trie_cpp.cpp -o trie_cpp$(python3-config --extension-suffix)
   ```

3. Run evaluation:
   ```bash
   cd models/generative_retriever
   bash configs_scripts/large/eval/inbatch/run_eval.sh
   ```

> For inference, you can choose from three trie implementations: `trie_cpp` (fastest), `trie` (Python), and `marisa` (alternative).
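
Conceptually, every backend does the same thing: it stores all valid candidate IDs in a trie and, at each decoding step, restricts beam search to tokens that extend a valid prefix. A minimal Python sketch of that idea (not the repository's exact interface):

```python
# Toy trie for constrained decoding (illustrative; the repo's trie_cpp/trie/
# marisa backends implement this idea behind their own interfaces).
class Trie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:            # each seq is one valid candidate ID
            node = self.root
            for token in seq:
                node = node.setdefault(token, {})

    def allowed_next(self, prefix):
        """Return the tokens that may legally follow `prefix`."""
        node = self.root
        for token in prefix:
            node = node.get(token)
            if node is None:
                return []
        return list(node.keys())

trie = Trie([[0, 113, 42], [0, 113, 7], [1, 88, 5]])
print(trie.allowed_next([0, 113]))   # [42, 7] -- beam search may only pick these

# With Hugging Face generate(), such a trie typically plugs in via
#   prefix_allowed_tokens_fn=lambda batch_id, ids: trie.allowed_next(ids.tolist())
```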

## Model Checkpoints

We provide GENIUS model checkpoints in the 🤗 [Hugging Face repository](https://huggingface.co/Sungyeon/GENIUS):

### Stage 1: Residual Quantization Model
- **Model**: [`rq_clip_large.pth`](https://huggingface.co/Sungyeon/GENIUS/blob/main/checkpoint/rq_clip_large.pth)
- **Description**: Learns to encode multimodal data into discrete IDs through residual quantization
- **Size**: ~1.2 GB

### Stage 2: Generator Model
- **Model**: [`GENIUS_t5small.pth`](https://huggingface.co/Sungyeon/GENIUS/blob/main/checkpoint/GENIUS_t5small.pth)
- **Description**: T5-based sequence-to-sequence model that generates target IDs for retrieval
- **Size**: ~500 MB

### Stage 0: CLIP-SF Model
- **Model**: [`clip_sf_large.pth`](https://huggingface.co/TIGER-Lab/UniIR/blob/main/checkpoint/CLIP_SF/clip_sf_large.pth)
- **Source**: [TIGER-Lab/UniIR](https://huggingface.co/TIGER-Lab/UniIR)
- **Description**: Score-fusion CLIP model for multimodal feature extraction

```bash
# Clone the checkpoint repository
git clone https://huggingface.co/Sungyeon/GENIUS

# Download the CLIP-SF model
wget https://huggingface.co/TIGER-Lab/UniIR/resolve/main/checkpoint/CLIP_SF/clip_sf_large.pth -O checkpoint/CLIP_SF/clip_sf_large.pth
```

> Note: All three models are required for full functionality. The CLIP-SF model is used for feature extraction, the Residual Quantization model for ID encoding, and the Generator model for retrieval.
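
A minimal loading sketch, assuming the files are standard PyTorch checkpoints and the clone/wget commands above were run from the project root; see the repo's training and eval scripts for the model classes these weights feed into:

```python
# Load the three checkpoints (paths assume the commands above were run as-is).
import torch

clip_sf = torch.load("checkpoint/CLIP_SF/clip_sf_large.pth", map_location="cpu")
rq = torch.load("GENIUS/checkpoint/rq_clip_large.pth", map_location="cpu")
generator = torch.load("GENIUS/checkpoint/GENIUS_t5small.pth", map_location="cpu")
print(type(clip_sf), type(rq), type(generator))  # typically dict-like state objects
```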

## 📈 Performance

> Results in parentheses are scores from our reimplemented checkpoints, as the originals were lost during a server migration. They are close to the paper's numbers, but slight variations may occur due to randomness in retraining.

### Universal Information Retrieval

| Task | Dataset | CLIP_SF | BLIP_FF | GENIUS (checkpoint) | GENIUSᴿ (checkpoint) |
|:-----|:--------|:-------:|:-------:|:-------------------:|:--------------------:|
| **T→I** | VisualNews | 42.6 | 23.0 | 18.5 (18.5) | 27.3 (27.3) |
| | MSCOCO | 77.9 | 75.6 | 55.1 (55.3) | 68.0 (68.0) |
| | Fashion200K | 17.8 | 25.4 | 13.7 (14.0) | 16.2 (15.9) |
| **T→T** | WebQA | 84.7 | 79.5 | 31.1 (31.9) | 42.9 (43.6) |
| **T→(I,T)** | EDIS | 59.4 | 50.3 | 36.6 (37.0) | 44.1 (44.1) |
| | WebQA | 78.8 | 79.7 | 49.0 (49.0) | 59.7 (59.3) |
| **I→T** | VisualNews | 42.8 | 21.1 | 18.4 (18.2) | 26.8 (26.8) |
| | MSCOCO | 92.3 | 88.8 | 82.7 (83.0) | 90.6 (90.7) |
| | Fashion200K | 17.9 | 27.6 | 12.8 (12.9) | 16.2 (16.6) |
| **I→I** | NIGHTS | 33.4 | 33.0 | 8.1 (8.1) | 30.2 (30.0) |
| | OVEN | 39.2 | 34.7 | 34.6 (34.5) | 38.0 (38.0) |
| **(I,T)→T** | InfoSeek | 24.0 | 19.7 | 10.4 (10.5) | 18.0 (18.0) |
| **(I,T)→I** | FashionIQ | 26.2 | 28.5 | 13.1 (13.1) | 19.2 (19.3) |
| | CIRR | 43.0 | 51.4 | 20.1 (20.1) | 38.3 (38.1) |
| **(I,T)→(I,T)** | OVEN | 60.2 | 57.8 | 36.5 (36.6) | 48.6 (48.3) |
| | InfoSeek | 44.6 | 27.7 | 14.2 (14.3) | 28.6 (28.7) |

### ⚡ Efficiency

As the candidate pool grows, embedding‐based retrieval (e.g., CLIP + nearest‐neighbor search) slows down dramatically. In contrast, GENIUS's discrete ID generation is **nearly constant time**, independent of pool size (see the sketch below). Empirically, GENIUS is roughly **4× faster** than competing generative methods such as GRACE.

<p align="center"><img src="misc/efficiency.png" alt="GENIUS efficiency comparison" width="50%"></p>
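
The intuition is easy to state in code: candidate IDs are indexed once offline, so matching a generated ID is a hash lookup whose cost does not grow with the pool. A toy illustration (not the repository's retrieval code):

```python
# Offline: quantize every candidate once and index it by its discrete ID.
candidate_index = {
    (0, 113, 42, 7): "image_001.jpg",
    (0, 113, 7, 19): "image_002.jpg",
    (1, 88, 5, 3): "doc_017.txt",
}

# Online: the generator emits an ID; matching it is O(1) in pool size,
# unlike a nearest-neighbor scan over all candidate embeddings.
generated_id = (0, 113, 42, 7)
print(candidate_index.get(generated_id))  # -> image_001.jpg
```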

## Citation

If you find this work useful, please cite:

```bibtex
@article{kim2024genius,
  title={GENIUS: A Generative Framework for Universal Multimodal Search},
  author={Kim, Sungyeon and Zhu, Xinliang and Lin, Xiaofan and Bastan, Muhammet and Gray, Douglas and Kwak, Suha},
  journal={arXiv preprint arXiv:2503.19868},
  year={2025}
}
```

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## Acknowledgments & References

### Codebases
Our implementation builds upon and modifies these great repositories:
- [UniIR](https://github.com/TIGER-AI-Lab/UniIR) - Base framework for multimodal retrieval
- [GENRE](https://github.com/facebookresearch/GENRE) - Trie structure implementation
- [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch) - Vector quantization implementation
- [CLIP4CIR](https://github.com/ABaldrati/CLIP4Cir) - Combining module that integrates image and text features

### Related Papers
- [UniIR: Training and Benchmarking Universal Multimodal Information Retrievers](https://arxiv.org/pdf/2311.17136)
- [Recommender Systems with Generative Retrieval](https://arxiv.org/pdf/2305.05065)
- [GRACE: Generative Cross-Modal Retrieval](https://arxiv.org/pdf/2402.10805)
- [IRGen: Generative Modeling for Image Retrieval](https://arxiv.org/pdf/2303.10126)