---
license: apache-2.0
task_categories:
- video-retrieval
- image-retrieval
tags:
- composed-video-retrieval
- composed-image-retrieval
- multimodal-retrieval
- vision-language
- pytorch
- acm-mm-2025
---

<a id="top"></a>
<div align="center">
<h1>📹 (ACM MM 2025) HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval (Model Weights)</h1>
<div align="center">
<a target="_blank" href="https://zivchen-ty.github.io/">Zhiwei&#160;Chen</a><sup>1</sup>,
<a target="_blank" href="https://faculty.sdu.edu.cn/huyupeng1/zh_CN/index.htm">Yupeng&#160;Hu</a><sup>1&#9993;</sup>,
<a target="_blank" href="https://lee-zixu.github.io/">Zixu&#160;Li</a><sup>1</sup>,
<a target="_blank" href="https://zhihfu.github.io/">Zhiheng&#160;Fu</a><sup>1</sup>,
<a target="_blank" href="https://haokunwen.github.io">Haokun&#160;Wen</a><sup>2</sup>,
<a target="_blank" href="https://homepage.hit.edu.cn/guanweili">Weili&#160;Guan</a><sup>2</sup>
</div>
<sup>1</sup>School of Software, Shandong University
<br />
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
<br />
<sup>&#9993;</sup>Corresponding author
<br />
<p>
<a href="https://acmmm2025.org/"><img src="https://img.shields.io/badge/ACM_MM-2025-blue.svg?style=flat-square" alt="ACM MM 2025"></a>
<a href="https://doi.org/10.1145/3746027.3755445"><img alt='Paper' src="https://img.shields.io/badge/Paper-dl.acm-green.svg?style=flat-square"></a>
<a href="https://zivchen-ty.github.io/HUD.github.io/"><img alt='Project Page' src="https://img.shields.io/badge/Website-orange?style=flat-square"></a>
<a href="https://github.com/ZivChen-Ty/HUD"><img alt='GitHub' src="https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github"></a>
</p>
</div>

This repository hosts the official pre-trained model weights for **HUD**, a framework that tackles both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) by explicitly leveraging the disparity in information density between modalities.

---

## 📌 Model Information

### 1. Model Name
**HUD** (Hierarchical Uncertainty-Aware Disambiguation Network) checkpoints.

### 2. Task Type & Applicable Tasks
- **Task Type:** Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
- **Applicable Tasks:** Retrieving a target video or image from a reference visual input plus a text modifier. HUD is built to resolve ambiguity about which subject a modification refers to and to sharpen focus on fine-grained semantic details.

### 3. Project Introduction
**HUD** is the first framework that explicitly leverages the disparity in information density between video and text. It achieves state-of-the-art (SOTA) performance through three key modules:
- 🎯 **Holistic Pronoun Disambiguation:** Exploits overlapping semantics through holistic cross-modal interaction to indirectly disambiguate pronoun referents.
- 🔍 **Atomistic Uncertainty Modeling:** Discerns key detail semantics via uncertainty modeling at the atomistic level, enhancing focus on fine-grained visual details.
- ⚖️ **Holistic-to-Atomistic Alignment:** Adaptively aligns the composed query representation with the target media by incorporating a learnable similarity bias.

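For intuition on the third bullet, here is a deliberately tiny, dependency-free sketch of a similarity bias. In HUD itself the bias is a parameter learned jointly with the network; here it is just a fixed float, and both function names are illustrative, not part of the HUD codebase:

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two plain-Python vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def biased_similarity(query, target, bias):
    # Illustrative stand-in for a learnable similarity bias:
    # shift the raw query-target similarity by a scalar offset.
    return cosine_similarity(query, target) + bias
```

With `bias = 0.1`, two orthogonal vectors score 0.1 instead of 0.0; a learned version of this offset lets the model adjust the operating point of the alignment rather than relying on raw similarity alone.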
### 4. Training Data Source & Hosted Weights
The HUD framework supports both video and image retrieval benchmarks. This repository provides pre-trained checkpoints evaluated on the following datasets:
* **CVR:** WebVid-CoVR dataset.
* **CIR:** FashionIQ and CIRR datasets.

*(Note: download the respective `.ckpt` files from the "Files and versions" tab of this repository.)*

---

## 🚀 Usage & Basic Inference

These weights are designed to be evaluated with the highly modular, Hydra-configured [HUD GitHub repository](https://github.com/ZivChen-Ty/HUD).

### Step 1: Prepare the Environment
We recommend using Anaconda. Clone the repository and install the dependencies:
```bash
git clone https://github.com/iLearn-Lab/MM25-HUD
cd MM25-HUD
conda create -n hud python=3.8.10 -y
conda activate hud
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

### Step 2: Download Model Weights
Download the specific checkpoints from this Hugging Face repository and place them in your local directory. Ensure your dataset paths are correctly configured in `configs/machine/default.yaml`.

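Before wiring paths into the config, a quick sanity check that the downloaded weights actually arrived can save a failed run. This is a standard-library sketch, not part of the HUD repository; the `checkpoints` directory name and the `find_checkpoints` helper are our own assumptions, so point it wherever you placed the files:

```python
from pathlib import Path

def find_checkpoints(directory):
    """Return all non-empty .ckpt files in `directory`, sorted by name."""
    return sorted(p for p in Path(directory).glob("*.ckpt")
                  if p.stat().st_size > 0)

if __name__ == "__main__":
    # List each downloaded checkpoint with its size in megabytes.
    for ckpt in find_checkpoints("checkpoints"):
        print(f"{ckpt.name}: {ckpt.stat().st_size / 1e6:.1f} MB")
```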
### Step 3: Run Evaluation
To evaluate a trained model, run `test.py` and specify the target benchmark and checkpoint path via Hydra overrides:
```bash
python3 test.py \
    model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
    +test=webvid-covr  # or fashioniq / cirr-all
```

---

## ⚠️ Limitations & Notes

- **Configuration:** HUD is entirely managed by **Hydra** and **Lightning Fabric**. Override configurations via the CLI or modify the YAML files in the `configs/` directory as needed.
- **Hardware & Environment:** The project was developed and tested with Python 3.8.10, PyTorch 2.1.0, and an NVIDIA A40 48 GB GPU. Significantly different environment settings may affect reproducibility.

---

## 📝⭐️ Citation

If you find our framework, code, or these weights useful in your research, please consider leaving a **Star** ⭐️ on our GitHub repository and citing our ACM MM 2025 paper:

```bibtex
@inproceedings{HUD,
  title     = {HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval},
  author    = {Chen, Zhiwei and Hu, Yupeng and Li, Zixu and Fu, Zhiheng and Wen, Haokun and Guan, Weili},
  booktitle = {Proceedings of the ACM International Conference on Multimedia},
  pages     = {6143--6152},
  year      = {2025}
}
```