---
license: apache-2.0
task_categories:
- video-retrieval
- image-retrieval
tags:
- composed-video-retrieval
- composed-image-retrieval
- vision-language
- pytorch
- icassp-2026
---
# 🎬 (ICASSP 2026) RELATE: Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration (Model Weights)
This repository hosts the official pre-trained model weights for RELATE, a minimal-redundancy hierarchical collaborative network designed to enhance both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) tasks.
## Model Information
### 1. Model Name
RELATE (Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration) Checkpoints.
### 2. Task Type & Applicable Tasks
- Task Type: Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
- Applicable Tasks: Retrieving target videos or images from a reference visual input plus a modification text (see the illustrative sample below). RELATE targets two weaknesses of prior methods: they neglect the internal hierarchical structure of modification texts and insufficiently suppress temporal redundancy in videos.
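For concreteness, a single composed-retrieval sample pairs a reference visual with a modification text and is scored against candidate targets. The structure below is purely illustrative; the field names are hypothetical, not the actual dataset schema:

```python
# Illustrative composed-video-retrieval triplet (hypothetical field names,
# not the WebVid-CoVR schema):
sample = {
    "reference_video": "videos/dog_sitting.mp4",     # reference visual input
    "modification_text": "the dog runs on a beach",  # requested change
    "target_video": "videos/dog_running_beach.mp4",  # video to retrieve
}
```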
### 3. Project Introduction
RELATE is an open-source PyTorch framework built on top of BLIP-2. It achieves state-of-the-art (SOTA) performance across major benchmarks through three key innovations (a toy sketch of the second follows this list):
- 🧩 Hierarchical Query Generation: Parses the internal hierarchical structure of the modification text to understand the roles of different parts of speech, using noun phrases for object-level features and the complete text for global semantics.
- ✂️ Temporal Sparsification: Adaptively attenuates redundant tokens corresponding to static backgrounds while amplifying tokens that carry critical dynamic information.
- 🎯 Modification-Driven Modulation Learning: Leverages the global semantics of the modification text to perform attention-based filtering on the sparsified visual features.
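To make the temporal sparsification idea concrete, here is a minimal PyTorch sketch written for this card, not taken from the official code: a learned per-token saliency score softly attenuates redundant (static-background) tokens while preserving dynamic ones.

```python
import torch
import torch.nn as nn

class TemporalSparsification(nn.Module):
    """Toy sketch of saliency-based token attenuation (not the RELATE code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token saliency logit

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) visual tokens across frames
        saliency = torch.sigmoid(self.scorer(tokens))  # (batch, num_tokens, 1)
        # Soft sparsification: static-background tokens get scores near 0,
        # dynamic tokens near 1.
        return tokens * saliency

# Example: 2 videos, 32 visual tokens each, 256-dim features
out = TemporalSparsification(256)(torch.randn(2, 32, 256))
print(out.shape)  # torch.Size([2, 32, 256])
```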
### 4. Training Data Source & Hosted Weights
RELATE is evaluated on the standard video and image retrieval benchmarks. This repository provides pre-trained weights for the following datasets:
- CVR: WebVid-CoVR dataset.
- CIR: FashionIQ and CIRR datasets.
(Note: Please download the respective `.ckpt` or `.pt` files hosted in the "Files and versions" tab of this Hugging Face repository, either manually or programmatically as sketched below.)
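A checkpoint can also be fetched with the `huggingface_hub` client; the repository ID and filename below are placeholders to be replaced with the real values shown on this page:

```python
from huggingface_hub import hf_hub_download

# Placeholders: copy the actual repo ID and checkpoint filename from the
# "Files and versions" tab of this repository.
ckpt_path = hf_hub_download(
    repo_id="<org>/<this-repo>",
    filename="<checkpoint-name>.ckpt",
)
print(ckpt_path)  # local path to pass as model.ckpt_path in Step 3
```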
## Usage & Basic Inference
These weights are intended to be loaded and evaluated with the official Hydra-configured RELATE GitHub repository.
### Step 1: Prepare the Environment
We recommend using Anaconda to manage your environment. Clone the repository and install the required dependencies:

```bash
git clone https://github.com/iLearn-Lab/ICASSP26-RELATE
cd ICASSP26-RELATE
conda create -n relate python=3.8 -y
conda activate relate
# Pin PyTorch 2.1.0 to match the tested environment (see Limitations & Notes)
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
### Step 2: Download Model Weights
Download the required checkpoints from this repository and place them in your local workspace. Ensure your dataset paths are correctly configured in `configs/machine/default.yaml`; a hypothetical sketch of those entries follows.
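As an illustration, the relevant entries in `configs/machine/default.yaml` might look like the sketch below. The exact keys are defined by the repository, so keep the keys the file actually contains and only adjust the paths:

```yaml
# Hypothetical sketch -- keep the keys the repository's file actually defines.
paths:
  datasets: /data/webvid-covr       # root directory of your datasets
  checkpoints: /data/relate/ckpts   # where the downloaded .ckpt files live
```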
### Step 3: Run Evaluation
To evaluate a trained model, run `test.py` and specify the target benchmark and your checkpoint path via Hydra overrides:

```bash
python test.py \
  model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
  +test=webvid-covr  # or fashioniq / cirr-all
```
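Before launching a full evaluation run, a quick sanity check with plain PyTorch can confirm the checkpoint downloaded intact; the exact key layout depends on how the weights were saved, so this only inspects the file:

```python
import torch

# Path is a placeholder for your downloaded checkpoint.
ckpt = torch.load("/path/to/your/downloaded_checkpoint.ckpt", map_location="cpu")

# Lightning-style checkpoints usually nest weights under "state_dict";
# fall back to the raw dict for plain .pt weight files.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} top-level entries loaded")
```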
## ⚠️ Limitations & Notes
- Configuration: The entire framework is managed by Hydra and Lightning Fabric. Adjust hyperparameter overrides or modify the YAML files in the `configs/` directory to suit your specific local setup (see the example below).
- Environment Dependency: This project was developed and extensively tested with Python 3.8 and PyTorch 2.1.0.
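For example, most settings can be changed on the command line using standard Hydra override syntax instead of editing the YAML files. Apart from `model.ckpt_path` and `+test` (used in Step 3), the key below is hypothetical; check the `configs/` directory for the options your setup actually exposes:

```bash
# "data.batch_size" is a hypothetical key -- see the configs/ YAML files for real ones.
python test.py \
  model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
  +test=webvid-covr \
  data.batch_size=16
```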
## Citation
If you find our framework, code, or these weights useful in your research, please consider leaving a Star ⭐️ on our GitHub repository and citing our ICASSP 2026 paper:
```bibtex
@inproceedings{RELATE,
  title={RELATE: Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration},
  author={Zhang, Shiqi and Chen, Zhiwei and Li, Zixu and Fu, Zhiheng and Wang, Wenbo and Nie, Jiajia and Wei, Yinwei and Hu, Yupeng},
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
```