---
license: apache-2.0
task_categories:
- video-retrieval
- image-retrieval
tags:
- composed-video-retrieval
- composed-image-retrieval
- vision-language
- pytorch
- icassp-2026
---
# 🎬 (ICASSP 2026) RELATE: Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration (Model Weights)
This repository hosts the official pre-trained model weights for RELATE, a minimal-redundancy hierarchical collaborative network designed to enhance both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) tasks.
## Model Information
### 1. Model Name
RELATE (Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration) Checkpoints.
### 2. Task Type & Applicable Tasks
- Task Type: Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
- Applicable Tasks: Retrieving target videos or images from a reference visual input plus a modification text (see the illustrative sample below). RELATE targets two weaknesses of prior methods: they neglect the internal hierarchical structure of modification texts and insufficiently suppress temporal redundancy in videos.
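For concreteness, a single composed-retrieval sample pairs a reference visual with a modification text and is scored against candidate targets. The structure below is purely illustrative; the field names are hypothetical, not the actual dataset schema:

```python
# Illustrative composed-video-retrieval triplet (hypothetical field names,
# not the WebVid-CoVR schema):
sample = {
    "reference_video": "videos/dog_sitting.mp4",     # reference visual input
    "modification_text": "the dog runs on a beach",  # requested change
    "target_video": "videos/dog_running_beach.mp4",  # video to retrieve
}
```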
### 3. Project Introduction
RELATE is an open-source PyTorch framework built on top of BLIP-2. It achieves state-of-the-art (SOTA) performance across major benchmarks through three key innovations (a toy sketch of the second follows this list):
- 🧩 Hierarchical Query Generation: Parses the internal hierarchical structure of the modification text to understand the roles of different parts of speech, using noun phrases for object-level features and the complete text for global semantics.
- ✂️ Temporal Sparsification: Adaptively attenuates redundant tokens corresponding to static backgrounds while amplifying tokens that carry critical dynamic information.
- 🎯 Modification-Driven Modulation Learning: Leverages the global semantics of the modification text to perform attention-based filtering on the sparsified visual features.
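To make the temporal sparsification idea concrete, here is a minimal PyTorch sketch written for this card, not taken from the official code: a learned per-token saliency score softly attenuates redundant (static-background) tokens while preserving dynamic ones.

```python
import torch
import torch.nn as nn

class TemporalSparsification(nn.Module):
    """Toy sketch of saliency-based token attenuation (not the RELATE code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token saliency logit

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) visual tokens across frames
        saliency = torch.sigmoid(self.scorer(tokens))  # (batch, num_tokens, 1)
        # Soft sparsification: static-background tokens get scores near 0,
        # dynamic tokens near 1.
        return tokens * saliency

# Example: 2 videos, 32 visual tokens each, 256-dim features
out = TemporalSparsification(256)(torch.randn(2, 32, 256))
print(out.shape)  # torch.Size([2, 32, 256])
```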
### 4. Training Data Source & Hosted Weights
RELATE is evaluated on the standard video and image retrieval benchmarks. This repository provides pre-trained weights for the following datasets:
- CVR: WebVid-CoVR dataset.
- CIR: FashionIQ and CIRR datasets.
(Note: Please download the respective `.ckpt` or `.pt` files hosted in the "Files and versions" tab of this Hugging Face repository, either manually or programmatically as sketched below.)
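A checkpoint can also be fetched with the `huggingface_hub` client; the repository ID and filename below are placeholders to be replaced with the real values shown on this page:

```python
from huggingface_hub import hf_hub_download

# Placeholders: copy the actual repo ID and checkpoint filename from the
# "Files and versions" tab of this repository.
ckpt_path = hf_hub_download(
    repo_id="<org>/<this-repo>",
    filename="<checkpoint-name>.ckpt",
)
print(ckpt_path)  # local path to pass as model.ckpt_path in Step 3
```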
## Usage & Basic Inference
These weights are intended to be loaded and evaluated with the official Hydra-configured RELATE GitHub repository.
### Step 1: Prepare the Environment
We recommend using Anaconda to manage your environment. Clone the repository and install the required dependencies:

```bash
git clone https://github.com/iLearn-Lab/ICASSP26-RELATE
cd ICASSP26-RELATE
conda create -n relate python=3.8 -y
conda activate relate
# Pin PyTorch 2.1.0 to match the tested environment (see Limitations & Notes)
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
### Step 2: Download Model Weights
Download the required checkpoints from this repository and place them in your local workspace. Ensure your dataset paths are correctly configured in `configs/machine/default.yaml`; a hypothetical sketch of those entries follows.
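As an illustration, the relevant entries in `configs/machine/default.yaml` might look like the sketch below. The exact keys are defined by the repository, so keep the keys the file actually contains and only adjust the paths:

```yaml
# Hypothetical sketch -- keep the keys the repository's file actually defines.
paths:
  datasets: /data/webvid-covr       # root directory of your datasets
  checkpoints: /data/relate/ckpts   # where the downloaded .ckpt files live
```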
### Step 3: Run Evaluation
To evaluate a trained model, run `test.py` and specify the target benchmark and your checkpoint path via Hydra overrides:

```bash
python test.py \
  model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
  +test=webvid-covr  # or fashioniq / cirr-all
```
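Before launching a full evaluation run, a quick sanity check with plain PyTorch can confirm the checkpoint downloaded intact; the exact key layout depends on how the weights were saved, so this only inspects the file:

```python
import torch

# Path is a placeholder for your downloaded checkpoint.
ckpt = torch.load("/path/to/your/downloaded_checkpoint.ckpt", map_location="cpu")

# Lightning-style checkpoints usually nest weights under "state_dict";
# fall back to the raw dict for plain .pt weight files.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} top-level entries loaded")
```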
## ⚠️ Limitations & Notes
- Configuration: The entire framework is managed by Hydra and Lightning Fabric. Adjust hyperparameter overrides or modify the YAML files in the `configs/` directory to suit your specific local setup (see the example below).
- Environment Dependency: This project was developed and extensively tested with Python 3.8 and PyTorch 2.1.0.
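For example, most settings can be changed on the command line using standard Hydra override syntax instead of editing the YAML files. Apart from `model.ckpt_path` and `+test` (used in Step 3), the key below is hypothetical; check the `configs/` directory for the options your setup actually exposes:

```bash
# "data.batch_size" is a hypothetical key -- see the configs/ YAML files for real ones.
python test.py \
  model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
  +test=webvid-covr \
  data.batch_size=16
```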
## Citation
If you find our framework, code, or these weights useful in your research, please consider leaving a Star ⭐️ on our GitHub repository and citing our ICASSP 2026 paper:
```bibtex
@inproceedings{RELATE,
  title={RELATE: Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration},
  author={Zhang, Shiqi and Chen, Zhiwei and Li, Zixu and Fu, Zhiheng and Wang, Wenbo and Nie, Jiajia and Wei, Yinwei and Hu, Yupeng},
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
```