Model Card for ReT-OpenCLIP-ViT-H-14

ReT is a multimodal document retrieval method that supports images and text in both queries and documents. Unlike existing methods, which use features only from the final layer of vision-and-language backbones, ReT employs a Transformer-based recurrent cell to leverage multi-level representations from different layers of both the visual and the textual backbone. The cell uses sigmoidal gates, inspired by the LSTM design, that selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently and produces sets of latent tokens that are used for fine-grained late-interaction similarity computation.

Because ReT handles images and text on both sides, it has been trained and evaluated on a custom version of the challenging M2KR benchmark with two modifications: MSMARCO is excluded because it does not contain images, and the documents from OVEN, InfoSeek, E-VQA, and OKVQA are enriched with images.
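The gated, layer-by-layer fusion described above can be illustrated with a small PyTorch sketch. This is not the ReT implementation: the class name `GatedLayerFusion`, the use of a single cross-attention block, the gate parameterisation, and all dimensions are assumptions chosen only to show how sigmoidal gates can blend a running set of latent tokens with features from successive backbone layers.

```python
import torch
import torch.nn as nn


class GatedLayerFusion(nn.Module):
    """Hypothetical, simplified recurrent cell (not the ReT code)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Latent tokens attend to the features of the current backbone layer.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # LSTM-style gates, computed from the current state (an assumption).
        self.forget_gate = nn.Linear(dim, dim)
        self.input_gate = nn.Linear(dim, dim)

    def forward(self, state: torch.Tensor, layer_feats: torch.Tensor) -> torch.Tensor:
        # state: [B, K, D] latent tokens; layer_feats: [B, N, D] layer features.
        update, _ = self.attn(state, layer_feats, layer_feats)
        f = torch.sigmoid(self.forget_gate(state))  # how much old state to keep
        i = torch.sigmoid(self.input_gate(state))   # how much new information to admit
        return f * state + i * update


torch.manual_seed(0)
cell = GatedLayerFusion(dim=128)
state = torch.zeros(1, 32, 128)  # 32 latent tokens of width 128
for _ in range(3):  # stand-in for three backbone layers
    layer_feats = torch.randn(1, 50, 128)  # random stand-in for real layer features
    state = cell(state, layer_feats)
print(state.shape)  # torch.Size([1, 32, 128])
```

In this sketch the same cell is reused across layers, so the latent tokens act as a recurrent state; per the description above, ReT additionally merges visual and textual features, which the sketch omits.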

Use with Transformers

Follow the instructions on the repository to install the required environment.

from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-OpenCLIP-ViT-H-14', device_map=device)

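# QUERY: encode a (text, image) query with the query-side model
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'  # path to the query image

# Each input is a [text, image_path] pair; the output is one set of latent tokens per input
ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])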

# PASSAGE: encode a document with the passage-side model
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''  # empty string: this passage is passed without an image

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
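
Once query and passage features have been extracted as above, they can be compared with a late-interaction score. The following is a minimal sketch assuming a ColBERT-style MaxSim (for each query token, take the maximum dot product over passage tokens, then sum over query tokens); the function name `late_interaction_scores` is hypothetical, random tensors stand in for real features, and the scoring function in the repository may differ, for example in normalisation.

```python
import torch


def late_interaction_scores(query_feats: torch.Tensor,
                            passage_feats: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction (assumed scoring rule, not taken from the ReT code).

    query_feats:   [Nq, Tq, D] latent tokens for Nq queries
    passage_feats: [Np, Tp, D] latent tokens for Np passages
    returns:       [Nq, Np] relevance scores
    """
    # Token-level dot products for every query/passage pair: [Nq, Np, Tq, Tp]
    sim = torch.einsum('qid,pjd->qpij', query_feats, passage_feats)
    # Best-matching passage token per query token, summed over query tokens
    return sim.max(dim=-1).values.sum(dim=-1)


torch.manual_seed(0)
query_feats = torch.randn(1, 32, 128)    # stand-in for the query features
passage_feats = torch.randn(3, 32, 128)  # stand-in for three passages

scores = late_interaction_scores(query_feats, passage_feats)
print(scores.shape)  # torch.Size([1, 3])
ranking = scores.argsort(dim=-1, descending=True)  # passage indices, best first
```

With real features, `ranking[0]` would list the candidate passages for the first query in decreasing order of score.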

Citation

BibTeX:

@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
Model size: 1.09B params (Safetensors, tensor type F32)