arxiv:2308.01236

Grounded Image Text Matching with Mismatched Relation Reasoning

Published on Aug 2, 2023

Authors:

Abstract

This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating pre-trained models on this task, with a focus on the challenging settings of limited data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained models lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in both length generalization and data efficiency.
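The two-stage output that the task description implies can be pictured with a small data structure. This is only an illustrative sketch and not the authors' code or data format: the names `GITMMRPrediction`, `grounded_objects`, `mismatched_span`, and `is_well_formed`, the `(x1, y1, x2, y2)` box convention, and the token-span encoding of the mismatched text are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GITMMRPrediction:
    """Hypothetical container for one GITM-MR prediction."""

    # Stage 1: does the expression describe the image?
    is_match: bool
    # Stage 2a (match): each referred phrase mapped to an image region.
    grounded_objects: Dict[str, Box] = field(default_factory=dict)
    # Stage 2b (mismatch): token span (start, end) of the mismatched text.
    mismatched_span: Optional[Tuple[int, int]] = None


def is_well_formed(pred: GITMMRPrediction) -> bool:
    """Check that exactly the branch implied by is_match is filled in."""
    if pred.is_match:
        return bool(pred.grounded_objects) and pred.mismatched_span is None
    return pred.mismatched_span is not None and not pred.grounded_objects


# A matched expression yields object boxes; a mismatched one yields a text span.
matched = GITMMRPrediction(True, {"the dog": (10.0, 20.0, 110.0, 220.0)})
mismatched = GITMMRPrediction(False, mismatched_span=(2, 4))
```

Keeping the two branches in one record makes the dependency explicit: the grounding target (image regions or text span) is decided by the matching decision made first.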
