arxiv:2303.08998

Unified Visual Relationship Detection with Vision and Language Models

Published on Mar 16, 2023

Abstract

This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model. Our code will be made publicly available on GitHub.

Community

Proposes Unified Visual Relationship Detection (UniVRD) using vision-language models (VLMs), trained across diverse datasets with heterogeneous label spaces; the problem is a subset of scene understanding. Bottom-up framework containing an object detector and a pair-wise relationship decoder in a cascaded manner: fine-tune the VLM encoders by adding heads to the transformer-encoder outputs (targeted at object detection), then append a transformer decoder (for pair-wise object relationships), optimized as set prediction (Hungarian matching). Given an image, the model outputs a set of <subject, predicate (verb/action), object> triplets, with the subject and object localized by bounding boxes.

Object detector (DETR-like): one object instance is decoded directly from each image token; bounding boxes come from an FFN; the output is bounding boxes and instance embeddings (one per patch/token in the sequence). Relationship decoder: takes relation queries (pre-learned) and the instance embeddings from the object detector, passes them through a transformer decoder, and produces relation embeddings; the decoder is modified along the lines of the Perceiver Resampler (keys and values are computed from the concatenation of the learned queries with the instance embeddings). A linear layer over the relation embeddings handles classification, and an FFN produces subject and object embeddings for locating the bounding boxes; token indices (for bounding-box lookup) are retrieved by argmax similarity of the subject and object embeddings against all instance embeddings from the object detector (see the first sketch below).

Text embeddings are generated from prompts, with separate formats for object descriptions (object detection) and relations (built from relationship triplets). Mosaic data augmentation provides varying image scales and mixes samples from object-detection and visual-relationship-detection datasets in the same batch. The continuous tense of each predicate is generated with a Python NLP library; "and" is used as a filler when there is no predicate.

Losses: the classification loss is a focal sigmoid cross-entropy whose predictions are similarities of instance embeddings with text queries; the bounding-box loss is a linear combination of L1 and generalized IoU; their combination, defined over all image tokens, gives the Hungarian loss for object detection (second sketch below). The relationship decoder uses a focal softmax cross-entropy loss (it predicts indices, and the subject and object indices are one-hot) plus classification over a sampled subset of relations (classifying the visual relation); the combination gives the Hungarian loss for visual relationship decoding.

Inference: the object detector gives bounding boxes, the relation decoder gives relation embeddings and subject/object indices, and combining them yields triplets; retrieval uses per-class pair-wise non-maximum suppression (PNMS) after top-k retrieval by similarity with a given text embedding (third sketch below). This also allows one-shot transfer (the model can even be queried in the image modality).

Experiments: human-object interaction (HOI) detection on HICO-DET and V-COCO; Visual Genome (VG) for scene-graph generation (SGG); COCO and Objects365 are also used to improve object detection. CLIP and LiT VLMs are compared (the latter is used). SOTA on HICO-DET compared with all bottom-up methods, even beating single-stage methods. Better than general methods on SGG, though DT2-ACBS (a task-specific method) is stronger. The unified model (trained on all datasets) does better at HOI with little loss on VG (SGG). Implemented in JAX as part of the Scenic library. The appendix has architecture details, training of dataset-specific and unified models, training ablations, and V-COCO results. From Google.
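The index-retrieval step can be made concrete with a minimal sketch (not the authors' code): the function `ground_relations`, its shapes, and the use of cosine similarity are assumptions, but it shows how subject and object embeddings from the relation decoder could be matched to detector instances by argmax similarity over the instance embeddings.

```python
import jax
import jax.numpy as jnp


def l2_normalize(x, eps=1e-6):
    return x / (jnp.linalg.norm(x, axis=-1, keepdims=True) + eps)


def ground_relations(instance_emb, instance_boxes, subj_emb, obj_emb):
    """Map relation-level embeddings back to detector instances.

    instance_emb:   [num_tokens, d]    per-token instance embeddings (object detector)
    instance_boxes: [num_tokens, 4]    one predicted box per image token
    subj_emb:       [num_relations, d] subject embeddings from the relation decoder
    obj_emb:        [num_relations, d] object embeddings from the relation decoder
    """
    inst = l2_normalize(instance_emb)
    # Cosine similarity of every relation query against every instance token.
    subj_sim = l2_normalize(subj_emb) @ inst.T  # [num_relations, num_tokens]
    obj_sim = l2_normalize(obj_emb) @ inst.T
    subj_idx = jnp.argmax(subj_sim, axis=-1)    # index of the best-matching instance
    obj_idx = jnp.argmax(obj_sim, axis=-1)
    # Look up the boxes of the matched instances to localize <subject, object>.
    return instance_boxes[subj_idx], instance_boxes[obj_idx], subj_idx, obj_idx


if __name__ == "__main__":
    k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
    inst = jax.random.normal(k1, (576, 256))    # e.g. 24x24 image tokens, d=256
    boxes = jax.random.uniform(k2, (576, 4))
    subj = jax.random.normal(k3, (100, 256))    # 100 relation queries
    obj = jax.random.normal(k4, (100, 256))
    sb, ob, si, oi = ground_relations(inst, boxes, subj, obj)
    print(sb.shape, ob.shape, si.shape, oi.shape)  # (100, 4) (100, 4) (100,) (100,)
```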
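A hedged sketch of the loss ingredients mentioned above: focal sigmoid cross-entropy over instance-embedding/text-query similarities, plus an L1 + generalized-IoU box loss. The function names, loss weights, and the (cx, cy, w, h) box format are assumptions, and the Hungarian matching that pairs predictions with targets is omitted.

```python
import jax
import jax.numpy as jnp


def focal_sigmoid_ce(logits, targets, alpha=0.25, gamma=2.0):
    """Focal binary cross-entropy; logits are instance-embedding/text-query similarities."""
    p = jax.nn.sigmoid(logits)
    ce = -(targets * jnp.log(p + 1e-8) + (1.0 - targets) * jnp.log(1.0 - p + 1e-8))
    p_t = targets * p + (1.0 - targets) * (1.0 - p)
    alpha_t = targets * alpha + (1.0 - targets) * (1.0 - alpha)
    return alpha_t * (1.0 - p_t) ** gamma * ce


def cxcywh_to_xyxy(b):
    cx, cy, w, h = jnp.split(b, 4, axis=-1)
    return jnp.concatenate([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)


def generalized_iou(pred, target):
    """Generalized IoU for boxes in (x1, y1, x2, y2) format, shape [..., 4]."""
    lt = jnp.maximum(pred[..., :2], target[..., :2])
    rb = jnp.minimum(pred[..., 2:], target[..., 2:])
    wh = jnp.maximum(rb - lt, 0.0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / (union + 1e-6)
    # Smallest enclosing box for the GIoU penalty term.
    wh_c = jnp.maximum(
        jnp.maximum(pred[..., 2:], target[..., 2:])
        - jnp.minimum(pred[..., :2], target[..., :2]),
        0.0,
    )
    area_c = wh_c[..., 0] * wh_c[..., 1]
    return iou - (area_c - union) / (area_c + 1e-6)


def box_loss(pred_cxcywh, target_cxcywh, l1_weight=1.0, giou_weight=1.0):
    """Weighted sum of L1 and (1 - GIoU) between matched predicted and target boxes."""
    l1 = jnp.abs(pred_cxcywh - target_cxcywh).sum(axis=-1)
    giou = generalized_iou(cxcywh_to_xyxy(pred_cxcywh), cxcywh_to_xyxy(target_cxcywh))
    return l1_weight * l1 + giou_weight * (1.0 - giou)
```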
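Finally, a sketch of the inference-time retrieval: score relation embeddings against the text embedding of a queried relation, keep the top-k, then greedily suppress near-duplicate triplets whose subject and object boxes both overlap an already-kept one (a pair-wise NMS). The threshold `iou_thr`, the cosine scoring, and the per-query (rather than batched per-class) formulation are assumptions, not the paper's exact PNMS.

```python
import jax.numpy as jnp


def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    lt = jnp.maximum(a[:2], b[:2])
    rb = jnp.minimum(a[2:], b[2:])
    wh = jnp.maximum(rb - lt, 0.0)
    inter = wh[0] * wh[1]
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)


def retrieve_triplets(rel_emb, subj_boxes, obj_boxes, text_emb, k=100, iou_thr=0.7):
    """rel_emb: [R, d] relation embeddings; text_emb: [d] embedding of the queried relation.

    Returns indices of kept triplets (highest score first) and the full score vector.
    """
    # Cosine-similarity scores of every relation candidate against the text query.
    scores = (rel_emb / jnp.linalg.norm(rel_emb, axis=-1, keepdims=True)) @ (
        text_emb / jnp.linalg.norm(text_emb)
    )
    top = jnp.argsort(-scores)[:k]

    keep = []
    for i in [int(t) for t in top]:  # greedy pair-wise suppression over the top-k list
        duplicate = any(
            float(box_iou(subj_boxes[i], subj_boxes[j])) > iou_thr
            and float(box_iou(obj_boxes[i], obj_boxes[j])) > iou_thr
            for j in keep
        )
        if not duplicate:
            keep.append(i)
    return keep, scores
```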

Links: PapersWithCode, GitHub

Looking forward to the code release!

