arxiv:2201.10168

Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos

Published on Jan 25, 2022

Abstract

Natural Language Video Grounding (NLVG) aims to localize time segments in an untrimmed video according to sentence queries. In this work, we present a new paradigm for NLVG, Explore-And-Match, that seamlessly unifies the strengths of the two main streams of NLVG methods: proposal-free and proposal-based. The former explores the search space to find time segments directly, while the latter matches predefined time segments against ground truths. To achieve this, we formulate NLVG as a set prediction problem and design an end-to-end trainable Language Video Transformer (LVTR) that enjoys two favorable properties: rich contextualization power and parallel decoding. We train LVTR with two losses. First, a temporal localization loss lets the time segments of all queries regress toward targets (explore). Second, a set guidance loss couples each query with its respective target (match). Surprisingly, we find that the training schedule exhibits a divide-and-conquer-like pattern: time segments are first diversified regardless of the targets, then coupled with individual targets, and finally fine-tuned to those targets. Moreover, LVTR is highly efficient and effective: it infers faster than previous baselines (by 2x or more) and achieves competitive results on two NLVG benchmarks (ActivityNet Captions and Charades-STA). Code is available at https://github.com/sangminwoo/Explore-And-Match.
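
The abstract describes a set prediction objective: query segments regress toward targets (temporal localization loss), while a matching step couples each query with a specific target (set guidance loss). Below is a minimal, hypothetical sketch of how such a two-term objective could be wired up in PyTorch with Hungarian matching. The function names, loss weights, background-class handling, and cost terms are illustrative assumptions, not the paper's implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of a set-prediction objective in the spirit of the
# abstract: a temporal localization loss (explore) plus a set guidance
# loss (match). All names and weights are illustrative assumptions.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def segment_l1_cost(pred, gt):
    # pred: (Q, 2), gt: (T, 2) as (center, width) normalized to [0, 1]
    return torch.cdist(pred, gt, p=1)  # (Q, T) pairwise L1 cost


def set_prediction_loss(pred_segments, pred_logits, gt_segments, gt_labels,
                        w_loc=5.0, w_guide=1.0):
    """pred_segments: (Q, 2), pred_logits: (Q, C), gt_segments: (T, 2),
    gt_labels: (T,) e.g. the index of the sentence each segment answers.
    The last logit column (C - 1) is treated as a 'no target' slot."""
    # Matching cost: localization distance minus affinity to the target label.
    cost = (w_loc * segment_l1_cost(pred_segments, gt_segments)
            - w_guide * pred_logits.softmax(-1)[:, gt_labels])
    q_idx, t_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    q_idx = torch.as_tensor(q_idx, device=pred_segments.device)
    t_idx = torch.as_tensor(t_idx, device=pred_segments.device)

    # Temporal localization loss (explore): regress matched segments to targets.
    loss_loc = F.l1_loss(pred_segments[q_idx], gt_segments[t_idx])

    # Set guidance loss (match): couple each query with its assigned target;
    # unmatched queries are pushed toward the 'no target' class.
    target_cls = torch.full((pred_logits.size(0),), pred_logits.size(1) - 1,
                            dtype=torch.long, device=pred_logits.device)
    target_cls[q_idx] = gt_labels[t_idx]
    loss_guide = F.cross_entropy(pred_logits, target_cls)

    return w_loc * loss_loc + w_guide * loss_guide
```

Here the localization loss is applied only to matched query-target pairs, mirroring common DETR-style practice; the paper's actual losses, matching costs, and weighting may differ in form.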
