arxiv:2202.04947

OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos

Published on Oct 26, 2022
Abstract

Audiovisual context enhances temporal action localization in egocentric videos, achieving improved performance over visual-only approaches.

Egocentric videos capture sequences of human activities from a first-person perspective and can provide rich multimodal signals. However, most current localization methods use third-person videos and incorporate only visual information. In this work, we take a deep look into the effectiveness of audiovisual context in detecting actions in egocentric videos and introduce a simple yet effective approach via Observing, Watching, and Listening (OWL). OWL leverages audiovisual information and context for egocentric temporal action localization (TAL). We validate our approach on two large-scale datasets, EPIC-Kitchens and HOMAGE. Extensive experiments demonstrate the relevance of the audiovisual temporal context. Namely, we boost the localization performance (mAP) over visual-only models by +2.23% and +3.35% on the above datasets.
