Papers
arxiv:2308.00688

AnyLoc: Towards Universal Visual Place Recognition

Published on Aug 1, 2023

Abstract

Visual Place Recognition (VPR) is vital for robot localization. To date, the most performant VPR approaches are environment- and task-specific: while they exhibit strong performance in structured environments (predominantly urban driving), their performance degrades severely in unstructured environments, rendering most approaches too brittle for robust real-world deployment. In this work, we develop a universal solution to VPR -- a technique that works across a broad range of structured and unstructured environments (urban, outdoors, indoors, aerial, underwater, and subterranean environments) without any re-training or fine-tuning. We demonstrate that general-purpose feature representations derived from off-the-shelf self-supervised models with no VPR-specific training are the right substrate upon which to build such a universal VPR solution. Combining these derived features with unsupervised feature aggregation enables our suite of methods, AnyLoc, to achieve up to 4x higher performance than existing approaches. We further obtain a 6% improvement in performance by characterizing the semantic properties of these features, uncovering unique domains which encapsulate datasets from similar environments. Our detailed experiments and analysis lay a foundation for building VPR solutions that may be deployed anywhere, anytime, and across any view. We encourage readers to explore our project page and interactive demos: https://anyloc.github.io/.

Community

Proposes AnyLoc: anytime (day vs. night), anywhere (generalizing to structured and unstructured environments such as subterranean, degraded, and underwater settings), any-view Visual Place Recognition (VPR) using general-purpose features from off-the-shelf self-supervised models combined with unsupervised feature aggregation, with no VPR-specific training or fine-tuning. Uses DINOv2 as the self-supervised (foundation) backbone and VLAD and GeM as aggregation/pooling methods, with provision for generic "domain-specific" vocabularies (VLAD cluster centers).

Experiments span several SSL families: joint-embedding models (DINO, DINOv2), contrastive learning (CLIP), and masked image modeling (MAE). Per-patch features are extracted from the facets (key, query, value, and token/output) of intermediate ViT layers/blocks, discarding the CLS token (the global representation). GeM-pooled global descriptors, projected to a low-dimensional (2D) space, are used to group datasets into domains. VLAD (vector of locally aggregated descriptors: build a vocabulary by clustering features, then aggregate and concatenate per-cluster residuals) over the layer-31 value facet of ViT-G/14 DINOv2 features with 32 clusters gives the best VPR performance. The CLS token of foundation models already outperforms supervised (specialized) baselines, aggregation boosts performance further, and domain-specific vocabularies outperform map/dataset-specific and global vocabularies.

The paper also provides VLAD cluster-assignment visualizations and ablations over models (DINOv2 over DINO), facets (key for DINO, value for DINOv2), layers (layer 31 for DINOv2, layer 9 for DINO), and aggregation techniques (hard VLAD better than GeM). From CMU, IIIT Hyderabad, MIT, and the University of Adelaide.
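A minimal sketch of the core recipe summarized above, assuming PyTorch, the public DINOv2 torch.hub weights, and scikit-learn k-means for the vocabulary. The layer/facet/cluster choices (layer 31, value facet, 32 clusters) follow the summary; the hook mechanics, helper names, and the clustering setup are illustrative assumptions, not the authors' exact implementation:

```python
# Sketch: dense "value"-facet features from an intermediate DINOv2 block, aggregated with VLAD.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

# Off-the-shelf DINOv2 backbone, no VPR-specific training. ViT-G/14 is large (~4 GB of weights);
# it is used here only because its block 31 matches the layer reported in the summary.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14").eval()

_value_feats = {}

def _capture_value(module, inputs, output):
    # The qkv projection outputs (B, N, 3*D); the last third is the "value" facet.
    d = output.shape[-1] // 3
    _value_feats["v"] = output[..., 2 * d:]

# Hook the qkv projection of block 31 to grab per-patch value features.
model.blocks[31].attn.qkv.register_forward_hook(_capture_value)

@torch.no_grad()
def dense_features(images):
    """images: (B, 3, H, W) with H, W divisible by 14. Returns (B, N_patches, D)."""
    model(images)
    feats = _value_feats["v"][:, 1:, :]       # drop the CLS token
    return F.normalize(feats, dim=-1)         # L2-normalize each patch feature

def build_vocabulary(feature_bank, n_clusters=32):
    """k-means over per-patch features pooled from a domain -> VLAD cluster centers (K, D)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feature_bank.cpu().numpy())
    return torch.tensor(km.cluster_centers_, dtype=torch.float32)

def vlad(feats, centers):
    """Hard-assignment VLAD: sum residuals per cluster, intra-normalize, flatten, L2-normalize."""
    centers = centers.to(feats.device)                     # feats: (N, D), centers: (K, D)
    assign = torch.cdist(feats, centers).argmin(dim=1)     # nearest center per patch
    K, D = centers.shape
    desc = feats.new_zeros(K, D)
    for k in range(K):
        members = feats[assign == k]
        if len(members) > 0:
            desc[k] = (members - centers[k]).sum(dim=0)
    desc = F.normalize(desc, dim=1)                        # intra-normalization per cluster
    return F.normalize(desc.flatten(), dim=0)              # (K*D,) global descriptor

# Toy usage with placeholder images (224 px = 16 patches of 14 px per side).
imgs = torch.randn(2, 3, 224, 224)
feats = dense_features(imgs)                               # (2, 256, 1536) for ViT-G/14
vocab = build_vocabulary(feats.flatten(0, 1))              # domain vocabulary from pooled features
q, db = vlad(feats[0], vocab), vlad(feats[1], vocab)
print(F.cosine_similarity(q, db, dim=0))                   # place-match score
```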
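And a hedged sketch of GeM pooling as used for the domain-grouping step: one compact global descriptor per dataset, projected to 2D to inspect how datasets cluster into domains. The power p, the PCA projection, and the `dense_features` / `dataset_batches` names in the commented usage are illustrative (the former comes from the sketch above, the latter is a hypothetical iterable), not values or code from the paper:

```python
import torch
from sklearn.decomposition import PCA

def gem(feats: torch.Tensor, p: float = 3.0, eps: float = 1e-6) -> torch.Tensor:
    """Generalized-mean pooling of per-patch features. feats: (N, D) -> (D,) global descriptor."""
    return feats.clamp(min=eps).pow(p).mean(dim=0).pow(1.0 / p)

# Illustrative use: stack one GeM descriptor per dataset, then project to 2D to see
# how datasets group into domains (urban, aerial, subterranean, ...).
# per_dataset = torch.stack([gem(dense_features(batch)[0]) for batch in dataset_batches])
# coords = PCA(n_components=2).fit_transform(per_dataset.numpy())
```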

Links: Website, PapersWithCode, GitHub, HuggingFace Space

Spaces citing this paper 1

Collections including this paper 3