arXiv:2305.15404

RoMa: Revisiting Robust Losses for Dense Feature Matching

Published on May 24, 2023
Authors: Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, Michael Felsberg

Abstract

Dense feature matching is an important computer vision task that involves estimating all correspondences between two images of a 3D scene. In this paper, we revisit robust losses for matching from a Markov chain perspective, yielding theoretical insights and large gains in performance. We begin by constructing a unifying formulation of matching as a Markov chain, based on which we identify two key stages which we argue should be decoupled for matching. The first is the coarse stage, where the estimated result needs to be globally consistent. The second is the refinement stage, where the model needs precise localization capabilities. Inspired by the insight that these stages concern distinct issues, we propose a coarse matcher following the regression-by-classification paradigm that provides excellent globally consistent, albeit not exactly localized, matches. This is followed by a local feature refinement stage using well-motivated robust regression losses, yielding extremely precise matches. Our proposed approach, which we call RoMa, achieves significant improvements compared to the state-of-the-art. Code is available at https://github.com/Parskatt/RoMa

Community

Introduces RoMa. Notes:

- Views dense local feature matching as a Markov chain and derives two stages: global consistency first, then local refinement. The coarse matcher is globally consistent thanks to a regression-by-classification paradigm; local feature refinement uses robust regression losses; the whole pipeline is treated from a probabilistic perspective.
- Markov chain: the probability of the current state depends only on the previous state. Diffusion models learn the reverse of a known forward process (denoising); the paper connects scale-space diffusion to this Markov-chain formulation (define a forward process for matching and learn the reverse process). Here the Markov chain is the multi-scale matching process: the matches at the current scale depend only on the previous scale's estimate and the image features encoded at the current scale (a minimal coarse-to-fine loop is sketched after this list).
- Optimizes a variational lower bound, as in diffusion models.
- Motion boundaries mix at large scales, giving a multi-modal conditional distribution; the coarse matching loss is therefore a KL divergence.
- Decouples the coarse matching and refinement stages; encoders output the features that condition the matching, while decoders estimate the warp. Uses a DINOv2 ViT for coarse features (encoder) and proposes a position-encoding-free transformer decoder; VGG16 local features are used for refinement.
- Instead of a unimodal loss for coarse matching (which works poorly at motion boundaries), predicts a free-form discretized distribution through regression-by-classification, breaking the output space into K quantization levels (Eq. 13; see the second sketch below). Refinement uses the generalized robust loss from Barron, which generalizes the Charbonnier and Cauchy losses/distributions (Eq. 16; third sketch below). The final loss is the sum of both, minimizing the risk on a meta-dataset of image pairs (Eq. 17).
- Same training setup (for outdoor and indoor) as DKM (dense kernelized feature matching for geometry estimation).
- Metrics: AUC (area under the curve, approximated with the trapezoidal rule) and PCK (percent of correct keypoints, i.e. warp precision at pixel thresholds); both are sketched in the last code block below.
- Ablations show the baseline (DKM) is improved by the decoupling and improved further by the robust losses.
- SOTA on outdoor homography/geometry estimation on HPatches, MegaDepth, and IMC 2022 (compared to LoFTR, SuperGlue, ASpanFormer, DKM, PMatch), and also indoors on ScanNet-1500 (AUC metric).
- Future work could incorporate self-supervision as in PMatch and explore other choices of forward process.
- The appendix has dataset licenses, additional benchmarks (MegaDepth and IMC 2022), qualitative examples (robustness to viewpoint change), definitions of the evaluation metrics (pose, homography, and warp precision), the coarse transformer decoder architecture (which also takes the GP-module output from DKM), and training (4 days on 4 A100s) and evaluation details (training resolution 448 for ablations and 560 for the main model; inference at 672, upsampled to 1344).
- From Linköping University and Chalmers (Sweden), and East China UST.
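A minimal sketch of the coarse-to-fine Markov-chain view described above. The `coarse_matcher` and `refiner` modules are hypothetical stand-ins for RoMa's transformer decoder and refinement stages, not the repo's API:

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_match(feats_A, feats_B, coarse_matcher, refiners):
    """Multi-scale matching as a Markov chain (sketch).

    feats_A, feats_B: lists of (B, C, H, W) feature maps, coarsest first,
        with H and W doubling at every level.
    coarse_matcher, refiners: hypothetical modules standing in for RoMa's
        coarse decoder and per-scale refiners.
    The warp at each level depends only on the warp from the previous
    level and the features of the current level (Markov property).
    """
    warp, certainty = coarse_matcher(feats_A[0], feats_B[0])
    for f_A, f_B, refiner in zip(feats_A[1:], feats_B[1:], refiners):
        # Upsample the previous estimate to the current resolution ...
        warp = F.interpolate(warp, scale_factor=2, mode="bilinear",
                             align_corners=False)
        certainty = F.interpolate(certainty, scale_factor=2, mode="bilinear",
                                  align_corners=False)
        # ... and let the refiner predict a correction from the
        # current-level features only.
        warp, certainty = refiner(warp, certainty, f_A, f_B)
    return warp, certainty
```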
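For the coarse loss, a minimal regression-by-classification sketch (one coordinate axis and illustrative names, not the paper's exact Eq. 13). Cross-entropy against the quantized ground-truth bin equals, up to a constant, the KL divergence to that bin's one-hot distribution, and lets the model represent free-form multi-modal distributions:

```python
import torch
import torch.nn.functional as F

def coarse_cls_loss(logits, target, lo=-1.0, hi=1.0):
    """Regression-by-classification sketch (cf. Eq. 13).

    logits: (B, K) unnormalized scores over K quantization levels tiling
        the output range [lo, hi]; one coordinate axis for brevity
        (RoMa models the 2D warp).
    target: (B,) ground-truth coordinates in [lo, hi].
    """
    K = logits.shape[-1]
    # Quantize each continuous target into one of K bins.
    bins = ((target - lo) / (hi - lo) * K).long().clamp(0, K - 1)
    # Cross-entropy = KL to the one-hot bin distribution + const, so
    # multi-modal predictions are not penalized the way a unimodal
    # regression loss penalizes them at motion boundaries.
    return F.cross_entropy(logits, bins)
```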
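The refinement loss is Barron's generalized robust loss; below is a direct implementation of its published form (parameter names alpha and c follow Barron's paper, not necessarily the RoMa code):

```python
import torch

def barron_loss(x, alpha=1.0, c=1.0):
    """Generalized robust loss rho(x, alpha, c) of Barron (CVPR 2019).

    alpha interpolates between familiar losses: alpha=2 ~ L2,
    alpha=1 ~ Charbonnier, alpha->0 ~ Cauchy/Lorentzian,
    alpha=-2 ~ Geman-McClure. The general expression is undefined at
    alpha in {0, 2}, so those are handled as their analytic limits.
    """
    sq = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * sq
    if alpha == 0.0:
        return torch.log1p(0.5 * sq)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((sq / b + 1.0) ** (alpha / 2.0) - 1.0)
```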
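Finally, the two reported metrics in a short sketch. The AUC convention here (cumulative error curve integrated with the trapezoidal rule and normalized by the threshold) is the one commonly used for pose/homography benchmarks; PCK is plain thresholded warp precision:

```python
import numpy as np

def pose_auc(errors, thresholds=(5, 10, 20)):
    """AUC of the cumulative pose-error curve at several thresholds,
    approximated with the trapezoidal rule. errors: per-pair errors."""
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        idx = np.searchsorted(errors, t)
        e = np.concatenate((errors[:idx], [t]))
        r = np.concatenate((recall[:idx], [recall[idx - 1]]))
        aucs.append(float(np.trapz(r, x=e)) / t)
    return aucs

def pck(pred_pts, gt_pts, thresholds=(1, 3, 5)):
    """Percent of correct keypoints: fraction of warped points within a
    pixel threshold of the ground truth. pred_pts, gt_pts: (N, 2)."""
    dists = np.linalg.norm(pred_pts - gt_pts, axis=-1)
    return [float((dists <= t).mean()) for t in thresholds]
```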

Links: website, arXiv (related: DKM, PMatch), Papers with Code, GitHub; also see a quick YouTube intro to VAEs (for the probabilistic background) and Mishkin's WxBS results tweet.

