Optimal Transport Aggregation for Visual Place Recognition
Abstract
The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.
Community
Proposes SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors) for VPR: reformulates NetVLAD's soft cluster assignment as an optimal transport problem, considering both feature-to-cluster and cluster-to-feature relations, and introduces a 'dustbin' cluster for discarding non-informative features. Uses DINOv2 as the backbone and fine-tunes it (AnyLoc uses DINOv2 off-the-shelf); unfreezing and training the last blocks of the DINOv2 encoder gives substantial gains. Outperforms re-ranking baselines with just global retrieval.

Method (see the code sketch after this list):
- A single forward pass through DINOv2 yields local patch descriptors and a global CLS descriptor.
- Local patch features pass through two shared FC layers to produce the score matrix, i.e., the assignment prior (which NetVLAD instead initializes from k-means centroids).
- One extra column, a single trainable scalar, is appended to the score matrix for the dustbin assignment that discards uninformative features; the score matrix is now of shape (n, m+1) for n local patch features and m clusters.
- The Sinkhorn algorithm (following SuperGlue and prior work; see Eq. 3) computes the optimal transport assignment: each feature's mass must sum to 1 and be effectively distributed among the clusters or the dustbin. The dustbin column is then dropped.
- Instead of concatenating residuals: reduce dimensionality (pass all local patch features through a shared two-FC-layer network), then aggregate by directly taking the sum of assignment weights times features (no centroid subtraction; see Eq. 5), yielding a VLAD matrix of shape (m, l).
- The global CLS token from DINOv2 is projected through a two-layer MLP and concatenated with the aggregated result; the aggregate matrix is intra-normalized and the final descriptor L2-normalized, as in VLAD/NetVLAD.

Architecture and training: DINOv2 ViT-B with the last 4 layers trainable; 64 clusters; patches reduced from 768 to 128 dimensions and the CLS token to 256, giving 8448-dim descriptors. MixVPR-style training for 4 epochs on GSV-Cities with a multi-similarity loss; converges in 30 minutes on an RTX 3090.

Results: beats NetVLAD, GeM, CosPlace, MixVPR, and EigenPlaces on MSLS, NordLand, Pitts250k-test, and SPED (higher recalls), and outperforms two-stage methods such as Patch-NetVLAD, TransVPR, and R2Former (higher recall and lower latency). Ablations show that DINOv2 is a better backbone for feature extraction, that DINOv2 + NetVLAD produces a much longer descriptor, and that DINOv2 ViT-B works best (larger models overfit the training set); further ablations cover individual components and the number of DINOv2 ViT layers to unfreeze. Qualitative heatmaps of local feature weights not assigned to the dustbin show the model focusing on distinctive scene elements for VPR. From the University of Zaragoza (Spain).
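Below is a minimal PyTorch sketch of the described aggregation (score head with a dustbin column, Sinkhorn normalization, shared dimensionality reduction, weighted-sum aggregation, and CLS concatenation). The hidden layer widths, ReLU activations, number of Sinkhorn iterations, and the simplified uniform-marginal normalization are illustrative assumptions, not the authors' exact implementation; see the official repository for the reference code.

```python
# Sketch of SALAD-style aggregation. Module names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def log_sinkhorn(scores, num_iters=3):
    """Sinkhorn normalization in log space over an (B, n, m+1) score matrix.

    Rows are local features, columns are clusters plus a trailing dustbin.
    Simplified alternating normalization; the paper's exact marginals
    (how much mass the dustbin may absorb) may differ.
    """
    log_p = scores
    for _ in range(num_iters):
        # Normalize over clusters/dustbin so each feature's mass sums to 1 ...
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)
        # ... then over features, spreading mass across clusters and the dustbin.
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)
    return log_p


class SALADAggregator(nn.Module):
    def __init__(self, in_dim=768, proj_dim=128, num_clusters=64, cls_dim=256):
        super().__init__()
        # Two shared FC layers produce per-feature cluster scores (assignment prior).
        self.score_head = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, num_clusters)
        )
        # Single learnable scalar appended as the dustbin column.
        self.dustbin = nn.Parameter(torch.zeros(1))
        # Two shared FC layers reduce feature dimensionality before aggregation.
        self.reduce = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, proj_dim)
        )
        # Two-layer MLP projecting the DINOv2 CLS token.
        self.cls_proj = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, cls_dim)
        )

    def forward(self, patch_feats, cls_token):
        # patch_feats: (B, n, in_dim) local patch features; cls_token: (B, in_dim).
        B, n, _ = patch_feats.shape
        scores = self.score_head(patch_feats)                 # (B, n, m)
        dust = self.dustbin.expand(B, n, 1)                   # (B, n, 1)
        scores = torch.cat([scores, dust], dim=-1)            # (B, n, m+1)

        assign = log_sinkhorn(scores).exp()[..., :-1]         # drop the dustbin column
        feats = self.reduce(patch_feats)                      # (B, n, proj_dim)

        # VLAD-like aggregation: assignment-weighted sum of features per cluster.
        vlad = torch.einsum('bnm,bnl->bml', assign, feats)    # (B, m, proj_dim)
        vlad = F.normalize(vlad, dim=-1)                      # intra-normalization
        vlad = vlad.flatten(1)

        g = self.cls_proj(cls_token)                          # (B, cls_dim)
        desc = torch.cat([vlad, g], dim=-1)                   # (B, m*proj_dim + cls_dim)
        return F.normalize(desc, dim=-1)                      # final L2 normalization


# Usage with dummy DINOv2 ViT-B-sized tensors: 64*128 + 256 = 8448-dim descriptor.
agg = SALADAggregator()
patches, cls = torch.randn(2, 256, 768), torch.randn(2, 768)
print(agg(patches, cls).shape)  # torch.Size([2, 8448])
```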
Links: arxiv (related: Sinkhorn Distances, MixVPR), GitHub