arxiv:2606.05011

CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

Published on Jun 3 · Submitted by Yurim Jeon on Jun 9

Abstract

CIPER is a unified cross-view geo-localization framework that simultaneously performs city-scale retrieval and precise 3-DoF pose estimation using a shared transformer encoder and two-way pose decoder.

Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
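As a rough illustration of the two outputs the abstract describes, the sketch below pairs a retrieval step with a 3-DoF pose error. It is not the CIPER implementation. It assumes that retrieval ranks aerial descriptors by cosine similarity and that the 3-DoF pose is a planar position plus a yaw angle in degrees; neither detail is stated in the abstract. All function names and the toy data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query, database, k=1):
    """Indices of the k aerial descriptors most similar to the ground query."""
    ranked = sorted(range(len(database)),
                    key=lambda i: cosine(query, database[i]),
                    reverse=True)
    return ranked[:k]

def wrap_angle(deg):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (deg + 180.0) % 360.0 - 180.0

def pose_error(pred, truth):
    """Translation error (units of x, y) and absolute yaw error in degrees.

    Assumption: a pose is a tuple (x, y, yaw_degrees).
    """
    dx, dy = pred[0] - truth[0], pred[1] - truth[1]
    return math.hypot(dx, dy), abs(wrap_angle(pred[2] - truth[2]))

# Hypothetical toy data: three aerial descriptors and one ground query.
aerial_db = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
ground_query = [0.5, 0.9, 0.0]

top = retrieve(ground_query, aerial_db, k=2)
trans_err, yaw_err = pose_error((3.0, 4.0, 350.0), (0.0, 0.0, 10.0))
print(top, trans_err, yaw_err)
```

The yaw error is wrapped so that a prediction of 350° against a ground truth of 10° counts as a 20° error rather than 340°, which matters when evaluating under arbitrary orientation.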

Community

Paper author Paper submitter

Excited to share CIPER (Cross-view Image-retrieval and Pose-estimation transformER)!
Cross-view geo-localization—locating a ground image via a database of aerial images—is usually tackled either by retrieval (wide coverage, limited accuracy) or pose estimation (high accuracy, narrow search space). Cascading the two leads to error propagation and inconsistent features.
We instead unify both tasks in a single architecture, enabling city-scale retrieval and precise 3-DoF pose estimation to benefit from shared feature learning. CIPER uses a shared transformer encoder with task-specific tokens, a two-way pose decoder with bidirectional cross-attention to bridge the ground–aerial domain gap, and a set prediction strategy for stable 3-DoF regression.
Experiments on VIGOR, KITTI, and Ford Multi-AV show competitive performance, especially under limited field-of-view and arbitrary orientation—establishing CIPER as a robust baseline for unified cross-view localization.
Feedback and discussion welcome!


