One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models
Abstract
A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel. Transparent scenes make this ambiguity measurable: the same ray can pass through foreground glass and observe the background, turning the supervised target into a convention of annotation, data, and training rather than a scene-intrinsic truth. A learned predictor exposes this convention as its depth-layer preference. We introduce MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for measuring depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA). On MD-3k, leading depth foundation models exhibit diverse layer preferences under standard RGB input, showing that the same layered geometry can be resolved differently across models. We further find that Laplacian Visual Prompting (LVP), a training-free spectral input transformation, can substantially change the reported layer for certain frozen models. The strongest RGB/LVP pair, DAv2-L, reaches 75.5% ML-SRA. These results suggest that depth foundation models may express complementary geometric hypotheses that standard RGB inference leaves unexpressed. We invite the community to rethink depth supervision and evaluation through an ambiguity-aware lens, where multiple valid 3D interpretations are treated as geometric structure to be measured, preserved, and expressed.
Community
One Scene, Two Depths studies a simple but overlooked question in monocular depth foundation models: under layered visibility, when one visual ray contains multiple visible and geometrically valid depths, which depth does the model choose?
Our key view is that single-depth prediction under ambiguity exposes a model’s depth-layer preference, rather than an unbiased scene-intrinsic truth. The label itself can become a convention shaped by sensors, annotation, datasets, training mixtures, and evaluation metrics.
We introduce MultiDepth-3k (MD-3k), a real-world transparent-scene benchmark with sparse two-layer ordinal annotations, to measure whether a model reports the transparent foreground or the visible background. We further propose Laplacian Visual Prompting (LVP), a training-free spectral input transformation that queries the same frozen model differently.
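To make the idea of a training-free spectral input transformation concrete, here is a minimal sketch of what a Laplacian-based prompt could look like. It assumes a 4-neighbour discrete Laplacian kernel applied per channel, followed by min-max rescaling; the kernel choice, the rescaling, and the name `laplacian_prompt` are illustrative assumptions and not necessarily the authors' exact LVP recipe.

```python
import numpy as np

# Assumed kernel: the standard 4-neighbour discrete Laplacian.
# The paper's LVP may use a different kernel or normalisation.
LAPLACIAN_KERNEL = np.array([[0, 1, 0],
                             [1, -4, 1],
                             [0, 1, 0]], dtype=np.float32)


def laplacian_prompt(rgb: np.ndarray) -> np.ndarray:
    """Filter each channel of an HxWx3 image with the Laplacian kernel.

    The response is min-max rescaled to [0, 1] so it can be passed to a
    frozen depth model in place of the RGB image (hypothetical usage).
    """
    img = rgb.astype(np.float32)
    h, w = img.shape[:2]
    # Replicate border pixels so the output keeps the input resolution.
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    # Accumulate the 3x3 weighted sum of shifted copies of the image.
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN_KERNEL[dy, dx] * padded[dy:dy + h, dx:dx + w, :]
    lo, hi = float(out.min()), float(out.max())
    if hi - lo < 1e-8:
        # A flat image has no Laplacian response.
        return np.zeros_like(out)
    return (out - lo) / (hi - lo)
```

In this sketch the same frozen model would be run twice, once on the RGB image and once on `laplacian_prompt(rgb)`, yielding two depth maps that can then be compared against the two annotated layers.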
A key finding is that some frozen single-output depth models can express complementary depth hypotheses under RGB and LVP inputs. On MD-3k, the strongest RGB/LVP pair reaches 75.5% ML-SRA, above the strict 56.4% ceiling attainable when a single hypothesis is duplicated across both layers. It also reaches 52.2% on Reverse cases, where by construction one depth map cannot satisfy both valid layer relations.
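The reasoning behind the Reverse cases can be illustrated with a small sketch. It assumes each sparse annotation is a point pair with one ordinal label per layer, and that a pair counts as correct only when both layers' relations are satisfied; this scoring rule, the tuple layout, and the function names are assumptions made for illustration, not the benchmark's published definition of ML-SRA.

```python
import numpy as np


def predicted_relation(depth: np.ndarray, p: tuple, q: tuple) -> int:
    """Return +1 if p is predicted closer than q, else -1.

    Assumes smaller values mean closer; flip the comparison for models
    that output inverse depth (disparity).
    """
    return 1 if depth[p] < depth[q] else -1


def ml_sra(depth_a: np.ndarray, depth_b: np.ndarray, pairs: list) -> float:
    """Fraction of pairs where depth_a matches the layer-1 relation
    and depth_b matches the layer-2 relation (assumed scoring rule).

    Each entry of `pairs` is (p, q, rel_layer1, rel_layer2) with
    p, q as (row, col) indices and relations in {+1, -1}.
    """
    hits = 0
    for p, q, rel1, rel2 in pairs:
        ok1 = predicted_relation(depth_a, p, q) == rel1
        ok2 = predicted_relation(depth_b, p, q) == rel2
        hits += int(ok1 and ok2)
    return hits / len(pairs)


# Toy scene: glass covers pixel (0, 0) only.
foreground = np.array([[1.0, 3.0, 4.0]])  # reports the glass surface
background = np.array([[5.0, 3.0, 4.0]])  # reports what is seen through it

pairs = [
    ((0, 0), (0, 1), +1, -1),  # Reverse pair: the relation flips between layers
    ((0, 1), (0, 2), +1, +1),  # consistent pair: same relation in both layers
]

two_hypotheses = ml_sra(foreground, background, pairs)
duplicated = ml_sra(foreground, foreground, pairs)
```

Under this assumed rule, a single depth map duplicated for both layers necessarily fails every Reverse pair, since it cannot report two opposite orderings for the same two points, whereas two complementary maps can satisfy both relations.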
The broader implication is that single-depth prediction may be an incomplete interface for learned 3D world models: standard RGB inference may reveal only one preferred slice of richer multi-layer geometric knowledge.