DinoDepth: SDT depth heads on a frozen DINOv3 backbone

Trained Simple Depth Transformer (SDT) decoder heads for zero-shot affine-invariant (relative) monocular depth, reproducing AnyDepth (arXiv:2601.02760) on a frozen DINOv3 backbone. Only the small SDT decoder is trained; the DINOv3 encoder is frozen and loaded separately from Meta's checkpoints. Two heads are provided in this repo:

| File | Backbone | Decoder params | Train |
| --- | --- | --- | --- |
| sdt-vitl16.safetensors | DINOv3 ViT-L/16 | 13.4 M | 5 epochs |
| sdt-vits16.safetensors | DINOv3 ViT-S/16 | 5.5 M | 10 epochs |

These are decoder weights only (about 13.4 M and 5.5 M parameters). Pair each with its matching frozen DINOv3 backbone (facebook/dinov3-vitl16-pretrain-lvd1689m / -vits16-).

Zero-shot benchmark (our protocol)

AbsRel ↓ / δ1 ↑ on NYUv2 (Eigen 654) and KITTI (Eigen 652), scored with per-image least-squares scale+shift alignment in disparity space, Eigen/Garg crop, and a depth cap of 10 m (NYUv2) / 80 m (KITTI).

| Model | NYU AbsRel | NYU δ1 | KITTI AbsRel | KITTI δ1 |
| --- | --- | --- | --- | --- |
| ViT-L/16 + SDT (this repo) | 0.068 | 0.955 | 0.093 | 0.911 |
| ViT-S/16 + SDT (this repo) | 0.091 | 0.917 | 0.115 | 0.852 |
| AnyDepth ViT-L (paper) | 0.060 | - | 0.086 | - |
| AnyDepth ViT-S (paper) | 0.082 | - | 0.102 | - |

A faithful reproduction: it trails the paper by 0.007 to 0.013 AbsRel on each benchmark, consistently across both backbones. The gap plausibly comes from the augmentation and data-filtering details that AnyDepth underspecifies.
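
The scoring protocol described above (per-image least-squares scale+shift alignment in disparity space, then AbsRel and δ1 on capped depth) can be sketched roughly as follows. This is a minimal NumPy illustration, not the evaluation script behind the table: the function names, the `min_depth` threshold, and the clamping of aligned disparity are assumptions, and the Eigen/Garg crop is omitted.

```python
import numpy as np

def align_scale_shift(pred_disp, gt_disp, mask):
    """Closed-form least squares: find s, t minimising sum((s * pred + t - gt)^2) over valid pixels."""
    p = pred_disp[mask].astype(np.float64)
    g = gt_disp[mask].astype(np.float64)
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred_disp + t

def absrel_delta1(pred_disp, gt_depth, max_depth, min_depth=1e-3):
    """Return (AbsRel, delta1) for one image; min_depth is an assumed validity threshold."""
    mask = (gt_depth > min_depth) & (gt_depth < max_depth)
    gt_disp = np.zeros_like(gt_depth, dtype=np.float64)
    gt_disp[mask] = 1.0 / gt_depth[mask]
    aligned = align_scale_shift(pred_disp, gt_disp, mask)
    # Assumed safeguard: clamp disparity so the inverted depth never exceeds the cap.
    aligned = np.clip(aligned, 1.0 / max_depth, None)
    pred_depth = 1.0 / aligned
    p, g = pred_depth[mask], gt_depth[mask]
    absrel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return float(absrel), float(delta1)
```

Under these assumptions, `max_depth` would be 10.0 for NYUv2 and 80.0 for KITTI, and the per-image scores would be averaged over the test split.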

Usage

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from dinov3_depth.head import DepthModel, DepthModelConfig

# Frozen DINOv3 ViT-L/16 + (randomly-initialised) SDT head; default config matches the trained head
# (GroupNorm, fusion_channels=256).
model = DepthModel.from_pretrained(DepthModelConfig(backbone="vitl16"))
# Download the trained decoder weights and load them into the head.
head_path = hf_hub_download("blanchon/dinodepth-model", "sdt-vitl16.safetensors")
model.head.load_state_dict(load_file(head_path))
model.eval()

# images: float [B, 3, H, W] in [0, 1], H and W multiples of 16. Returns affine-invariant disparity.
images = torch.rand(1, 3, 768, 768)  # placeholder batch; replace with real images
with torch.inference_mode():
    disparity = model(images)

(Use backbone="vits16" + sdt-vits16.safetensors for the small head.)
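
Because the model expects H and W to be multiples of 16, arbitrary images need to be resized (or padded) first. The helper below is one possible way to pick a target size; it is a sketch, not part of this repo. The names `round_to_multiple` and `target_size` are hypothetical, and the default `long_side=768` is an assumption borrowed from the training resolution.

```python
def round_to_multiple(value, multiple=16):
    """Round to the nearest multiple, never going below one patch."""
    return max(multiple, int(round(value / multiple)) * multiple)

def target_size(height, width, long_side=768, multiple=16):
    """Scale so the longer side is about `long_side`, then snap both sides to the patch grid."""
    scale = long_side / max(height, width)
    return (
        round_to_multiple(height * scale, multiple),
        round_to_multiple(width * scale, multiple),
    )
```

The returned `(H, W)` pair can then be passed as the `size` argument of a resize such as `torch.nn.functional.interpolate` before calling the model. Snapping to the grid changes the aspect ratio slightly; padding is an alternative if that matters.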

Architecture & training

  • Encoder: frozen DINOv3 ViT (LVD-1689M), patch 16; 4 intermediate layers tapped.
  • Decoder (SDT): softmax-fuse the 4 tapped layers at the patch grid → depthwise detail enhancer → two learned DySample ×4 upsamplers → output conv. GroupNorm, fusion width 256.
  • Loss: scale-and-shift-invariant + multi-scale gradient matching (1:2), on disparity.
  • Data: the harmonized 369K-image corpus at blanchon/dinodepth-dataset (Hypersim, VKITTI2, BlendedMVS, IRS, TartanAir). 768² input, AdamW lr 1e-3, PolyLR.
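
The loss named in the list above can be illustrated with a small NumPy sketch. It is written under several assumptions and is not the training code: the alignment is a least-squares fit with a mean-squared residual, the gradient term uses 4 scales obtained by simple subsampling, and the "1:2" ratio is read as weights of 1 for the scale-and-shift-invariant term and 2 for the gradient term. All function names are hypothetical.

```python
import numpy as np

def ssi_align(pred, target, mask):
    """Least-squares scale s and shift t so that s * pred + t best matches target on valid pixels."""
    p, g = pred[mask], target[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

def gradient_matching(residual, mask, num_scales=4):
    """Sum over scales of the mean absolute x/y differences of the residual."""
    total = 0.0
    for k in range(num_scales):
        step = 2 ** k
        r, m = residual[::step, ::step], mask[::step, ::step]
        # Count a difference only where both neighbouring pixels are valid.
        mx = m[:, 1:] & m[:, :-1]
        my = m[1:, :] & m[:-1, :]
        gx = np.abs(r[:, 1:] - r[:, :-1]) * mx
        gy = np.abs(r[1:, :] - r[:-1, :]) * my
        total += (gx.sum() + gy.sum()) / max(int(m.sum()), 1)
    return total

def depth_loss(pred, target, mask, w_ssi=1.0, w_grad=2.0):
    """Scale-and-shift-invariant term plus weighted multi-scale gradient matching, on disparity."""
    aligned = ssi_align(pred, target, mask)
    residual = (aligned - target) * mask
    ssi = (residual ** 2).sum() / max(int(mask.sum()), 1)
    return w_ssi * ssi + w_grad * gradient_matching(residual, mask)
```

By construction, any prediction that is an affine transform of the target disparity yields a loss of (numerically) zero, which is what makes the objective suitable for relative depth.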
