DinoDepth — SDT depth heads on a frozen DINOv3 backbone

Trained Simple Depth Transformer (SDT) decoder heads for zero-shot affine-invariant (relative) monocular depth, reproducing AnyDepth (arXiv:2601.02760) on a frozen DINOv3 backbone. Only the small SDT decoder is trained; the DINOv3 encoder is frozen and loaded separately from Meta's checkpoints. Two heads are provided in this repo:

File	Backbone	Decoder params	Training schedule
`sdt-vitl16.safetensors`	DINOv3 ViT-L/16	13.4 M	5 epochs
`sdt-vits16.safetensors`	DINOv3 ViT-S/16	5.5 M	10 epochs

These files contain decoder weights only (13.4 M and 5.5 M parameters, respectively). Pair each with its matching frozen DINOv3 backbone (facebook/dinov3-vitl16-pretrain-lvd1689m / -vits16-).

Zero-shot benchmark (our protocol)

AbsRel ↓ / δ1 ↑ on NYUv2 (Eigen split, 654 images) and KITTI (Eigen split, 652 images). Predictions are scored after per-image least-squares scale-and-shift alignment in disparity space, with the Eigen / Garg crop and depth capped at 10 m / 80 m, respectively.

Model	NYU AbsRel	NYU δ1	KITTI AbsRel	KITTI δ1
ViT-L/16 + SDT (this repo)	0.068	0.955	0.093	0.911
ViT-S/16 + SDT (this repo)	0.091	0.917	0.115	0.852
AnyDepth ViT-L (paper)	0.060	—	0.086	—
AnyDepth ViT-S (paper)	0.082	—	0.102	—

A faithful reproduction: AbsRel is 0.007 to 0.013 behind the paper on every benchmark. The gap is consistent across both backbones, so it plausibly comes from augmentation and data-filtering details that AnyDepth underspecifies.
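The alignment and metrics in the protocol above can be sketched in a few lines of NumPy. This is a minimal illustration, not the evaluation script behind the table: the function names, the `min_depth` floor, and the clamping of aligned disparity are assumptions, and the Eigen / Garg crop is left out for brevity.

```python
import numpy as np

def align_scale_shift(pred_disp, gt_disp, mask):
    """Least-squares (s, t) minimising ||s * pred + t - gt||^2 over valid pixels."""
    x = pred_disp[mask].astype(np.float64)
    y = gt_disp[mask].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def evaluate(pred_disp, gt_depth, max_depth, min_depth=1e-3):
    """Return (AbsRel, delta1) for one image after alignment in disparity space."""
    # Valid pixels: ground truth inside (min_depth, max_depth]; min_depth is an assumed floor.
    mask = (gt_depth > min_depth) & (gt_depth <= max_depth)
    gt_disp = np.zeros_like(gt_depth, dtype=np.float64)
    gt_disp[mask] = 1.0 / gt_depth[mask]

    s, t = align_scale_shift(pred_disp, gt_disp, mask)
    aligned = s * pred_disp + t
    # Clamp so the inverted depth stays within [min_depth, max_depth] (assumed convention).
    aligned = np.clip(aligned, 1.0 / max_depth, 1.0 / min_depth)
    pred_depth = 1.0 / aligned

    p, g = pred_depth[mask], gt_depth[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return float(abs_rel), float(delta1)
```

Calling `evaluate(pred_disp, gt_depth, max_depth=10.0)` would correspond to the NYUv2 cap and `max_depth=80.0` to the KITTI cap; dataset-level numbers are then the mean over images.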

Usage

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from dinov3_depth.head import DepthModel, DepthModelConfig

# Build the frozen DINOv3 ViT-L/16 backbone plus a randomly initialised SDT head.
# The default config matches the trained head (GroupNorm, fusion_channels=256).
model = DepthModel.from_pretrained(DepthModelConfig(backbone="vitl16"))

# Download the trained decoder weights and load them into the head.
head = hf_hub_download("blanchon/dinodepth-model", "sdt-vitl16.safetensors")
model.head.load_state_dict(load_file(head))
model.eval()

# images: float tensor [B, 3, H, W] in [0, 1], with H and W multiples of 16.
# A random batch stands in for real images here.
images = torch.rand(1, 3, 768, 768)
with torch.no_grad():
    disparity = model(images)  # affine-invariant disparity

(Use backbone="vits16" + sdt-vits16.safetensors for the small head.)
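Real photographs rarely have sides that are multiples of 16, so some preprocessing is needed before the call above. The helper below is one possible sketch, assuming an RGB image already decoded into a `uint8` NumPy array: it centre-crops to the nearest lower multiple of 16 and rescales to [0, 1]. The name `to_model_input` is hypothetical, and centre-cropping is only one option (resizing or padding would also satisfy the constraint).

```python
import numpy as np

def to_model_input(image_uint8, multiple=16):
    """HWC uint8 RGB -> float32 [1, 3, H', W'] in [0, 1], with H', W' multiples of `multiple`."""
    h, w, c = image_uint8.shape
    if c != 3:
        raise ValueError("expected an RGB image with 3 channels")
    new_h, new_w = (h // multiple) * multiple, (w // multiple) * multiple
    if new_h == 0 or new_w == 0:
        raise ValueError(f"image must be at least {multiple} pixels on each side")

    # Centre crop: drops at most multiple - 1 pixels per axis.
    top, left = (h - new_h) // 2, (w - new_w) // 2
    crop = image_uint8[top:top + new_h, left:left + new_w]

    # HWC -> CHW, scale to [0, 1], add a batch dimension.
    chw = crop.astype(np.float32).transpose(2, 0, 1) / 255.0
    return np.ascontiguousarray(chw[None])
```

The result can be wrapped with `torch.from_numpy(...)` and passed to the model as `images`.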

Architecture & training

Encoder: frozen DINOv3 ViT (LVD-1689M), patch 16; 4 intermediate layers tapped.
Decoder (SDT): softmax-fuse the 4 tapped layers at the patch grid → depthwise detail enhancer → two learned DySample ×4 upsamplers → output conv. GroupNorm, fusion width 256.
Loss: scale-and-shift-invariant + multi-scale gradient matching (1:2), on disparity.
Data: the harmonized 369K-image corpus at blanchon/dinodepth-dataset (Hypersim, VKITTI2, BlendedMVS, IRS, TartanAir). 768² input, AdamW lr 1e-3, PolyLR.
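To make the loss line concrete, here is a NumPy sketch of a scale-and-shift-invariant term combined with multi-scale gradient matching, following the widely used MiDaS-style formulation. It is an assumption that training used this exact form: the least-squares alignment, the L1 residual, the four scales, and the default weights (`w_ssi=1.0`, `w_grad=2.0`, one reading of the "1:2" ratio) are all illustrative choices, and every function name is hypothetical.

```python
import numpy as np

def align(pred, target, mask):
    """Least-squares scale and shift mapping pred onto target over valid pixels."""
    x, y = pred[mask], target[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t

def ssi_loss(pred, target, mask):
    """Mean absolute residual after alignment, on valid pixels only."""
    aligned = align(pred, target, mask)
    return float(np.mean(np.abs(aligned[mask] - target[mask])))

def gradient_matching_loss(pred, target, mask, scales=4):
    """Sum over scales of the mean absolute gradient of the aligned residual."""
    residual = np.where(mask, align(pred, target, mask) - target, 0.0)
    total = 0.0
    for k in range(scales):
        step = 2 ** k  # subsample by 1, 2, 4, 8
        r, m = residual[::step, ::step], mask[::step, ::step]
        # Count a gradient only where both neighbouring pixels are valid.
        mx = m[:, 1:] & m[:, :-1]
        my = m[1:, :] & m[:-1, :]
        gx = np.abs(r[:, 1:] - r[:, :-1]) * mx
        gy = np.abs(r[1:, :] - r[:-1, :]) * my
        total += (gx.sum() + gy.sum()) / max(m.sum(), 1)
    return float(total)

def total_loss(pred, target, mask, w_ssi=1.0, w_grad=2.0):
    """Weighted sum of the two terms; both inputs are disparity maps."""
    return w_ssi * ssi_loss(pred, target, mask) + w_grad * gradient_matching_loss(pred, target, mask)
```

Because the prediction is aligned before either term is computed, any prediction that differs from the target only by a global scale and shift yields a loss of approximately zero, which is what makes the objective affine-invariant.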

References

AnyDepth: Depth Estimation Made Easy (arXiv:2601.02760) · DINOv3 (arXiv:2508.10104)
Dataset: blanchon/dinodepth-dataset
