DinoDepth: SDT depth heads on a frozen DINOv3 backbone
Trained Simple Depth Transformer (SDT) decoder heads for zero-shot affine-invariant (relative) monocular depth, reproducing AnyDepth (arXiv:2601.02760) on a frozen DINOv3 backbone. Only the small SDT decoder is trained; the DINOv3 encoder is frozen and loaded separately from Meta's checkpoints. Two heads are provided in this repo:
| File | Backbone | Decoder params | Train |
|---|---|---|---|
| sdt-vitl16.safetensors | DINOv3 ViT-L/16 | 13.4 M | 5 epochs |
| sdt-vits16.safetensors | DINOv3 ViT-S/16 | 5.5 M | 10 epochs |
These are decoder weights only (13.4 M and 5.5 M parameters). Pair each with its matching frozen DINOv3 backbone (facebook/dinov3-vitl16-pretrain-lvd1689m / -vits16-).
Zero-shot benchmark (our protocol)
AbsRel (lower is better) and δ1 (higher is better) on NYUv2 (Eigen 654) and KITTI (Eigen 652), scored with per-image least-squares scale+shift alignment in disparity space, with the Eigen/Garg crop and a 10 m / 80 m depth cap respectively.
| Model | NYU AbsRel | NYU δ1 | KITTI AbsRel | KITTI δ1 |
|---|---|---|---|---|
| ViT-L/16 + SDT (this repo) | 0.068 | 0.955 | 0.093 | 0.911 |
| ViT-S/16 + SDT (this repo) | 0.091 | 0.917 | 0.115 | 0.852 |
| AnyDepth ViT-L (paper) | 0.060 | - | 0.086 | - |
| AnyDepth ViT-S (paper) | 0.082 | - | 0.102 | - |
A faithful reproduction: roughly 0.01 AbsRel behind the paper on each benchmark (0.007 to 0.013 across the four comparisons), consistent across both backbones. A plausible cause is the augmentation and data-filtering details that AnyDepth underspecifies.
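The scoring protocol above can be sketched in NumPy. This is a minimal illustration rather than the evaluation script used for the table: the function names, the `min_depth` floor, and the clamping of the aligned disparity are assumptions, and the Eigen/Garg crop is left out for brevity.

```python
import numpy as np

def align_scale_shift(pred_disp, gt_disp, mask):
    """Least-squares scale s and shift t so that s * pred + t fits gt on valid pixels."""
    x = pred_disp[mask].astype(np.float64)
    y = gt_disp[mask].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def depth_metrics(pred_disp, gt_depth, max_depth, min_depth=1e-3):
    """Return (AbsRel, delta1) after per-image alignment in disparity space.

    max_depth is the depth cap (10 m for NYUv2, 80 m for KITTI).
    min_depth is an assumed floor used to discard invalid ground truth.
    """
    gt_depth = gt_depth.astype(np.float64)
    mask = (gt_depth > min_depth) & (gt_depth <= max_depth)

    # Ground-truth disparity is the inverse depth on valid pixels.
    gt_disp = np.zeros_like(gt_depth)
    gt_disp[mask] = 1.0 / gt_depth[mask]

    s, t = align_scale_shift(pred_disp, gt_disp, mask)
    aligned = s * pred_disp + t

    # Clamp so the recovered depth stays inside [min_depth, max_depth] (assumption).
    aligned = np.clip(aligned, 1.0 / max_depth, 1.0 / min_depth)
    pred_depth = 1.0 / aligned

    p, g = pred_depth[mask], gt_depth[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return float(abs_rel), float(delta1)
```

Because the alignment is fitted per image, any prediction that is an affine function of the true disparity scores a perfect AbsRel of 0 and δ1 of 1; the metrics only measure errors that a scale and shift cannot explain.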
Usage
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from dinov3_depth.head import DepthModel, DepthModelConfig
# Frozen DINOv3 ViT-L/16 + (randomly initialised) SDT head; the default config matches the trained head
# (GroupNorm, fusion_channels=256).
model = DepthModel.from_pretrained(DepthModelConfig(backbone="vitl16"))
# Download the trained SDT head and load it over the random initialisation.
head = hf_hub_download("blanchon/dinodepth-model", "sdt-vitl16.safetensors")
model.head.load_state_dict(load_file(head))
model.eval()
# images: float tensor [B, 3, H, W] in [0, 1], with H and W multiples of 16.
# Returns affine-invariant disparity.
with torch.no_grad():
    disparity = model(images)
(Use backbone="vits16" + sdt-vits16.safetensors for the small head.)
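Since the model expects H and W to be multiples of 16, arbitrary image sizes need to be adjusted first. The helper below is a hypothetical sketch (the name `snap_to_patch_grid` is not part of this repo) that computes the nearest valid size; whether to resize or pad to that size is a choice this README does not specify.

```python
def snap_to_patch_grid(height, width, patch=16):
    """Round each side to the nearest positive multiple of `patch`."""
    def snap(n):
        # Integer rounding to the nearest multiple, never below one patch.
        return max(patch, (n + patch // 2) // patch * patch)
    return snap(height), snap(width)

# Example: a KITTI-sized frame of 375 x 1242 pixels.
target_h, target_w = snap_to_patch_grid(375, 1242)
```

One option is then to resize the batch with `torch.nn.functional.interpolate(images, size=(target_h, target_w), mode="bilinear", align_corners=False)` before calling the model, and to resize the predicted disparity back to the original resolution afterwards.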
Architecture & training
- Encoder: frozen DINOv3 ViT (LVD-1689M), patch 16; 4 intermediate layers tapped.
- Decoder (SDT): softmax-fuse the 4 tapped layers at the patch grid → depthwise detail enhancer → two learned DySample ×4 upsamplers → output conv. GroupNorm, fusion width 256.
- Loss: scale-and-shift-invariant + multi-scale gradient matching (1:2), on disparity.
- Data: the harmonized 369K-image corpus at blanchon/dinodepth-dataset (Hypersim, VKITTI2, BlendedMVS, IRS, TartanAir). 768² input, AdamW lr 1e-3, PolyLR.
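The loss item above can be illustrated with a small NumPy sketch for a single disparity map. It is not the training code from this repo. The function names, the L1 error, the four gradient scales, the omission of a validity mask, and the reading of "1:2" as weight 1 on the scale-and-shift-invariant term and weight 2 on the gradient term are all assumptions.

```python
import numpy as np

def ssi_align(pred, target):
    """Fit scale and shift by least squares so that s * pred + t matches target."""
    x = pred.ravel().astype(np.float64)
    y = target.ravel().astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t

def ssi_loss(pred, target):
    """Mean absolute error after scale-and-shift alignment."""
    return float(np.mean(np.abs(ssi_align(pred, target) - target)))

def gradient_matching_loss(pred, target, scales=4):
    """Sum over scales of the mean absolute x/y gradients of the aligned residual.

    Each scale subsamples the residual by a factor of 2**k, so the input
    should be at least 2**scales pixels on each side.
    """
    residual = ssi_align(pred, target) - target
    total = 0.0
    for k in range(scales):
        step = 2 ** k
        r = residual[::step, ::step]
        grad_x = np.abs(r[:, 1:] - r[:, :-1]).mean()
        grad_y = np.abs(r[1:, :] - r[:-1, :]).mean()
        total += grad_x + grad_y
    return float(total)

def depth_loss(pred, target, w_ssi=1.0, w_grad=2.0):
    """Weighted sum of both terms; the 1:2 weighting is an assumed reading."""
    return w_ssi * ssi_loss(pred, target) + w_grad * gradient_matching_loss(pred, target)
```

The gradient term penalises errors in local structure such as blurred edges, which the per-pixel term alone tends to tolerate.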
References
- AnyDepth (arXiv:2601.02760) · DINOv3 (arXiv:2508.10104)
- Dataset: blanchon/dinodepth-dataset