merve posted an update Jan 31
Explaining a new state-of-the-art monocular depth estimation model: Depth Anything ✨ 🧢
Before we begin: Depth Anything was recently integrated into 🤗 transformers, and you can use it with three lines of code! ✨
from transformers import pipeline
from PIL import Image

# any PIL image works as input; the path below is just an example
image = Image.open("example.jpg")
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
depth = pipe(image)["depth"]

We have also built an app for you to compare different depth estimation models 🐝 🌸 merve/compare_depth_models
Check out Depth Anything on the Web by @Xenova: Xenova/depth-anything-web

The model's success heavily depends on unlocking the use of unlabeled datasets, but the authors' initial attempt at plain self-training failed.
What the authors have done:
➰ Train a teacher model on the labeled dataset
➰ Guide the student with the teacher, and also train it on unlabeled images pseudo-labeled by the teacher
However, this was exactly the cause of the failure: since both architectures were similar, the student simply reproduced the teacher's outputs and learned nothing new.
So the authors added a more difficult optimization target for the student: it has to learn additional knowledge from unlabeled images that went through color jittering, distortions, Gaussian blurring, and spatial distortion, which pushes it to learn more invariant representations from them.
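To make this concrete, here is a minimal sketch of that perturbed self-training step. The teacher, student, loss, and augmentation parameters below are illustrative assumptions, not the paper's exact setup: the teacher pseudo-labels the clean unlabeled image, while the student must predict depth from a strongly perturbed view of it.

import torch
from torchvision import transforms

# Strong perturbations for the student's view of unlabeled images.
# Exact parameters are assumptions; the spatial distortion mentioned above is omitted here.
strong_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),
])

def self_training_step(teacher, student, unlabeled_image, criterion):
    # Teacher pseudo-labels the clean image; no gradients flow through it.
    with torch.no_grad():
        pseudo_depth = teacher(unlabeled_image)
    # Student sees a harder, perturbed view but is supervised by the clean pseudo-label.
    student_pred = student(strong_augment(unlabeled_image))
    return criterion(student_pred, pseudo_depth)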
The architecture consists of a DINOv2 encoder to extract features, followed by a DPT decoder. The teacher model is first trained on labeled images; then the student is jointly trained on the labeled set together with the dataset pseudo-labeled by the ViT-L teacher.
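If you prefer the lower-level transformers API over the pipeline, the sketch below loads the same small checkpoint with the standard auto classes and resizes the predicted depth map back to the input resolution; the image path is illustrative.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

processor = AutoImageProcessor.from_pretrained("LiheYoung/depth-anything-small-hf")
model = AutoModelForDepthEstimation.from_pretrained("LiheYoung/depth-anything-small-hf")

image = Image.open("example.jpg")  # illustrative input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Relative depth map, resized back to the original image size (PIL size is (W, H)).
depth = torch.nn.functional.interpolate(
    outputs.predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()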
Thanks to this, Depth Anything performs very well! I have also benchmarked the model's inference duration against different models here. I also ran torch.compile benchmarks across them and got nice speed-ups 🚀 https://huggingface2.notion.site/DPT-Benchmarks-1e516b0ba193460e865c47b3a5681efb?pvs=4
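For reference, here is a rough sketch of how such a torch.compile comparison could be set up; the timing loop, input image, and run counts are assumptions, not the exact script behind the numbers in the link.

import time
import torch
from transformers import pipeline

def benchmark(pipe, image, n_runs=20):
    # Warm up, then average wall-clock inference time over several runs.
    for _ in range(3):
        pipe(image)
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(image)
    return (time.perf_counter() - start) / n_runs

pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
eager_time = benchmark(pipe, "example.jpg")  # illustrative input image

# Compile the underlying model and benchmark again.
pipe.model = torch.compile(pipe.model)
compiled_time = benchmark(pipe, "example.jpg")
print(f"eager: {eager_time * 1e3:.1f} ms, compiled: {compiled_time * 1e3:.1f} ms")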