import streamlit as st
from streamlit_extras.switch_page_button import switch_page
st.title("Depth Anything")
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1750531698008498431) (January 25, 2024)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""Explaining a new state-of-the-art monocular depth estimation model: Depth Anything ✨🧢
It has just been integrated in transformers for super-easy use.
We compared it against DPTs and benchmarked it as well! You can find the usage, benchmark, demos and more below 👇
""")
st.markdown(""" """)
st.video("pages/Depth Anything/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.markdown("""
The paper starts by highlighting previous depth estimation methods and their limitations in data coverage. 👀
The model's success heavily depends on unlocking the use of unlabeled datasets, although the authors' initial attempt at self-training failed.
What the authors did at first:
➰ Train a teacher model on a labeled dataset
➰ Guide the student with the teacher, and also train it on unlabeled datasets pseudo-labeled by the teacher. However, this setup was the cause of the failure: because both architectures were similar, their outputs were the same.
""")
st.markdown(""" """)
st.image("pages/Depth Anything/image_1.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
So the authors added a more difficult optimization target that pushes the student to learn additional knowledge from unlabeled images: these images go through color jittering, Gaussian blurring and spatial distortion, so the student learns more invariant representations from them.
The architecture consists of a <a href='DINOv2' target='_self'>DINOv2</a> encoder to extract the features, followed by a DPT decoder. First, they train the teacher model on labeled images; then they jointly train the student model on the labeled data plus the dataset pseudo-labeled by the ViT-L teacher.
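
To make the idea concrete, here is a minimal NumPy sketch of the student objective on one unlabeled image. It is not the authors' implementation: the `teacher` and `student` callables, the jitter `strength`, the box blur standing in for Gaussian blurring, and the plain L1 loss are all assumptions made for illustration, and the spatial distortion is left out.

```python
import numpy as np

def color_jitter(image, rng, strength=0.4):
    # Randomly rescale brightness and contrast (strength is a hypothetical value).
    brightness = 1.0 + rng.uniform(-strength, strength)
    contrast = 1.0 + rng.uniform(-strength, strength)
    mean = image.mean()
    return np.clip((image - mean) * contrast + mean * brightness, 0.0, 1.0)

def box_blur(image, k=3):
    # Cheap stand-in for Gaussian blurring: average over a k x k window.
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    height, width = image.shape[:2]
    out = np.zeros_like(image)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + height, dx:dx + width]
    return out / (k * k)

def student_loss_on_unlabeled(image, teacher, student, rng):
    # The teacher pseudo-labels the clean image ...
    pseudo_label = teacher(image)
    # ... while the student only sees a strongly perturbed version of it.
    perturbed = box_blur(color_jitter(image, rng))
    prediction = student(perturbed)
    # Plain L1 distance, used here as a simplification.
    return float(np.abs(prediction - pseudo_label).mean())

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))              # float image in [0, 1], H x W x 3
toy_model = lambda x: x.mean(axis=-1)      # toy depth model: grayscale intensity
loss = student_loss_on_unlabeled(image, toy_model, toy_model, rng)
```

Even when the student and the teacher are the same function, the loss is generally non-zero because their inputs differ, which is what makes the target harder than copying the teacher.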
""", unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/Depth Anything/image_1.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""Thanks to this, Depth Anything performs very well! I have also benchmarked the model's inference time against other depth estimation models, and ran `torch.compile` benchmarks across them, which gave nice speed-ups 🚀

On a T4 GPU, mean of 30 inferences for each model. Inference was run with `pipeline`, so pre-processing and post-processing are included in the timings. All timings are in milliseconds (lower is better).

| Model / Batch size | 16 | 4 | 1 |
| ----------------------------- | --------- | -------- | ------- |
| Intel/dpt-large | 2709.652 | 667.799 | 172.617 |
| facebook/dpt-dinov2-small-nyu | 2534.854 | 654.822 | 159.754 |
| facebook/dpt-dinov2-base-nyu | 4316.8733 | 1090.824 | 266.699 |
| Intel/dpt-beit-large-512 | 7961.386 | 2036.743 | 497.656 |
| depth-anything-small | 1692.368 | 415.915 | 143.379 |

`torch.compile` benchmarks with `reduce-overhead` mode: we compiled each model and loaded it into the pipeline so that the benchmarks are fair.

| Model / Batch size | 16 | 4 | 1 |
| ----------------------------- | -------- | -------- | ------- |
| Intel/dpt-large | 2556.668 | 645.750 | 155.153 |
| facebook/dpt-dinov2-small-nyu | 2415.25 | 610.967 | 148.526 |
| facebook/dpt-dinov2-base-nyu | 4057.909 | 1035.672 | 245.692 |
| Intel/dpt-beit-large-512 | 7417.388 | 1795.882 | 426.546 |
| depth-anything-small | 1664.025 | 384.688 | 97.865 |
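
If you want to reproduce this kind of measurement, a timing loop along these lines should be enough. This is a sketch and not the exact script used for the tables: `mean_latency_ms`, `speedup` and the number of warm-up calls are assumptions, and `fn` stands for any zero-argument callable that runs one batch through your pipeline.

```python
import statistics
import time

def mean_latency_ms(fn, n_runs=30, n_warmup=3):
    # Warm-up calls are not timed (n_warmup is an assumption, not from the post).
    for _ in range(n_warmup):
        fn()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)

def speedup(eager_ms, compiled_ms):
    # Ratio above 1.0 means the compiled model is faster.
    return eager_ms / compiled_ms

# depth-anything-small at batch size 1, numbers taken from the two tables above
print(round(speedup(143.379, 97.865), 2))  # → 1.47
```

Warm-up matters most for the compiled models, since the first calls include compilation time and would otherwise skew the mean.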
""")
st.markdown(""" """)
st.image("pages/Depth Anything/image_2.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
You can use Depth Anything easily thanks to 🤗 Transformers with three lines of code! ✨
We have also built an app for you to [compare different depth estimation models](https://t.co/6uq4osdwWG) 🐝 🌸
See all the available Depth Anything checkpoints [here](https://t.co/Ex0IIyx7XC).
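
As a rough sketch of that usage, the snippet below wraps the depth estimation `pipeline` in a small helper. The helper name `estimate_depth` is made up for this example, and the checkpoint is assumed to be one of the converted checkpoints on the Hub; swap in whichever checkpoint you prefer.

```python
CHECKPOINT = "LiheYoung/depth-anything-small-hf"  # assumed checkpoint name

def estimate_depth(image, checkpoint=CHECKPOINT):
    # Imported inside the function so the snippet loads even without transformers installed.
    from transformers import pipeline

    # `image` can be a PIL image, a local path or a URL.
    pipe = pipeline(task="depth-estimation", model=checkpoint)
    # The pipeline output is a dict; "depth" holds the depth map as a PIL image.
    return pipe(image)["depth"]
```

Calling `estimate_depth("image.jpg").save("depth.png")` would then write the predicted depth map to disk, assuming `image.jpg` exists and the checkpoint can be downloaded.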
""")
st.markdown(""" """)
st.image("pages/Depth Anything/image_3.jpg", use_column_width=True)
st.markdown(""" """)
st.info("""
Resources:
[Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891)
by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao (2024)
[GitHub](https://github.com/LiheYoung/Depth-Anything)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/depth_anything)""", icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Backbone")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("LLaVA-NeXT")