Unlocking Dense Metric Depth Estimation in VLMs
Paper: arXiv:2605.15876
How to use JonnyYu828/DepthVLM-4B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("depth-estimation", model="JonnyYu828/DepthVLM-4B")

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("JonnyYu828/DepthVLM-4B")
model = AutoModelForImageTextToText.from_pretrained("JonnyYu828/DepthVLM-4B")

Update 2026-05-18 (v1.0): Initial release
DepthVLM serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.
By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability.
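To make the idea of a lightweight depth head concrete, here is a minimal NumPy sketch of the shape bookkeeping involved. It is not the DepthVLM implementation. The function name `depth_head`, the single linear projection, the softplus activation, the patch size, and the row-major ordering of vision tokens are all assumptions chosen for illustration; the actual head design is described in the paper and the official repository.

```python
import numpy as np


def depth_head(vision_tokens, weight, bias, grid_h, grid_w, patch):
    """Hypothetical head: turn per-patch hidden states into a dense depth map.

    vision_tokens: (grid_h * grid_w, hidden_dim) hidden states, one per image
                   patch, assumed to be in row-major patch order.
    weight:        (hidden_dim, patch * patch) projection matrix.
    bias:          (patch * patch,) projection bias.
    Returns an array of shape (grid_h * patch, grid_w * patch).
    """
    # One linear layer predicts a patch x patch block of values per token.
    logits = vision_tokens @ weight + bias            # (N, patch * patch)
    # Softplus, log(1 + exp(x)), keeps every predicted depth strictly positive.
    depth = np.logaddexp(0.0, logits)
    # Lay the per-token blocks back out on the image grid.
    depth = depth.reshape(grid_h, grid_w, patch, patch)
    depth = depth.transpose(0, 2, 1, 3)               # (grid_h, patch, grid_w, patch)
    return depth.reshape(grid_h * patch, grid_w * patch)


# Toy usage with random stand-ins for backbone hidden states.
rng = np.random.default_rng(0)
grid_h, grid_w, patch, hidden_dim = 2, 3, 4, 8
tokens = rng.normal(size=(grid_h * grid_w, hidden_dim))
weight = rng.normal(size=(hidden_dim, patch * patch))
bias = np.zeros(patch * patch)
depth_map = depth_head(tokens, weight, bias, grid_h, grid_w, patch)
print(depth_map.shape)
```

The point of the sketch is that the head itself can be very small: the backbone already produces one hidden state per image patch, so a dense map only requires a projection and a reshape on top of those states.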
Please refer to the official repository for detailed instructions on:
If you find this work useful, please cite:
@article{yu2026unlocking,
title={Unlocking Dense Metric Depth Estimation in VLMs},
author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
journal={arXiv preprint arXiv:2605.15876},
year={2026}
}
Base model
Qwen/Qwen3-VL-4B-Instruct