bconsolvo Chesebrough committed on
Commit
cea2081
1 Parent(s): 29da776

Update Readme (#3)


- Update Readme (60ae5f3828c8d7f854f963a35db40a2bc45dfe65)


Co-authored-by: bob chesebrough <Chesebrough@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +123 -7
README.md CHANGED
@@ -1,21 +1,93 @@
---
license: mit
---

- # DPT 3.1 (BEiT backbone)

- DPT (Dense Prediction Transformer) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. (2021) and first released in [this repository](https://github.com/isl-org/MiDaS/tree/master).
-
- Disclaimer: The team releasing DPT did not write a model card for this model so this model card has been written by the Hugging Face team.

## Model description

This DPT model uses the [BEiT](https://huggingface.co/docs/transformers/model_doc/beit) model as backbone and adds a neck + head on top for monocular depth estimation.
-
![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.jpg)

## How to use

Here is how to use this model for zero-shot depth estimation on an image:

```python
@@ -25,8 +97,12 @@ import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-beit-large-384")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-beit-large-384")
@@ -50,6 +126,7 @@ prediction = torch.nn.functional.interpolate(
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
```

or one can use the pipeline API:
@@ -58,6 +135,45 @@ or one can use the pipeline API:
from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="Intel/dpt-beit-large-384")
- result = pipe("http://images.cocodataset.org/val2017/000000039769.jpg")
result["depth"]
- ```

---
license: mit
+ tags:
+ - vision
+ - depth-estimation
+
+ model-index:
+ - name: dpt-beit-large-384
+   results:
+   - task:
+       type: monocular-depth-estimation
+       name: Monocular Depth Estimation
+     dataset:
+       type: MIX-6
+       name: MIX-6
+     metrics:
+     - type: Zero-shot transfer
+       value: 10.82
+       name: Zero-shot transfer
+       config: Zero-shot transfer
+       verified: false
---
+ # Overview of Monocular depth estimation and BEiT
+ The Intel/dpt-beit-large-384 model is based on monocular depth estimation with the BEiT backbone. Monocular depth estimation, which aims to infer detailed depth from a single image or camera view, finds applications in fields like generative AI, 3D reconstruction, and autonomous driving. However, deriving depth from individual pixels in a single image is challenging because the problem is underconstrained. Recent progress is largely attributed to learning-based methods, in particular MiDaS, which leverages dataset mixing and a scale-and-shift-invariant loss. MiDaS has evolved through releases featuring more powerful backbones and lightweight variants for mobile use. With the rise of transformer architectures in computer vision, pioneered by models such as ViT, there has been a shift towards using them for depth estimation. Inspired by this, MiDaS v3.1 incorporates promising transformer-based encoders alongside traditional convolutional ones, aiming for a comprehensive investigation of depth estimation techniques. The MiDaS v3.1 paper describes the integration of these backbones into MiDaS, provides a thorough comparison of the different v3.1 models, and offers guidance on using future backbones with MiDaS.

+ | Input Image | Output Depth Image |
+ | --- | --- |
+ | ![input image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/PDwRwuryaO3YtuyRjraiM.jpeg) | ![Depth image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/ugqri6LcqJBuU9zI9aeqN.jpeg) |

## Model description

This DPT model uses the [BEiT](https://huggingface.co/docs/transformers/model_doc/beit) model as backbone and adds a neck + head on top for monocular depth estimation.
![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.jpg)

+ While the previous release, MiDaS v3.0, solely leverages the vanilla vision transformer ViT, MiDaS v3.1 offers additional models based on BEiT, Swin, SwinV2, Next-ViT and LeViT.
+
+ # DPT 3.1 (BEiT backbone): focus on BEiT384-L
+
+ The highest quality depth estimation is achieved using the BEiT transformer. We provide variants such as BEiT512-L, BEiT384-L, and BEiT384-B, where the numbers signify training resolutions of 512x512 and 384x384, while the letters denote large and base models respectively. Although newer versions like BEiT v2 and BEiT-3 exist, they were not explored in our study: BEiT v2 lacked pretrained checkpoints at resolutions of 384x384 or higher, offering them only at 224x224, and BEiT-3 was released after our study was completed.
+
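+ As an illustration of how these variant names map onto Hugging Face checkpoints, here is a minimal sketch; it assumes the companion checkpoints `Intel/dpt-beit-large-512` and `Intel/dpt-beit-base-384` are published on the Hub alongside this one, so verify the ids before relying on them:
+
+ ```python
+ from transformers import DPTForDepthEstimation
+
+ # Assumed mapping from the paper's variant names to Hub checkpoint ids.
+ variants = {
+     "BEiT512-L": "Intel/dpt-beit-large-512",  # large model, 512x512 training resolution
+     "BEiT384-L": "Intel/dpt-beit-large-384",  # large model, 384x384 (this model card)
+     "BEiT384-B": "Intel/dpt-beit-base-384",   # base model, 384x384
+ }
+
+ model = DPTForDepthEstimation.from_pretrained(variants["BEiT384-L"])
+ ```
+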
+ The DPT (Dense Prediction Transformer) model was trained on 1.4 million images for monocular depth estimation. It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. (2021) and first released in [this repository](https://github.com/isl-org/MiDaS/tree/master).
+
+ This model card refers specifically to the **BEiT384-L** variant in the paper, released here as dpt-beit-large-384. A more recent paper from 2023 that specifically discusses BEiT is [MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation](https://arxiv.org/pdf/2307.14460.pdf).
+
+ This model card was written jointly by the Hugging Face team and Intel.
+
+ | Model Detail | Description |
+ | ----------- | ----------- |
+ | Model Authors - Company | Intel |
+ | Date | March 7, 2024 |
+ | Version | 1 |
+ | Type | Computer Vision - Monocular Depth Estimation |
+ | Paper or Other Resources | [MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation](https://arxiv.org/pdf/2307.14460.pdf) and [GitHub Repo](https://github.com/isl-org/MiDaS/blob/master/README.md) |
+ | License | MIT |
+ | Questions or Comments | [Community Tab](https://huggingface.co/Intel/dpt-beit-large-384/discussions) and [Intel Developers Discord](https://discord.gg/rv2Gp55UJQ) |
+
+ | Intended Use | Description |
+ | ----------- | ----------- |
+ | Primary intended uses | You can use the raw model for zero-shot monocular depth estimation. See the [model hub](https://huggingface.co/models?search=dpt-beit-large) to look for fine-tuned versions on a task that interests you. |
+ | Primary intended users | Anyone doing monocular depth estimation |
+ | Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
+
## How to use

+ Be sure to update PyTorch and Transformers, as version mismatches can generate errors such as: "TypeError: unsupported operand type(s) for //: 'NoneType' and 'NoneType'".
+
+ As tested by this contributor, the following versions ran correctly:
+ ```python
+ import torch
+ import transformers
+ print(torch.__version__)
+ print(transformers.__version__)
+ ```
+ ```bash
+ out: '2.2.1+cpu'
+ out: '4.37.2'
+ ```
+
+ ### To Install:
+
+ ```bash
+ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
+ ```
+
+ ### To Use:
Here is how to use this model for zero-shot depth estimation on an image:

```python
 
from PIL import Image
import requests

+ # Retrieve the example image from a remote URL
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
+ # Or load the image from a local file instead
+ # path = "../image/000000039769.jpg"
+ # image = Image.open(path)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-beit-large-384")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-beit-large-384")

output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
+ # Display the depth map (e.g. in a notebook)
+ depth
```
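+
+ To keep the visualized result, the PIL image created above can be written to disk; a small optional follow-up (not part of the original card):
+
+ ```python
+ # Save the 8-bit depth visualization produced above as a PNG file.
+ depth.save("depth_visualization.png")
+ ```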

or one can use the pipeline API:

from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="Intel/dpt-beit-large-384")
+ result = pipe("http://images.cocodataset.org/val2017/000000181816.jpg")
result["depth"]
+ ```
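+
+ As a usage note (assuming the standard output format of the transformers depth-estimation pipeline), `result` is a dictionary whose `"depth"` entry is a PIL image and whose `"predicted_depth"` entry is the raw tensor; a minimal sketch:
+
+ ```python
+ # Save the rendered depth image and inspect the raw prediction tensor.
+ result["depth"].save("pipeline_depth.png")
+ print(result["predicted_depth"].shape)
+ ```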
+
+ ## Quantitative Analyses
+ | Model | Square Resolution HRWSI RMSE | Square Resolution Blended MVS REL | Square Resolution ReDWeb RMSE |
+ | --- | --- | --- | --- |
+ | BEiT 384-L | 0.068 | 0.070 | 0.076 |
+ | Swin-L Training 1 | 0.0708 | 0.0724 | 0.0826 |
+ | Swin-L Training 2 | 0.0713 | 0.0720 | 0.0831 |
+ | ViT-L | 0.071 | 0.072 | 0.082 |
+ | --- | --- | --- | --- |
+ | Next-ViT-L-1K-6M | 0.075 | 0.073 | 0.085 |
+ | DeiT3-L-22K-1K | 0.070 | 0.070 | 0.080 |
+ | ViT-L-Hybrid | 0.075 | 0.075 | 0.085 |
+ | DeiT3-L | 0.077 | 0.075 | 0.087 |
+ | --- | --- | --- | --- |
+ | ConvNeXt-XL | 0.075 | 0.075 | 0.085 |
+ | ConvNeXt-L | 0.076 | 0.076 | 0.087 |
+ | EfficientNet-L2 | 0.165 | 0.277 | 0.219 |
+ | --- | --- | --- | --- |
+ | ViT-L Reversed | 0.071 | 0.073 | 0.081 |
+ | Swin-L Equidistant | 0.072 | 0.074 | 0.083 |
+ | --- | --- | --- | --- |
+
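+ The RMSE and REL columns are the usual root-mean-square and absolute relative depth errors; a minimal sketch of how such errors could be computed from aligned predicted and ground-truth depth maps (illustrative only, not the evaluation code used for the table above):
+
+ ```python
+ import numpy as np
+
+ def rmse(pred: np.ndarray, gt: np.ndarray) -> float:
+     # Root-mean-square error between predicted and ground-truth depth.
+     return float(np.sqrt(np.mean((pred - gt) ** 2)))
+
+ def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
+     # Mean absolute relative error, normalized by ground-truth depth.
+     return float(np.mean(np.abs(pred - gt) / gt))
+ ```
+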
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @article{DBLP:journals/corr/abs-2307-14460,
+   author     = {Reiner Birkl and Diana Wofk and Matthias M{\"u}ller},
+   title      = {MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation},
+   journal    = {CoRR},
+   volume     = {abs/2307.14460},
+   year       = {2023},
+   url        = {https://arxiv.org/abs/2307.14460},
+   eprinttype = {arXiv},
+   eprint     = {2307.14460},
+   timestamp  = {Wed, 26 Jul 2023},
+   biburl     = {https://dblp.org/rec/journals/corr/abs-2307-14460.bib},
+   bibsource  = {dblp computer science bibliography, https://dblp.org}
+ }
+ ```