---
title: "Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment"
emoji: ✨
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.x
app_file: app.py # Replace with your actual Gradio app file name if different
tags:
  - 3d-mapping
  - indoor-reconstruction
  - depth-estimation
  - semantic-segmentation
  - vision-language-model
  - mobile-video
  - gradio
  - point-cloud
  - dpt
  - paligemma
  - computer-vision
  - research-paper
license: mit
---

# Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment

Welcome to the Hugging Face Space for Duino-Idar, a system that transforms mobile video into interactive, semantically enriched 3D room scans. This Space provides access to the research paper, code snippets, and a conceptual demonstration outlining the system's capabilities.

## Abstract

Duino-Idar presents an end-to-end pipeline for creating interactive 3D maps of indoor spaces from mobile video. It leverages state-of-the-art monocular depth estimation using DPT (Dense Prediction Transformer) models and enhances these geometric reconstructions with semantic context via a fine-tuned vision-language model, PaLiGemma. The system processes video by extracting key frames, estimating depth for each frame, constructing a 3D point cloud, and overlaying semantic labels. A user-friendly Gradio interface is designed for video upload, processing initiation, and interactive exploration of the resulting 3D scenes. This research details the system architecture, key mathematical formulations, implementation highlights, and potential applications in areas such as augmented reality, indoor navigation, and automated scene understanding. Future work envisions incorporating LiDAR data for improved accuracy and real-time performance.

## Key Features

* **Mobile Video Input:** Accepts standard mobile-recorded videos, making data capture straightforward and accessible.
* **DPT-Based Depth Estimation:** Employs Dense Prediction Transformer (DPT) models for accurate depth inference from single video frames.
* **PaLiGemma Semantic Enrichment:** Integrates a fine-tuned PaLiGemma vision-language model to provide semantic annotations, enriching the 3D scene with object labels and contextual understanding.
* **Interactive 3D Point Clouds:** Generates 3D point clouds visualized with Open3D (or Plotly for web-based alternatives), allowing users to rotate, zoom, and pan through the reconstructed scene.
* **Gradio Web Interface:** Features a user-friendly Gradio GUI for seamless video upload, processing, and interactive 3D model exploration directly in the browser.
* **Mathematically Grounded Approach:** Based on established principles, including the pinhole camera model for 3D reconstruction and cross-entropy loss for vision-language model fine-tuning.
* **Open-Source Code Snippets:** Key code for depth estimation, 3D reconstruction, and the Gradio interface is provided for transparency and reproducibility.

## System Architecture

Duino-Idar operates through a modular pipeline (detailed in the full paper):

1. **Video Processing & Frame Extraction:** Extracts representative key frames from the input video using OpenCV (a minimal sketch follows below).
2. **Depth Estimation (DPT):** Uses a pre-trained DPT model from Hugging Face Transformers to predict a depth map for each extracted frame.
3. **3D Reconstruction (Pinhole Model):** Converts depth maps into 3D point clouds using a pinhole camera model approximation.
4. **Semantic Enrichment (PaLiGemma):** Employs a fine-tuned PaLiGemma model to generate semantic labels and scene descriptions for key frames.
5. **Interactive Visualization (Open3D/Gradio):** Visualizes the resulting semantically enhanced 3D point cloud within an interactive Gradio interface, using Open3D for rendering.

For a comprehensive understanding of the system architecture, refer to Section 3 of the full research paper linked below.
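
The frame-extraction step is not covered by the snippets later in this README, so here is a minimal sketch of one common approach: uniform sampling with OpenCV. The function name `extract_key_frames` and the sampling interval are illustrative choices, not taken from the paper's implementation.

```python
import cv2

def extract_key_frames(video_path, every_n_frames=30):
    """Uniformly sample frames from a video as RGB NumPy arrays (illustrative sketch)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            # OpenCV decodes frames as BGR; convert to RGB for PIL/Transformers.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames
```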

## Mathematical Foundations

Duino-Idar's core components are underpinned by the following mathematical principles:

**1. Depth Estimation & Normalization:**

The depth map $D$ is predicted by the DPT model $f$:

$D = f(I; \theta)$

Where $I$ is the input image and $\theta$ represents the model parameters. The depth map is then normalized:

$D_{\text{norm}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)}$

And optionally scaled to an 8-bit range for visualization:

$D_{\text{scaled}}(u,v) = D_{\text{norm}}(u,v) \times 255$

**2. 3D Reconstruction with Pinhole Camera Model:**

Using intrinsic camera parameters (focal lengths $f_x, f_y$ and principal point $c_x, c_y$), and depth $z(u,v)$ for a pixel $(u,v)$, the 3D coordinates $(x, y, z)$ are calculated as:

$x = \frac{(u - c_x) \cdot z(u,v)}{f_x}$

$y = \frac{(v - c_y) \cdot z(u,v)}{f_y}$

$z = z(u,v)$

In matrix form:

$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = z(u,v) \cdot K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}$

Where $K$ is the intrinsic matrix:

$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$
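
As a concrete illustration of the matrix form, the back-projection can be vectorised in NumPy. This is a minimal sketch assuming a depth map and known intrinsics; it mirrors, but is not identical to, the per-pixel loop used in the reconstruction snippet later in this README.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Vectorised pinhole back-projection: depth (H, W) -> points (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grids, each (H, W)
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```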

**3. Point Cloud Aggregation:**

Point clouds $P_i$ from individual frames are aggregated into a final point cloud $P$:

$P = \bigcup_{i=1}^{M} P_i$

**4. PaLiGemma Fine-Tuning Loss:**

The PaLiGemma model is fine-tuned to minimize the cross-entropy loss $\mathcal{L}$ for predicting caption tokens $c = (c_1, \ldots, c_T)$ given an image $I$:

$\mathcal{L} = -\sum_{t=1}^{T} \log P(c_t \mid c_{<t}, I)$
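
This is the standard token-level cross-entropy objective for causal language modelling. The short PyTorch sketch below illustrates the formula for a single caption; the tensor shapes and values are illustrative only and do not come from the paper's training code.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: a caption of T tokens over a vocabulary of size V.
T, V = 6, 32000
logits = torch.randn(T, V)            # model scores for c_t given (c_<t, I)
targets = torch.randint(0, V, (T,))   # ground-truth caption token ids

# Sum of -log P(c_t | c_<t, I) over the caption, i.e. the loss above.
loss = F.cross_entropy(logits, targets, reduction="sum")
```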

For a more detailed mathematical treatment, please refer to Section 4.1 of the full research paper.

## Code Snippets

Here are key Python code snippets illustrating the core functionalities of Duino-Idar:

**1. Depth Estimation with DPT (using Hugging Face Transformers):**

```python
import torch
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
from PIL import Image
import numpy as np

dpt_model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")

def estimate_depth(image):
    """Predict a relative depth map for a PIL image and scale it to an 8-bit range."""
    inputs = feature_extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        depth_map = dpt_model(**inputs).predicted_depth.squeeze().numpy()
    depth_map = (depth_map / np.max(depth_map) * 255).astype(np.uint8)  # normalize to [0, 255]
    return depth_map

# Example usage:
image = Image.open("example_frame.jpg")  # Replace with your image path
depth_map = estimate_depth(image)
# depth_map is a 2D NumPy array of relative depth values
```
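
Note that `predicted_depth` is returned at the network's internal resolution, not the size of the input image. When the depth map must be aligned with the original frame, a common follow-up step, assumed here rather than shown in the original snippet, is bicubic upsampling:

```python
import torch

# Recompute inputs as in estimate_depth(), then upsample the raw prediction
# to the original image size (uses dpt_model, feature_extractor and image from above).
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    predicted_depth = dpt_model(**inputs).predicted_depth
depth_full = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),   # (batch, 1, h, w)
    size=image.size[::-1],          # PIL size is (width, height); interpolate expects (h, w)
    mode="bicubic",
    align_corners=False,
).squeeze().numpy()
```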

**2. 3D Point Cloud Reconstruction (using Open3D):**

```python
import open3d as o3d
import numpy as np

def reconstruct_3d(depth_map, image):
    """Back-project an 8-bit depth map into a colored Open3D point cloud."""
    h, w = depth_map.shape
    fx = fy = max(h, w) / 2.0  # approximate intrinsics
    cx, cy = w / 2.0, h / 2.0
    # Resize the color image to the depth-map resolution so pixel indices line up.
    image_np = np.array(image.resize((w, h))) / 255.0
    points = []
    colors = []

    for v in range(h):
        for u in range(w):
            z = depth_map[v, u] / 255.0 * 5.0  # scale depth to a nominal 0-5 m range
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append([x, y, z])
            colors.append(image_np[v, u])

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.array(points))
    pcd.colors = o3d.utility.Vector3dVector(np.array(colors))
    return pcd

# Example usage:
# point_cloud = reconstruct_3d(depth_map, image)
# o3d.io.write_point_cloud("output.ply", point_cloud)
```
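
To realise the aggregation $P = \bigcup_{i=1}^{M} P_i$ from the maths section, per-frame clouds can be merged and thinned. The sketch below assumes the hypothetical `extract_key_frames` helper sketched earlier plus `estimate_depth` and `reconstruct_3d` from this README; the voxel size is an arbitrary illustrative value, and without per-frame camera poses (see SLAM under Future Work) the clouds are simply overlaid in camera coordinates.

```python
import open3d as o3d
from PIL import Image

def build_scene(video_path):
    """Merge per-frame point clouds into a single cloud (illustrative sketch)."""
    scene = o3d.geometry.PointCloud()
    for frame in extract_key_frames(video_path):
        image = Image.fromarray(frame)
        depth_map = estimate_depth(image)
        scene += reconstruct_3d(depth_map, image)  # union of per-frame clouds
    # Thin out duplicate points contributed by overlapping frames.
    return scene.voxel_down_sample(voxel_size=0.05)
```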

**3. Basic Gradio Interface (for demonstration; the full interface is described in the paper):**

```python
import gradio as gr
import open3d as o3d

def visualize_3d_model(ply_file):
    # Opens a native Open3D viewer window (suited to local use rather than a hosted Space).
    pcd = o3d.io.read_point_cloud(ply_file)
    o3d.visualization.draw_geometries([pcd])

with gr.Blocks() as demo:
    gr.Markdown("### Duino-Idar 3D Mapping Demo")
    video_input = gr.Video(label="Upload Video")  # the callback receives the uploaded file's path
    process_btn = gr.Button("Process & Visualize")
    output_path = gr.Textbox(label="Generated point cloud")  # placeholder output component
    # ... (Integration of the processing functions would go here in a full app) ...

    process_btn.click(fn=lambda video: "output.ply", inputs=video_input, outputs=output_path)  # placeholder function

demo.launch()
```
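
In a full application, the placeholder callback would be replaced by a function that chains the earlier snippets together. A minimal sketch, assuming the hypothetical `build_scene` helper above:

```python
import open3d as o3d

def process_video(video_path):
    """Run frame extraction, depth estimation and reconstruction, then save a .ply (sketch)."""
    pcd = build_scene(video_path)
    o3d.io.write_point_cloud("output.ply", pcd)
    return "output.ply"

# process_btn.click(fn=process_video, inputs=video_input, outputs=output_path)
```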

These snippets illustrate the core implementation of Duino-Idar. For the complete implementation and Gradio application, please refer to the code examples in the full research paper (Section 4.3).

## Interactive Demo

[**Conceptual Gradio Demo Space (Under Development)**](https://huggingface.co/spaces/Duino/duino-idar) **(Replace with your actual Space URL once created)**

While a fully interactive demo is currently under development, the linked Space will eventually host a Gradio application allowing you to upload your own mobile videos and visualize the generated 3D point clouds. Please check back for updates!

## Future Work

Future development of Duino-Idar will focus on:

* **Enhanced Semantic Integration:** Implementing robust semantic label overlay directly onto the point cloud geometry.
* **Multi-Frame Fusion & SLAM:** Incorporating SLAM or multi-view stereo techniques for improved reconstruction accuracy and handling of camera motion.
* **LiDAR Integration (Duino-*Idar* Vision):** Exploring the fusion of LiDAR data to complement video-based depth estimation for greater precision and robustness.
* **Real-Time Performance Optimization:** Optimizing the pipeline for real-time or near-real-time 3D mapping on mobile platforms.
* **Advanced User Interface:** Developing a more immersive and feature-rich user interface for interactive 3D scene exploration and manipulation.

## Citation

If you use Duino-Idar in your research, please cite the following:

```bibtex
@misc{duino-idar-2024,
  author       = {Jalal Mansour (Jalal Duino)},
  title        = {Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment},
  year         = {2024},
  publisher    = {Hugging Face Space},
  howpublished = {Online},
  url          = {https://huggingface.co/spaces/Duino/duino-idar}
}
```

## Contact

For inquiries, collaborations, or further information, please contact:

Jalal Mansour (Jalal Duino) - [Jalalmansour663@gmail.com](mailto:Jalalmansour663@gmail.com)

## Full Research Paper

[**Link to Full Research Paper (PDF)**](Link to your full paper PDF here - e.g., Google Drive, Dropbox, personal website) **(Replace with actual link to your paper)**

This README provides a comprehensive overview of the Duino-Idar project. We encourage you to read the full research paper for in-depth details and stay tuned for updates on the interactive demo Space!