Spaces: Running on Zero
Commit · 73b0ef2
Parent(s): e982865
format

README.md CHANGED
@@ -1,177 +1,14 @@
## Overview

Visual Geometry Grounded Transformer (VGGT, CVPR 2025) is a feed-forward neural network that directly infers all key 3D attributes of a scene, including extrinsic and intrinsic camera parameters, point maps, depth maps, and 3D point tracks, **from one, a few, or hundreds of its views**, within seconds.

## Quick Start

First, clone this repository to your local machine, and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub).

```bash
git clone git@github.com:facebookresearch/vggt.git
cd vggt
pip install -r requirements.txt
```

Now, try the model with just a few lines of code:

```python
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16  # or torch.float16

# Initialize the model and load the pretrained weights.
# This will automatically download the model weights the first time it's run, which may take a while.
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# Load and preprocess example images (replace with your own image paths)
image_names = ["examples/kitchen/images/00.png", "examples/kitchen/images/01.png", "examples/kitchen/images/02.png"]
images = load_and_preprocess_images(image_names).to(device)

# Use both context managers together so gradients are disabled *and* mixed precision is active.
with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
    # Predict attributes including cameras, depth maps, and point maps.
    predictions = model(images)
```
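
The snippet above does not show what `predictions` contains. Assuming the forward pass returns a dictionary of named tensors (an assumption on our part, not stated in the snippet), a quick way to see what was produced is to print each output's shape:

```python
# Hypothetical inspection snippet: assumes `predictions` behaves like a dict of tensors.
for name, value in predictions.items():
    if torch.is_tensor(value):
        print(f"{name}: shape={tuple(value.shape)}, dtype={value.dtype}")
    else:
        print(f"{name}: {type(value).__name__}")
```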

The model weights will be automatically downloaded from Hugging Face. If you encounter issues such as slow loading, you can manually download them [here](https://huggingface.co/facebook/VGGT-1B/blob/main/model.pt) and load them yourself, or:

```python
model = VGGT()
_URL = "https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt"
model.load_state_dict(torch.hub.load_state_dict_from_url(_URL))
```
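
If you downloaded `model.pt` manually, a minimal sketch for loading it from disk looks like the following. It assumes the file is a plain state dict (as the URL-based example above suggests) and uses a placeholder local path; `device` comes from the Quick Start.

```python
import torch
from vggt.models.vggt import VGGT  # already imported in the Quick Start

model = VGGT()
# "checkpoints/model.pt" is a placeholder path for the manually downloaded file.
state_dict = torch.load("checkpoints/model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model = model.to(device)
```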

## Detailed Usage

You can also choose which attributes (branches) to predict, as shown below. This achieves the same result as the example above. The example uses a batch size of 1 (a single scene), but it naturally extends to multiple scenes.

```python
from vggt.utils.pose_enc import pose_encoding_to_extri_intri
from vggt.utils.geometry import unproject_depth_map_to_point_map

with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]  # add batch dimension
        aggregated_tokens_list, ps_idx = model.aggregator(images)

    # Predict Cameras
    pose_enc = model.camera_head(aggregated_tokens_list)[-1]
    # Extrinsic and intrinsic matrices, following OpenCV convention (camera from world)
    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])

    # Predict Depth Maps
    depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)

    # Predict Point Maps
    point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)

    # Construct 3D Points from Depth Maps and Cameras
    # which usually leads to more accurate 3D points than the point map branch
    point_map_by_unprojection = unproject_depth_map_to_point_map(depth_map.squeeze(0),
                                                                 extrinsic.squeeze(0),
                                                                 intrinsic.squeeze(0))

    # Predict Tracks
    # choose your own points to track, with shape (N, 2) for one scene
    query_points = torch.FloatTensor([[100.0, 200.0],
                                      [60.72, 259.94]]).to(device)
    track_list, vis_score, conf_score = model.track_head(aggregated_tokens_list, images, ps_idx, query_points=query_points[None])
```
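
Because the extrinsics follow the OpenCV camera-from-world convention, each camera's center in world coordinates can be recovered as C = -Rᵀt. This is our own small sketch, not part of the API above; it only assumes `extrinsic` holds [R | t] matrices in its last two dimensions:

```python
# Minimal sketch (not part of the VGGT API): recover camera centers in world coordinates.
# Assumes `extrinsic` is a (..., 3, 4) tensor of [R | t] matrices in the OpenCV
# camera-from-world convention, i.e. x_cam = R @ x_world + t, so C = -R^T t.
R = extrinsic[..., :3, :3]   # (..., 3, 3)
t = extrinsic[..., :3, 3:]   # (..., 3, 1)
camera_centers = (-R.transpose(-1, -2) @ t).squeeze(-1)  # (..., 3)
print(camera_centers.shape)
```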

## Visualization

We provide multiple ways to visualize your 3D reconstructions and tracking results. Before using these visualization tools, install the required dependencies:

```bash
pip install -r requirements_demo.txt
```

### Interactive 3D Visualization

#### Gradio Web Interface

Our Gradio-based interface allows you to upload images/videos, run reconstruction, and interactively explore the 3D scene in your browser:

```bash
python demo_gradio.py
```

#### Viser 3D Viewer

Run the following command to reconstruct a scene and visualize the point clouds in viser. The script requires a path to a folder that contains only image files. You can set `--use_point_map` to visualize the point cloud from the point map branch instead of the depth-based point cloud, as shown after the command below.

```bash
python demo_viser.py --image_folder path/to/your/images/folder
```
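
For example, to switch to the point-map branch (assuming the flag is a simple switch, as the description above suggests):

```bash
python demo_viser.py --image_folder path/to/your/images/folder --use_point_map
```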

### Track Visualization

To visualize point tracks across multiple images:

```python
from vggt.utils.visual_track import visualize_tracks_on_images

track = track_list[-1]
visualize_tracks_on_images(images, track, vis_score > 0.5, out_dir="track_visuals")
```

This plots the tracks on the images and saves them to the specified output directory.

## Single-view Reconstruction

Our model shows surprisingly good performance on single-view reconstruction, although it was never trained for this task. The model does not need to duplicate the single-view image into a pair; instead, it can directly infer the 3D structure from the tokens of the single view. Feel free to try it with our demos above, which naturally work for single-view reconstruction.
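
For instance, the Quick Start pipeline can simply be pointed at a single image; the sketch below is the earlier snippet with a one-element list, reusing the example image path, `model`, `device`, and `dtype` from above:

```python
# Single-view reconstruction: same pipeline as the Quick Start, with one image.
single_image = load_and_preprocess_images(["examples/kitchen/images/00.png"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
    single_view_predictions = model(single_image)
```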

## Runtime and GPU Memory

We benchmark the runtime and GPU memory usage of VGGT's aggregator on a single NVIDIA H100 GPU across various input sizes.

| **Input Frames** | 1 | 2 | 4 | 8 | 10 | 20 | 50 | 100 | 200 |
|:----------------:|:-:|:-:|:-:|:-:|:--:|:--:|:--:|:---:|:---:|
| **Time (s)**     | 0.04 | 0.05 | 0.07 | 0.11 | 0.14 | 0.31 | 1.04 | 3.12 | 8.75 |
| **Memory (GB)**  | 1.88 | 2.07 | 2.45 | 3.23 | 3.63 | 5.58 | 11.41 | 21.15 | 40.63 |

Note that these results were obtained using Flash Attention 3, which is faster than the default Flash Attention 2 implementation while maintaining almost the same memory usage. Feel free to compile Flash Attention 3 from source to get better performance.
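
To get a rough feel for these numbers on your own GPU, a minimal timing sketch for the aggregator could look like the following. This is our own illustration, not the script used for the table; it assumes `model`, `device`, and `dtype` from the Quick Start, a CUDA device, and a placeholder 518x392 input resolution that you should match to whatever your preprocessing actually produces.

```python
import time
import torch

# Rough benchmark sketch (our illustration, not the official benchmarking script).
# Random pixel values are fine for timing; only the tensor shape matters here.
num_frames = 8
frames = torch.randn(1, num_frames, 3, 392, 518, device=device)  # (B, S, 3, H, W), assumed layout

with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
    model.aggregator(frames)  # warm-up run (triggers lazy CUDA initialization)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    model.aggregator(frames)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{num_frames} frames: {elapsed:.2f} s, "
      f"peak memory {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```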

## Checklist

- [ ] Release the training code
- [ ] Release VGGT-500M and VGGT-200M

## License

See the [LICENSE](./LICENSE.txt) file for details about the license under which this code is made available.

## Citation

If you find our repository useful, please consider giving it a star ⭐ and citing our paper in your work:

```bibtex
@inproceedings{wang2025vggt,
  title={VGGT: Visual Geometry Grounded Transformer},
  author={Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```

---
title: vggt
emoji: 🏆
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.17.1
app_file: app.py
pinned: false
license: cc-by-nc-4.0
short_description: vggt (alpha test)
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference