JianyuanWang committed
Commit 73b0ef2 · 1 Parent(s): e982865

Files changed (1):
  1. README.md +14 -177

README.md CHANGED
@@ -1,177 +1,14 @@
<div align="center">
<h1>VGGT: Visual Geometry Grounded Transformer</h1>

<a href=""><img src='https://img.shields.io/badge/arXiv-VGGT' alt='Paper PDF'></a>
<a href=''><img src='https://img.shields.io/badge/Project_Page-green' alt='Project Page'></a>
<a href='https://huggingface.co/spaces/facebook/vggt'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a>

**[Meta AI Research, GenAI](https://ai.facebook.com/research/)**; **[University of Oxford, VGG](https://www.robots.ox.ac.uk/~vgg/)**

[Jianyuan Wang](https://jytime.github.io/), [Minghao Chen](https://silent-chen.github.io/), [Nikita Karaev](https://nikitakaraevv.github.io/), [Andrea Vedaldi](https://www.robots.ox.ac.uk/~vedaldi/), [Christian Rupprecht](https://chrirupp.github.io/), [David Novotny](https://d-novotny.github.io/)
</div>

## Overview

Visual Geometry Grounded Transformer (VGGT, CVPR 2025) is a feed-forward neural network that directly infers all key 3D attributes of a scene, including extrinsic and intrinsic camera parameters, point maps, depth maps, and 3D point tracks, **from one, a few, or hundreds of its views**, within seconds.

## Quick Start

First, clone this repository to your local machine and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub).

```bash
git clone git@github.com:facebookresearch/vggt.git
cd vggt
pip install -r requirements.txt
```

Now, try the model with just a few lines of code:

```python
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16  # or torch.float16

# Initialize the model and load the pretrained weights.
# This will automatically download the model weights the first time it's run, which may take a while.
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# Load and preprocess example images (replace with your own image paths)
image_names = ["examples/kitchen/images/00.png", "examples/kitchen/images/01.png", "examples/kitchen/images/02.png"]
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
    # Predict attributes including cameras, depth maps, and point maps.
    predictions = model(images)
```
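
If you are unsure which dtype to pick, one common heuristic (an optional sketch, not something the API requires) is to choose it from the GPU's compute capability, since bfloat16 is natively supported on Ampere (compute capability 8.0) and newer GPUs:

```python
# Optional: use bfloat16 on Ampere (compute capability >= 8.0) and newer GPUs,
# and fall back to float16 on older ones.
if device == "cuda":
    dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
```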

The model weights will be automatically downloaded from Hugging Face. If you encounter issues such as slow loading, you can manually download them [here](https://huggingface.co/facebook/VGGT-1B/blob/main/model.pt) and load them yourself, or:

```python
model = VGGT()
_URL = "https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt"
model.load_state_dict(torch.hub.load_state_dict_from_url(_URL))
```
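
If you downloaded `model.pt` manually, a minimal sketch for loading it from disk (the path below is simply wherever you saved the file):

```python
import torch

model = VGGT()
# Load the manually downloaded checkpoint from a local path instead of the URL.
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model = model.to(device)
```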

## Detailed Usage

You can also optionally choose which attributes (branches) to predict, as shown below. This achieves the same result as the example above. The example uses a batch size of 1 (a single scene), but it naturally extends to multiple scenes.

```python
from vggt.utils.pose_enc import pose_encoding_to_extri_intri
from vggt.utils.geometry import unproject_depth_map_to_point_map

with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]  # add batch dimension
        aggregated_tokens_list, ps_idx = model.aggregator(images)

    # Predict Cameras
    pose_enc = model.camera_head(aggregated_tokens_list)[-1]
    # Extrinsic and intrinsic matrices, following OpenCV convention (camera from world)
    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])

    # Predict Depth Maps
    depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)

    # Predict Point Maps
    point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)

    # Construct 3D points from depth maps and cameras,
    # which usually leads to more accurate 3D points than the point map branch
    point_map_by_unprojection = unproject_depth_map_to_point_map(depth_map.squeeze(0),
                                                                 extrinsic.squeeze(0),
                                                                 intrinsic.squeeze(0))

    # Predict Tracks
    # (choose your own points to track, with shape (N, 2) for one scene)
    query_points = torch.FloatTensor([[100.0, 200.0],
                                      [60.72, 259.94]]).to(device)
    track_list, vis_score, conf_score = model.track_head(aggregated_tokens_list, images, ps_idx, query_points=query_points[None])
```
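
If you want to eyeball the unprojected points outside of the demos below, here is a minimal, dependency-light sketch that dumps them to an ASCII PLY file; it assumes `point_map_by_unprojection` is an array whose last dimension holds XYZ coordinates:

```python
import numpy as np

# Flatten the per-frame point maps into a single (N, 3) array of XYZ points.
points = np.asarray(point_map_by_unprojection).reshape(-1, 3)

# Write a minimal ASCII PLY that most 3D viewers (e.g., MeshLab) can open.
with open("points.ply", "w") as f:
    f.write("ply\nformat ascii 1.0\n")
    f.write(f"element vertex {points.shape[0]}\n")
    f.write("property float x\nproperty float y\nproperty float z\n")
    f.write("end_header\n")
    np.savetxt(f, points, fmt="%.6f")
```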

## Visualization

We provide multiple ways to visualize your 3D reconstructions and tracking results. Before using these visualization tools, install the required dependencies:

```bash
pip install -r requirements_demo.txt
```

### Interactive 3D Visualization

#### Gradio Web Interface

Our Gradio-based interface allows you to upload images/videos, run reconstruction, and interactively explore the 3D scene in your browser:

```bash
python demo_gradio.py
```

#### Viser 3D Viewer

Run the following command to reconstruct a scene and visualize the point clouds in viser. Note that the script expects a path to a folder containing only image files. You can set `--use_point_map` to use the point cloud from the point map branch instead of the depth-based point cloud (see the example below).

```bash
python demo_viser.py --image_folder path/to/your/images/folder
```
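
For example, to visualize the point cloud predicted by the point map branch instead of the depth-based one:

```bash
python demo_viser.py --image_folder path/to/your/images/folder --use_point_map
```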

### Track Visualization

To visualize point tracks across multiple images:

```python
from vggt.utils.visual_track import visualize_tracks_on_images

track = track_list[-1]
visualize_tracks_on_images(images, track, vis_score > 0.5, out_dir="track_visuals")
```

This plots the tracks on the images and saves them to the specified output directory.

## Single-view Reconstruction

Our model shows surprisingly good performance on single-view reconstruction, although it was never trained for this task. The model does not need to duplicate the single-view image into a pair; instead, it infers the 3D structure directly from the tokens of the single view. Feel free to try it with the demos above, which naturally work for single-view reconstruction.
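
For instance, a minimal single-view sketch reusing the Quick Start setup (the example path is a placeholder):

```python
# Single-view inference: identical to the multi-view Quick Start,
# just with a one-element image list.
images = load_and_preprocess_images(["examples/kitchen/images/00.png"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
    predictions = model(images)
```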

## Runtime and GPU Memory

We benchmark the runtime and GPU memory usage of VGGT's aggregator on a single NVIDIA H100 GPU across various input sizes.

| **Input Frames** | 1 | 2 | 4 | 8 | 10 | 20 | 50 | 100 | 200 |
|:----------------:|:----:|:----:|:----:|:----:|:----:|:----:|:-----:|:-----:|:-----:|
| **Time (s)**     | 0.04 | 0.05 | 0.07 | 0.11 | 0.14 | 0.31 | 1.04  | 3.12  | 8.75  |
| **Memory (GB)**  | 1.88 | 2.07 | 2.45 | 3.23 | 3.63 | 5.58 | 11.41 | 21.15 | 40.63 |

Note that these results were obtained using Flash Attention 3, which is faster than the default Flash Attention 2 implementation while maintaining almost the same memory usage. Feel free to compile Flash Attention 3 from source for better performance.
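
As a rough way to reproduce such measurements on your own hardware, here is a sketch that times the aggregator and reports peak GPU memory; it assumes `model`, `dtype`, and a batched `images` tensor of shape (1, S, 3, H, W) from the examples above:

```python
import time

import torch

# Warm-up run so one-time CUDA initialization doesn't skew the timing.
with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
    model.aggregator(images)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
    aggregated_tokens_list, ps_idx = model.aggregator(images)
torch.cuda.synchronize()

print(f"time: {time.time() - start:.2f}s")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```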

## Checklist

- [ ] Release the training code
- [ ] Release VGGT-500M and VGGT-200M

## License

See the [LICENSE](./LICENSE.txt) file for details about the license under which this code is made available.

## Citation

If you find our repository useful, please consider giving it a star ⭐ and citing our paper in your work:

```bibtex
@inproceedings{wang2025vggt,
  title={VGGT: Visual Geometry Grounded Transformer},
  author={Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```

---
title: vggt
emoji: 🏆
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.17.1
app_file: app.py
pinned: false
license: cc-by-nc-4.0
short_description: vggt (alpha test)
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference