camenduru committed on
Commit c77cbf6 · verified · 1 Parent(s): c68420f

thanks to haoningwu ❤
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/SceneGen.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
pipeline_tag: image-to-3d
license: mit
language:
- en
---

# SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

This repository contains the official PyTorch implementation of SceneGen (https://arxiv.org/abs/2508.15769). Feel free to reach out for discussions!

**The inference code and pretrained models are now released!**

<div align="center">
  <img src="./assets/SceneGen.png">
</div>

## 🌟 Some Information
[Project Page](https://mengmouxu.github.io/SceneGen/) · [Paper](https://arxiv.org/abs/2508.15769/) · [Checkpoints](https://huggingface.co/haoningwu/SceneGen/)

## ⏩ News
- [2025.8] The inference code and checkpoints are released.
- [2025.8] Our pre-print paper has been released on arXiv.

## 📦 Installation & Pretrained Models

### Prerequisites
- **Hardware**: An NVIDIA GPU with at least 16 GB of memory is required. The code has been verified on NVIDIA A100 and RTX 3090 GPUs.
- **Software**:
  - The [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit-archive) is needed to compile certain submodules. The code has been tested with CUDA version 12.1.
  - Python 3.8 or higher is required.

35
+ 1. Clone the repo:
36
+ ```sh
37
+ git clone https://github.com/Mengmouxu/SceneGen.git
38
+ cd SceneGen
39
+ ```
40
+
41
+ 2. Install the dependencies:
42
+ Create a new conda environment named `scenegen` and install the dependencies:
43
+ ```sh
44
+ . ./setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast --demo
45
+ ```
46
+ The detailed usage of `setup.sh` can be found by running `. ./setup.sh --help`.
47
+
### Pretrained Models
1. First, create a directory in the SceneGen folder to store the checkpoints:
   ```sh
   mkdir -p checkpoints
   ```
2. Download the pretrained **SAM2-Hiera-Large** and **VGGT-1B** models from [SAM2](https://huggingface.co/facebook/sam2-hiera-large/) and [VGGT](https://huggingface.co/facebook/VGGT-1B/), then place them in the `checkpoints` directory. (**SAM2** and its checkpoints are required for interactive generation with segmentation.)
3. Download our pretrained SceneGen model from [here](https://huggingface.co/haoningwu/SceneGen/) and place it in the `checkpoints` directory as follows:
   ```
   SceneGen/
   ├── checkpoints/
   │   ├── sam2-hiera-large
   │   ├── VGGT-1B
   │   └── scenegen
   │       ├── ckpts
   │       └── pipeline.json
   └── ...
   ```
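Before running inference, it can help to verify this layout programmatically. Below is a minimal stdlib sketch; the directory names come from the tree above, but `missing_checkpoints` is a helper introduced here for illustration, not part of the SceneGen codebase:

```python
import os

# Expected checkpoint layout, taken from the tree shown in the README.
REQUIRED = [
    "sam2-hiera-large",
    "VGGT-1B",
    os.path.join("scenegen", "ckpts"),
    os.path.join("scenegen", "pipeline.json"),
]

def missing_checkpoints(root="checkpoints"):
    """Return the required checkpoint paths that are absent under `root`."""
    return [p for p in REQUIRED if not os.path.exists(os.path.join(root, p))]

if __name__ == "__main__":
    missing = missing_checkpoints()
    print("All checkpoints in place." if not missing else f"Missing: {missing}")
```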
## 💡 Inference
We provide two scripts for inference: `inference.py` for batch processing and `interactive_demo.py` for an interactive Gradio demo.

### Interactive Demo
This script launches a Gradio web interface for interactive scene generation.
- **Features**: It uses SAM2 for interactive image segmentation, allows adjusting various generation parameters, and supports scene generation from single or multiple images.
- **Usage**:
  ```sh
  python interactive_demo.py
  ```
> ## 🚀 Quick Start Guide
>
> ### 📷 Step 1: Input & Segment
> 1. **Upload your scene image.**
> 2. **Use the mouse to draw bounding boxes** around objects.
> 3. Click **"Run Segmentation"** to segment the objects.
> > *※ For multi-image generation: keep the object annotation order consistent across all images.*
>
> ### 🗃️ Step 2: Manage Cache
> 1. Click **"Add to Cache"** when satisfied with the segmentation.
> 2. Repeat Steps 1 and 2 for additional images.
> 3. Use **"Delete Selected"** or **"Clear All"** to manage cached images.
>
> ### 🎮 Step 3: Generate Scene
> 1. Adjust generation parameters (optional).
> 2. Click **"Generate 3D Scene"**.
> 3. Download the generated GLB file when ready.
>
> **💡 Pro Tip:** Try the examples below to get started quickly!

### Pre-segmented Image Inference
This script processes a directory of pre-segmented images.
- **Input**: The input folder structure should be similar to `assets/masked_image_test`, containing segmented scene images.
- **Visualization**: For scenes with ground-truth data, you can use the `--gradio` flag to launch a Gradio interface that visualizes both the ground truth and the generated model. We provide data from the 3D-FUTURE test set as a demonstration.
- **Usage**:
  ```sh
  python inference.py --gradio
  ```

## 📚 Dataset
To be updated soon...

## 🏋️‍♂️ Training
To be updated soon...

## Evaluation
To be updated soon...

## 📜 Citation
If you use this code and data for your research or project, please cite:

    @article{meng2025scenegen,
      author  = {Meng, Yanxu and Wu, Haoning and Zhang, Ya and Xie, Weidi},
      title   = {SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass},
      journal = {arXiv preprint arXiv:2508.15769},
      year    = {2025},
    }

## TODO
- [x] Release Paper
- [x] Release Checkpoints & Inference Code
- [ ] Release Training Code
- [ ] Release Evaluation Code
- [ ] Release Data Processing Code

## Acknowledgements
Many thanks to the code bases of [TRELLIS](https://github.com/microsoft/TRELLIS), [DINOv2](https://github.com/facebookresearch/dinov2), and [VGGT](https://github.com/facebookresearch/vggt).

## Contact
If you have any questions, please feel free to contact [meng-mou-xu@sjtu.edu.cn](mailto:meng-mou-xu@sjtu.edu.cn) and [haoningwu3639@gmail.com](mailto:haoningwu3639@gmail.com).
assets/SceneGen.png ADDED

Git LFS Details

  • SHA256: 2fd57a3df30a2a484dfce27ca9b65b2d4b819114b754a79486ba94026f8df1ce
  • Pointer size: 131 Bytes
  • Size of remote file: 144 kB
assets/icon.png ADDED
ckpts/slat_dec_gs_swin8_B_64l8gs32_fp16.json ADDED
{
    "name": "SLatGaussianDecoder",
    "args": {
        "resolution": 64,
        "model_channels": 768,
        "latent_channels": 8,
        "num_blocks": 12,
        "num_heads": 12,
        "mlp_ratio": 4,
        "attn_mode": "swin",
        "window_size": 8,
        "use_fp16": true,
        "representation_config": {
            "lr": {
                "_xyz": 1.0,
                "_features_dc": 1.0,
                "_opacity": 1.0,
                "_scaling": 1.0,
                "_rotation": 0.1
            },
            "perturb_offset": true,
            "voxel_size": 1.5,
            "num_gaussians": 32,
            "2d_filter_kernel_size": 0.1,
            "3d_filter_kernel_size": 9e-4,
            "scaling_bias": 4e-3,
            "opacity_bias": 0.1,
            "scaling_activation": "softplus"
        }
    }
}
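Each of these config files pairs a model `name` with its constructor `args`. A common way such files are consumed is a registry lookup that instantiates the named class. The sketch below illustrates that pattern with a stand-in class; the registry and the stub constructor are hypothetical, while the real model classes live in the SceneGen/TRELLIS codebase:

```python
import json

# Hypothetical registry; the real code maps these names to PyTorch models.
MODEL_REGISTRY = {}

def register(name):
    """Class decorator that records a constructor under a config name."""
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

@register("SLatGaussianDecoder")
class SLatGaussianDecoder:
    # Stand-in: stores a few args instead of building a network.
    def __init__(self, resolution, model_channels, latent_channels, **kwargs):
        self.resolution = resolution
        self.model_channels = model_channels
        self.latent_channels = latent_channels

def build_model(config_path):
    """Instantiate the model described by a {name, args} JSON config."""
    with open(config_path) as f:
        cfg = json.load(f)
    return MODEL_REGISTRY[cfg["name"]](**cfg["args"])
```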
ckpts/slat_dec_gs_swin8_B_64l8gs32_fp16.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:38c84bcef5ce0af1f48b1b5558dabc7575a13346043c41a7e0610f1fa619a161
size 171450952
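The weight files in this commit are stored as Git LFS pointers like the one above. After downloading, a blob can be checked against the pointer's `oid` and `size` fields; here is a stdlib-only sketch (`verify_lfs_file` is a helper written for this note, not a Git LFS tool):

```python
import hashlib
import os

def parse_lfs_pointer(path):
    """Parse a Git LFS pointer file into {'oid': hex digest, 'size': bytes}."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            if key == "oid":
                info["oid"] = value.split(":", 1)[1]  # strip "sha256:" prefix
            elif key == "size":
                info["size"] = int(value)
    return info

def verify_lfs_file(pointer_path, blob_path):
    """Check a downloaded blob against its LFS pointer (size, then sha256)."""
    info = parse_lfs_pointer(pointer_path)
    if os.path.getsize(blob_path) != info["size"]:
        return False
    h = hashlib.sha256()
    with open(blob_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == info["oid"]
```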
ckpts/slat_dec_mesh_swin8_B_64l8m256c_fp16.json ADDED
{
    "name": "SLatMeshDecoder",
    "args": {
        "resolution": 64,
        "model_channels": 768,
        "latent_channels": 8,
        "num_blocks": 12,
        "num_heads": 12,
        "mlp_ratio": 4,
        "attn_mode": "swin",
        "window_size": 8,
        "use_fp16": true,
        "representation_config": {
            "use_color": true
        }
    }
}
ckpts/slat_dec_mesh_swin8_B_64l8m256c_fp16.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:3e87aba94b5786407eb06d0502c1ed0885a0027a3f2b8537bfe15b0a92c01859
size 181903412
ckpts/slat_dec_rf_swin8_B_64l8r16_fp16.json ADDED
{
    "name": "SLatRadianceFieldDecoder",
    "args": {
        "resolution": 64,
        "model_channels": 768,
        "latent_channels": 8,
        "num_blocks": 12,
        "num_heads": 12,
        "mlp_ratio": 4,
        "attn_mode": "swin",
        "window_size": 8,
        "use_fp16": true,
        "representation_config": {
            "rank": 16,
            "dim": 8
        }
    }
}
ckpts/slat_dec_rf_swin8_B_64l8r16_fp16.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:737da6578d01948016b7c39786113af0d64a46f7922f6b8b5e698b84643be514
size 171450488
ckpts/slat_enc_swin8_B_64l8_fp16.json ADDED
{
    "name": "SLatEncoder",
    "args": {
        "resolution": 64,
        "in_channels": 1024,
        "model_channels": 768,
        "latent_channels": 8,
        "num_blocks": 12,
        "num_heads": 12,
        "mlp_ratio": 4,
        "attn_mode": "swin",
        "window_size": 8,
        "use_fp16": true
    }
}
ckpts/slat_enc_swin8_B_64l8_fp16.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:21dceac6bee917ab6458ff52c9757ba89a779d03031c7bd17f9e7f0103bfd436
size 173242816
ckpts/slat_flow_img_dit_L_64l8p2_fp16.json ADDED
{
    "name": "SLatFlowModel",
    "args": {
        "resolution": 64,
        "in_channels": 8,
        "out_channels": 8,
        "model_channels": 1024,
        "cond_channels": 1024,
        "num_blocks": 24,
        "num_heads": 16,
        "mlp_ratio": 4,
        "patch_size": 2,
        "num_io_res_blocks": 2,
        "io_block_channels": [128],
        "pe_mode": "ape",
        "qk_rms_norm": true,
        "use_fp16": true
    }
}
ckpts/slat_flow_img_dit_L_64l8p2_fp16.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:693fb2a58ad497bd222007301eeec49d14d60f8c12d2f2f00c221fa747b4c66c
size 1203755136
ckpts/ss_dec_conv3d_16l8_fp16.json ADDED
{
    "name": "SparseStructureDecoder",
    "args": {
        "out_channels": 1,
        "latent_channels": 8,
        "num_res_blocks": 2,
        "num_res_blocks_middle": 2,
        "channels": [512, 128, 32],
        "use_fp16": true
    }
}
ckpts/ss_dec_conv3d_16l8_fp16.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1c76d4a40519aa2d711cc263a8404105231ac26db31d946bed48b84fee79009a
size 147591972
ckpts/ss_enc_conv3d_16l8_fp16.json ADDED
{
    "name": "SparseStructureEncoder",
    "args": {
        "in_channels": 1,
        "latent_channels": 8,
        "num_res_blocks": 2,
        "num_res_blocks_middle": 2,
        "channels": [32, 128, 512],
        "use_fp16": true
    }
}
ckpts/ss_enc_conv3d_16l8_fp16.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:107874eeaa0feb82f51b19db5da7db534fb7e7f19e5a122b9ff1bc2e258bfc6d
size 119068016
ckpts/ss_scenegen_flow_img_dit_L_16l8_fp16.json ADDED
{
    "name": "SparseStructureFlowModel",
    "args": {
        "resolution": 16,
        "in_channels": 8,
        "out_channels": 8,
        "model_channels": 1024,
        "cond_channels": 1024,
        "num_blocks": 24,
        "num_heads": 16,
        "mlp_ratio": 4,
        "patch_size": 1,
        "pe_mode": "ape",
        "qk_rms_norm": true,
        "use_fp16": true,
        "use_global": true,
        "trunk_depth": 4,
        "num_iteration": 4,
        "use_batch_encoder": false
    }
}
ckpts/ss_scenegen_flow_img_dit_L_16l8_fp16.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:d0aa2be2fcf950c68da708be5b54fbe5289628e5d2aa425de41446d87c8c1936
size 2544030704
pipeline.json ADDED
{
    "name": "SceneGenImageToScenePipeline",
    "args": {
        "models": {
            "sparse_structure_decoder": "ckpts/ss_dec_conv3d_16l8_fp16",
            "sparse_structure_flow_model": "ckpts/ss_scenegen_flow_img_dit_L_16l8_fp16",
            "slat_decoder_gs": "ckpts/slat_dec_gs_swin8_B_64l8gs32_fp16",
            "slat_decoder_rf": "ckpts/slat_dec_rf_swin8_B_64l8r16_fp16",
            "slat_decoder_mesh": "ckpts/slat_dec_mesh_swin8_B_64l8m256c_fp16",
            "slat_flow_model": "ckpts/slat_flow_img_dit_L_64l8p2_fp16"
        },
        "sparse_structure_sampler": {
            "name": "FlowEulerGuidanceIntervalSamplerVGGT",
            "args": {
                "sigma_min": 1e-5
            },
            "params": {
                "steps": 25,
                "cfg_strength": 5.0,
                "cfg_interval": [0.5, 1.0],
                "rescale_t": 3.0
            }
        },
        "slat_sampler": {
            "name": "FlowEulerGuidanceIntervalSampler",
            "args": {
                "sigma_min": 1e-5
            },
            "params": {
                "steps": 25,
                "cfg_strength": 5.0,
                "cfg_interval": [0.5, 1.0],
                "rescale_t": 3.0
            }
        },
        "slat_normalization": {
            "mean": [
                -2.1687545776367188,
                -0.004347046371549368,
                -0.13352349400520325,
                -0.08418072760105133,
                -0.5271206498146057,
                0.7238689064979553,
                -1.1414450407028198,
                1.2039363384246826
            ],
            "std": [
                2.377650737762451,
                2.386378288269043,
                2.124418020248413,
                2.1748552322387695,
                2.663944721221924,
                2.371192216873169,
                2.6217446327209473,
                2.684523105621338
            ]
        },
        "image_cond_model": "dinov2_vitl14_reg",
        "vggt_model": "checkpoints/VGGT-1B"
    }
}
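The pipeline config above ties together the model checkpoints, the two flow samplers, and the normalization statistics for the 8-channel structured latent. A stdlib-only sketch of reading it and applying the per-channel standardization; the helper names (`load_pipeline_args`, `normalize_slat`) are introduced here for illustration and are not part of the SceneGen API:

```python
import json

def load_pipeline_args(path):
    """Read a {name, args} pipeline config and return its args dict."""
    with open(path) as f:
        return json.load(f)["args"]

def normalize_slat(latents, mean, std):
    """Standardize structured latents per channel: (x - mean) / std."""
    return [[(x - m) / s for x, m, s in zip(row, mean, std)] for row in latents]
```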