---
license: apache-2.0
pipeline_tag: image-to-3d
tags:
- novel-view-synthesis
- multi-view-diffusion
- depth-estimation
- 3d-reconstruction
---
# GLD: Geometric Latent Diffusion

**Repurposing Geometric Foundation Models for Multi-view Diffusion**

[Paper] | [arXiv] | [Project Page] | [Code]
Geometric Latent Diffusion (GLD) is a framework that repurposes the geometrically consistent feature space of geometric foundation models (such as Depth Anything 3 and VGGT) as the latent space for multi-view diffusion. By operating in this space rather than a view-independent VAE latent space, GLD achieves consistent novel view synthesis (NVS) and 3D reconstruction with significantly faster training convergence.
## Quick Start

```bash
git clone https://github.com/cvlab-kaist/GLD.git
cd GLD
conda env create -f environment.yml
conda activate gld

# Download all checkpoints
python -c "from huggingface_hub import snapshot_download; snapshot_download('SeonghuJeon/GLD', local_dir='.')"

# Run the demo with the DA3 backbone
./run_demo.sh da3
```
## Files

| File | Description | Size |
|---|---|---|
| `checkpoints/da3_level1.pt` | DA3 Level-1 diffusion | 3.0G |
| `checkpoints/da3_cascade.pt` | DA3 Cascade (L1→L0) | 1.8G |
| `checkpoints/vggt_level1.pt` | VGGT Level-1 diffusion | 3.1G |
| `checkpoints/vggt_cascade.pt` | VGGT Cascade (L1→L0) | 3.1G |
| `pretrained_models/da3/model.safetensors` | DA3-Base encoder | 0.5G |
| `pretrained_models/da3/dpt_decoder.pt` | DPT decoder (depth + geometry) | 0.4G |
| `pretrained_models/mae_decoder.pt` | DA3 MAE decoder (RGB) | 1.6G |
| `pretrained_models/vggt/mae_decoder.pt` | VGGT MAE decoder (RGB) | 1.6G |
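The full snapshot is roughly 15G, but only one backbone's files are needed to run a given demo. A minimal sketch of a selective download using `snapshot_download`'s `allow_patterns` filter — the `download_backbone` helper and its pattern groupings are illustrative, not part of the official repo, and assume the repository layout shown in the table above:

```python
# Hypothetical helper (not in the official repo): fetch only the files
# for one backbone ("da3" or "vggt") instead of the full snapshot.
ALLOW_PATTERNS = {
    "da3": [
        "checkpoints/da3_*.pt",            # Level-1 + cascade diffusion
        "pretrained_models/da3/*",         # encoder + DPT decoder
        "pretrained_models/mae_decoder.pt",  # DA3 MAE decoder (RGB)
    ],
    "vggt": [
        "checkpoints/vggt_*.pt",           # Level-1 + cascade diffusion
        "pretrained_models/vggt/*",        # VGGT MAE decoder (RGB)
    ],
}

def download_backbone(backbone: str, local_dir: str = ".") -> str:
    """Download only the files for the chosen backbone; returns the local path."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(
        "SeonghuJeon/GLD",
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS[backbone],
    )
```

For example, `download_backbone("da3")` fetches about 5.5G instead of the whole repository.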
## Citation

```bibtex
@article{jang2026gld,
  title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
  author={Jang, Wooseok and Jeon, Seonghu and Han, Jisang and Choi, Jinhyeok and Kwon, Minkyung and Kim, Seungryong and Xie, Saining and Liu, Sainan},
  journal={arXiv preprint arXiv:2603.22275},
  year={2026}
}
```
## Acknowledgements
Built upon RAE, Depth Anything 3, VGGT, CUT3R, and SiT.