---
license: apache-2.0
pipeline_tag: image-to-3d
tags:
  - novel-view-synthesis
  - multi-view-diffusion
  - depth-estimation
  - 3d-reconstruction
---

# GLD: Geometric Latent Diffusion

**Repurposing Geometric Foundation Models for Multi-view Diffusion**

[Paper] | [arXiv] | [Project Page] | [Code]

Geometric Latent Diffusion (GLD) is a framework that repurposes the geometrically consistent feature space of geometric foundation models (such as Depth Anything 3 and VGGT) as the latent space for multi-view diffusion. By operating in this space rather than a view-independent VAE latent space, GLD achieves consistent novel view synthesis (NVS) and 3D reconstruction with significantly faster training convergence.
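The core idea, diffusing in a frozen geometric encoder's latent space rather than in a VAE latent space, can be illustrated with a toy NumPy sketch. Everything here is a stand-in: the random linear maps play the role of the frozen geometric encoder/decoder (a DA3/VGGT backbone in the real framework), and the noise schedule is a generic DDPM-style one, not GLD's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a frozen geometric encoder/decoder: fixed linear maps.
D_PIX, D_LAT = 64, 16
W_enc = rng.standard_normal((D_LAT, D_PIX)) / np.sqrt(D_PIX)
W_dec = np.linalg.pinv(W_enc)  # approximate inverse, for the sketch only

def encode(x):
    """Image -> geometric latent (frozen, not trained by the diffusion)."""
    return W_enc @ x

def decode(z):
    """Geometric latent -> image."""
    return W_dec @ z

# DDPM-style forward (noising) process applied to the *latent*, not pixels:
#   z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
T = 10
betas = np.linspace(1e-4, 0.2, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.standard_normal(D_PIX)   # a stand-in "image"
z0 = encode(x0)
eps = rng.standard_normal(D_LAT)
zT = np.sqrt(alpha_bar[-1]) * z0 + np.sqrt(1 - alpha_bar[-1]) * eps

# A trained diffusion model would predict eps from (z_t, t); here we use the
# ground-truth eps just to show the latent-space round trip.
z0_hat = (zT - np.sqrt(1 - alpha_bar[-1]) * eps) / np.sqrt(alpha_bar[-1])
x0_hat = decode(z0_hat)
```

The point of the sketch is where the noise lives: because `encode` is shared across views in the real framework, latents of different views of the same scene are geometrically aligned, which is what lets the diffusion model stay multi-view consistent.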

## Quick Start

```bash
git clone https://github.com/cvlab-kaist/GLD.git
cd GLD
conda env create -f environment.yml
conda activate gld

# Download all checkpoints
python -c "from huggingface_hub import snapshot_download; snapshot_download('SeonghuJeon/GLD', local_dir='.')"

# Run demo
./run_demo.sh da3
```

## Files

| File | Description | Size |
|------|-------------|------|
| `checkpoints/da3_level1.pt` | DA3 Level-1 diffusion | 3.0 GB |
| `checkpoints/da3_cascade.pt` | DA3 Cascade (L1→L0) | 1.8 GB |
| `checkpoints/vggt_level1.pt` | VGGT Level-1 diffusion | 3.1 GB |
| `checkpoints/vggt_cascade.pt` | VGGT Cascade (L1→L0) | 3.1 GB |
| `pretrained_models/da3/model.safetensors` | DA3-Base encoder | 0.5 GB |
| `pretrained_models/da3/dpt_decoder.pt` | DPT decoder (depth + geometry) | 0.4 GB |
| `pretrained_models/mae_decoder.pt` | DA3 MAE decoder (RGB) | 1.6 GB |
| `pretrained_models/vggt/mae_decoder.pt` | VGGT MAE decoder (RGB) | 1.6 GB |
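After running `snapshot_download`, a quick local sanity check that every file listed above actually landed might look like this. The `missing_files` helper is hypothetical (not part of the repo); the paths are copied from the table.

```python
from pathlib import Path

# Expected repository layout, taken from the file table in this README.
EXPECTED = [
    "checkpoints/da3_level1.pt",
    "checkpoints/da3_cascade.pt",
    "checkpoints/vggt_level1.pt",
    "checkpoints/vggt_cascade.pt",
    "pretrained_models/da3/model.safetensors",
    "pretrained_models/da3/dpt_decoder.pt",
    "pretrained_models/mae_decoder.pt",
    "pretrained_models/vggt/mae_decoder.pt",
]

def missing_files(root="."):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [f for f in EXPECTED if not (root / f).exists()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All checkpoints present.")
```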

## Citation

```bibtex
@article{jang2026gld,
  title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
  author={Jang, Wooseok and Jeon, Seonghu and Han, Jisang and Choi, Jinhyeok and Kwon, Minkyung and Kim, Seungryong and Xie, Saining and Liu, Sainan},
  journal={arXiv preprint arXiv:2603.22275},
  year={2026}
}
```

## Acknowledgements

This project builds upon RAE, Depth Anything 3, VGGT, CUT3R, and SiT.