|
--- |
|
license: creativeml-openrail-m |
|
datasets: |
|
- laion/laion400m |
|
tags: |
|
- stable-diffusion |
|
- stable-diffusion-diffusers |
|
- text-to-image |
|
language: |
|
- en |
|
pipeline_tag: text-to-3d |
|
--- |
|
|
|
# LDM3D-VR model |
|
|
|
The LDM3D-VR model was proposed in ["LDM3D-VR: Latent Diffusion Model for 3D VR"](https://arxiv.org/pdf/2311.03226.pdf) by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch and Vasudev Lal.
|
|
|
LDM3D-VR was accepted at the [NeurIPS 2023 Workshop on Diffusion Models](https://neurips.cc/virtual/2023/workshop/66539).
|
|
|
This new checkpoint is related to the panoramic generation model, LDM3D-pano.
|
|
|
# Model description |
|
The abstract from the paper is the following:

Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
|
|
|
 |
|
<font size="2">LDM3D overview taken from [the original paper](https://arxiv.org/abs/2305.10853)</font> |
|
|
|
|
|
### How to use |
|
|
|
Here is how to use this model to generate a panoramic RGB image and the corresponding depth map from a text prompt:
|
```python
from diffusers import StableDiffusionLDM3DPipeline

# Load the LDM3D-pano checkpoint and move the pipeline to GPU
pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-pano")
pipe.to("cuda")

prompt = "360 view of a large bedroom"
name = "bedroom_pano"

# Generate a 1024x512 panoramic RGB image together with its depth map
output = pipe(
    prompt,
    width=1024,
    height=512,
    guidance_scale=7.0,
    num_inference_steps=50,
)

# The pipeline returns lists of PIL images for both modalities
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save(name + "_ldm3d_rgb.jpg")
depth_image[0].save(name + "_ldm3d_depth.png")
```
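
If GPU memory is limited, the pipeline can usually be loaded in half precision, as with other Diffusers pipelines. This is only a sketch of that option, not part of the original recipe:

```python
import torch
from diffusers import StableDiffusionLDM3DPipeline

# Optional: load the weights in fp16 to reduce GPU memory usage
pipe = StableDiffusionLDM3DPipeline.from_pretrained(
    "Intel/ldm3d-pano", torch_dtype=torch.float16
)
pipe.to("cuda")
```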
|
|
|
Running the example above produces the following result:
|
|
|
 |
|
|
|
|
|
### Finetuning |
|
|
|
This checkpoint fine-tunes the previous [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c) checkpoint on two panoramic-image datasets:
|
- [polyhaven](https://polyhaven.com/): 585 images for the training set, 66 images for the validation set |
|
- [ihdri](https://www.ihdri.com/hdri-skies-outdoor/): 57 outdoor images for the training set, 7 outdoor images for the validation set. |
|
|
|
|
|
These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13852 training samples and 1606 validation samples. |
|
|
|
To generate the depth maps for these samples we used [DPT-large](https://github.com/isl-org/MiDaS), and to generate the captions we used [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2).
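
For reference, a data-preparation step like this can be sketched with the `transformers` pipelines. The snippet below is illustrative only; the exact checkpoints (`Intel/dpt-large`, `Salesforce/blip2-opt-2.7b`) and the input file name are assumptions, not necessarily what was used to build this dataset:

```python
from PIL import Image
from transformers import pipeline

# Hypothetical input panorama; replace with your own image path
image = Image.open("panorama_sample.jpg").convert("RGB")

# Monocular depth estimation with a DPT checkpoint (assumed: Intel/dpt-large)
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth_map = depth_estimator(image)["depth"]  # PIL image of the predicted depth
depth_map.save("panorama_sample_depth.png")

# Automatic captioning with a BLIP-2 checkpoint (assumed: Salesforce/blip2-opt-2.7b)
captioner = pipeline("image-to-text", model="Salesforce/blip2-opt-2.7b")
caption = captioner(image)[0]["generated_text"]
print(caption)
```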
|
|
|
|
|
### BibTeX entry and citation info |
|
    @misc{stan2023ldm3dvr,
        title={LDM3D-VR: Latent Diffusion Model for 3D VR},
        author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
        year={2023},
        eprint={2311.03226},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }