---
license: creativeml-openrail-m
datasets:
- laion/laion400m
tags:
- stable-diffusion
- stable-diffusion-diffusers
- text-to-image
language:
- en
pipeline_tag: text-to-3d
---

# LDM3D-VR model

The LDM3D-VR model was proposed in ["LDM3D-VR: Latent Diffusion Model for 3D VR"](https://arxiv.org/pdf/2311.03226.pdf) by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch and Vasudev Lal.

LDM3D-VR was accepted to the [NeurIPS 2023 Workshop on Diffusion Models](https://neurips.cc/virtual/2023/workshop/66539).

This new checkpoint is related to panoramic generation and is called LDM3D-pano.

# Model description
The abstract from the paper is the following: Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano
and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods. 

![LDM3D overview](model_overview.png)
<font size="2">LDM3D overview taken from [the original LDM3D paper](https://arxiv.org/abs/2305.10853)</font>


### How to use

Here is how to use this model to generate a panoramic RGB image and its corresponding depth map from a text prompt in PyTorch:
```python
from diffusers import StableDiffusionLDM3DPipeline

# Load the LDM3D-pano pipeline and move it to the GPU
pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-pano")
pipe.to("cuda")

prompt = "360 view of a large bedroom"
name = "bedroom_pano"

# Generate a 1024x512 panoramic RGB image together with its depth map
output = pipe(
    prompt,
    width=1024,
    height=512,
    guidance_scale=7.0,
    num_inference_steps=50,
)

# output.rgb and output.depth are lists of PIL images
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save(name + "_ldm3d_rgb.jpg")
depth_image[0].save(name + "_ldm3d_depth.png")
```

This is the result:

![ldm3d_results](ldm3d_pano_results.png)
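
If GPU memory is limited, the pipeline can also be loaded in half precision (a minimal variation on the snippet above, assuming a CUDA-capable device):

```python
import torch
from diffusers import StableDiffusionLDM3DPipeline

# Optional: load the weights in float16 to roughly halve GPU memory usage
pipe = StableDiffusionLDM3DPipeline.from_pretrained(
    "Intel/ldm3d-pano", torch_dtype=torch.float16
)
pipe.to("cuda")
```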


### Finetuning

This checkpoint was fine-tuned from the previous [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c) checkpoint on two panoramic-image datasets:
- [polyhaven](https://polyhaven.com/): 585 images for the training set, 66 images for the validation set
- [ihdri](https://www.ihdri.com/hdri-skies-outdoor/): 57 outdoor images for the training set, 7 outdoor images for the validation set.

  
These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13,852 training samples and 1,606 validation samples.

The depth map for each sample was generated with [DPT-large](https://github.com/isl-org/MiDaS), and the caption with [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2).
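
For illustration, here is a minimal sketch of how a depth map and a caption could be obtained for a single image with the `transformers` pipelines. The specific checkpoints (`Intel/dpt-large`, `Salesforce/blip2-opt-2.7b`) and the input file name are assumptions; the exact preprocessing used to build the dataset is not described in this card.

```python
from PIL import Image
from transformers import pipeline

image = Image.open("bedroom_pano_ldm3d_rgb.jpg")  # hypothetical input image

# Monocular depth estimation with a DPT-Large (MiDaS) checkpoint
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth_estimator(image)["depth"].save("bedroom_pano_dpt_depth.png")

# Caption generation with a BLIP-2 checkpoint
captioner = pipeline("image-to-text", model="Salesforce/blip2-opt-2.7b")
print(captioner(image)[0]["generated_text"])
```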


### BibTeX entry and citation info
```bibtex
@misc{stan2023ldm3dvr,
      title={LDM3D-VR: Latent Diffusion Model for 3D VR},
      author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
      year={2023},
      eprint={2311.03226},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```