---
license: creativeml-openrail-m
datasets:
- laion/laion400m
tags:
- stable-diffusion
- stable-diffusion-diffusers
- text-to-image
language:
- en
pipeline_tag: text-to-3d
---

# LDM3D-VR model

The LDM3D-VR model was proposed in ["LDM3D-VR: Latent Diffusion Model for 3D VR"](https://arxiv.org/pdf/2311.03226.pdf) by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, and Vasudev Lal. LDM3D-VR was accepted to the [NeurIPS 2023 Workshop on Diffusion Models](https://neurips.cc/virtual/2023/workshop/66539).

This checkpoint is the panoramic generation model of the suite, LDM3D-pano.

# Model description

The abstract from the paper is the following:

Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.

![LDM3D overview](model_overview.png)

LDM3D overview taken from [the original LDM3D paper](https://arxiv.org/abs/2305.10853)

### How to use

Here is how to use this model to generate a panoramic RGB image and its depth map from a text prompt in PyTorch:

```python
from diffusers import StableDiffusionLDM3DPipeline

# Load the panoramic LDM3D checkpoint and move it to the GPU
pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-pano")
pipe.to("cuda")

prompt = "360 view of a large bedroom"
name = "bedroom_pano"

# Panoramas are generated at a 2:1 aspect ratio (1024x512)
output = pipe(
    prompt,
    width=1024,
    height=512,
    guidance_scale=7.0,
    num_inference_steps=50,
)
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save(name + "_ldm3d_rgb.jpg")
depth_image[0].save(name + "_ldm3d_depth.png")  # depth is saved as a 16-bit PNG
```

A sketch for loading the saved depth map back into an array is given at the end of this card.

This is the result:

![ldm3d_results](ldm3d_pano_results.png)

### Finetuning

This checkpoint finetunes the previous [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c) checkpoint on two panoramic-image datasets:
- [polyhaven](https://polyhaven.com/): 585 images for the training set, 66 images for the validation set
- [ihdri](https://www.ihdri.com/hdri-skies-outdoor/): 57 outdoor images for the training set, 7 outdoor images for the validation set

These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13,852 training samples and 1,606 validation samples. To generate the depth maps of those samples we used [DPT-large](https://github.com/isl-org/MiDaS), and to generate the captions we used [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) (see the data-preparation sketch at the end of this card).

### BibTeX entry and citation info

```bibtex
@misc{stan2023ldm3dvr,
      title={LDM3D-VR: Latent Diffusion Model for 3D VR},
      author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
      year={2023},
      eprint={2311.03226},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
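### Working with the depth output

The pipeline above saves the depth map as a 16-bit PNG. The following is a minimal sketch of loading it back into a normalized NumPy array for visualization or 3D projection, assuming the relative depth values span the full `uint16` range; `load_depth` is a hypothetical helper name:

```python
import numpy as np
from PIL import Image

def load_depth(path: str) -> np.ndarray:
    """Load a 16-bit depth PNG and scale it to [0, 1] (hypothetical helper)."""
    depth_img = Image.open(path)  # 16-bit grayscale PNG saved by the pipeline
    depth = np.asarray(depth_img, dtype=np.float32)
    return depth / 65535.0  # assumes relative depth over the full uint16 range

depth = load_depth("bedroom_pano_ldm3d_depth.png")
print(depth.shape, depth.min(), depth.max())  # e.g. (512, 1024) with values in [0, 1]
```

Note that LDM3D depth is relative rather than metric, so the normalization above only fixes a display range; any metric use would require an external scale.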
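### Data-preparation sketch

The depth/caption generation step described in the Finetuning section can be approximated with off-the-shelf checkpoints. The snippet below is an illustrative sketch rather than the paper's exact pipeline: `Intel/dpt-large` and `Salesforce/blip2-opt-2.7b` are stand-in checkpoint names, and the paper's preprocessing settings may differ.

```python
import torch
from PIL import Image
from transformers import pipeline, Blip2Processor, Blip2ForConditionalGeneration

image = Image.open("panorama_sample.png").convert("RGB")  # one augmented panorama

# Pseudo ground-truth depth from a DPT model (stand-in for the paper's DPT-large setup)
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth_map = depth_estimator(image)["depth"]  # PIL image of relative depth
depth_map.save("panorama_sample_depth.png")

# Caption from BLIP-2 (stand-in checkpoint)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
caption_ids = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()
print(caption)
```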