|
--- |
|
license: creativeml-openrail-m |
|
datasets: |
|
- laion/laion400m |
|
tags: |
|
- stable-diffusion |
|
- stable-diffusion-diffusers |
|
- text-to-image |
|
language: |
|
- en |
|
pipeline_tag: text-to-3d |
|
--- |
|
|
|
# LDM3D-VR model |
|
|
|
The LDM3D-VR model was proposed in ["LDM3D-VR: Latent Diffusion Model for 3D VR"](https://arxiv.org/pdf/2311.03226.pdf) by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch and Vasudev Lal.
|
|
|
LDM3D-VR was accepted at the [NeurIPS 2023 Workshop on Diffusion Models](https://neurips.cc/virtual/2023/workshop/66539).
|
|
|
This new checkpoint is related to the panoramic generation model, LDM3D-pano.
|
|
|
# Model description |
|
The abstract from the paper is the following:

Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
|
|
|
 |
|
<font size="2">LDM3D overview taken from [the original paper](https://arxiv.org/abs/2305.10853)</font> |
|
|
|
|
|
### How to use |
|
|
|
Here is how to use this model to generate a panoramic RGB image and the corresponding depth map from a text prompt:
|
```python
from diffusers import StableDiffusionLDM3DPipeline

# Load the LDM3D-pano checkpoint and move the pipeline to GPU
pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-pano")
pipe.to("cuda")

prompt = "360 view of a large bedroom"
name = "bedroom_pano"

# Generate a 1024x512 panoramic RGB image together with its depth map
output = pipe(
    prompt,
    width=1024,
    height=512,
    guidance_scale=7.0,
    num_inference_steps=50,
)

# The pipeline returns lists of PIL images for both modalities
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save(name + "_ldm3d_rgb.jpg")
depth_image[0].save(name + "_ldm3d_depth.png")
```
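
If GPU memory is limited, the pipeline can usually be loaded in half precision, as with other Diffusers pipelines. This is only a sketch of that option, not part of the original recipe:

```python
import torch
from diffusers import StableDiffusionLDM3DPipeline

# Optional: load the weights in fp16 to reduce GPU memory usage
pipe = StableDiffusionLDM3DPipeline.from_pretrained(
    "Intel/ldm3d-pano", torch_dtype=torch.float16
)
pipe.to("cuda")
```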
|
|
|
Running the example above produces the following result:
|
|
|
 |
|
|
|
|
|
### Finetuning |
|
|
|
This checkpoint fine-tunes the previous [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c) checkpoint on two panoramic-image datasets:
|
- [polyhaven](https://polyhaven.com/): 585 images for the training set, 66 images for the validation set |
|
- [ihdri](https://www.ihdri.com/hdri-skies-outdoor/): 57 outdoor images for the training set, 7 outdoor images for the validation set. |
|
|
|
|
|
These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13852 training samples and 1606 validation samples. |
|
|
|
To generate the depth maps for these samples we used [DPT-large](https://github.com/isl-org/MiDaS), and to generate the captions we used [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2).
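
For reference, a data-preparation step like this can be sketched with the `transformers` pipelines. The snippet below is illustrative only; the exact checkpoints (`Intel/dpt-large`, `Salesforce/blip2-opt-2.7b`) and the input file name are assumptions, not necessarily what was used to build this dataset:

```python
from PIL import Image
from transformers import pipeline

# Hypothetical input panorama; replace with your own image path
image = Image.open("panorama_sample.jpg").convert("RGB")

# Monocular depth estimation with a DPT checkpoint (assumed: Intel/dpt-large)
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth_map = depth_estimator(image)["depth"]  # PIL image of the predicted depth
depth_map.save("panorama_sample_depth.png")

# Automatic captioning with a BLIP-2 checkpoint (assumed: Salesforce/blip2-opt-2.7b)
captioner = pipeline("image-to-text", model="Salesforce/blip2-opt-2.7b")
caption = captioner(image)[0]["generated_text"]
print(caption)
```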
|
|
|
|
|
### BibTeX entry and citation info |
|
    @misc{stan2023ldm3dvr,
        title={LDM3D-VR: Latent Diffusion Model for 3D VR},
        author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
        year={2023},
        eprint={2311.03226},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }