Files changed (1)
  1. README.md +85 -13
README.md CHANGED
@@ -1,27 +1,58 @@
1
  ---
2
  library_name: diffusers
3
- pipeline_tag: image-to-image
 
4
  ---
5
 
6
- # LDM3D-VR model
7
 
8
- The LDM3D-VR model was proposed in ["LDM3D-VR: Latent Diffusion Model for 3D"](https://arxiv.org/pdf/2311.03226.pdf) by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, Vasudev Lal.
9
 
10
- LDM3D-VR got accepted to [NeurIPS Workshop'23 on Diffusion Models][https://neurips.cc/virtual/2023/workshop/66539].
11
 
12
- This new checkpoint related to the upscaler called ldm3d-sr.
13
 
14
- # Model description
15
- The abstract from the paper is the following: Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano
16
- and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
17
 
18
  ![LDM3D-SR overview](ldm3d-sr-overview.png)
19
  <font size="2">LDM3D-SR overview </font>
20
 
21
 
22
- ## Examples
23
 
24
- Using the [🤗's Diffusers library](https://github.com/huggingface/diffusers) in a simple and efficient manner.
25
 
26
  ```python
27
  from PIL import Image
@@ -54,17 +85,57 @@ upscaled_rgb.save(f"upscaled_lemons_rgb.png")
54
  upscaled_depth.save(f"upscaled_lemons_depth.png")
55
  ```
56
 
57
-
58
- ## Results
59
 
60
  Output of ldm3d-4c | Upscaled output
61
  :-------------------------:|:-------------------------:
62
  ![ldm3d_rgb_results](lemons_ldm3d_rgb.jpg) | ![ldm3d_sr_rgb_results](upscaled_lemons_rgb.png)
63
  ![ldm3d_depth_results](lemons_ldm3d_depth.png) | ![ldm3d_sr_depth_results](upscaled_lemons_depth.png)
64
 
65
 
 
66
 
67
  ### BibTeX entry and citation info
 
68
  @misc{stan2023ldm3dvr,
69
  title={LDM3D-VR: Latent Diffusion Model for 3D VR},
70
  author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
@@ -72,4 +143,5 @@ Output of ldm3d-4c | Upscaled output
72
  eprint={2311.03226},
73
  archivePrefix={arXiv},
74
  primaryClass={cs.CV}
75
- }
 
 
1
  ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - stable-diffusion
6
+ - stable-diffusion-diffusers
7
+ - text-to-image
8
+ model-index:
9
+ - name: ldm3d-sr
10
+ results:
11
+ - task:
12
+ name: Latent Diffusion Model for 3D - Super-Resolution
13
+ type: latent-diffusion-model-for-3D-SR
14
+ dataset:
15
+ name: LAION-400M
16
+ type: laion/laion400m
17
+ metrics:
18
+ - name: LDM3D-SR-B FID
19
+ type: LDM3D-SR-B FID
20
+ value: 14.705
21
+ - name: LDM3D-SR-B IS
22
+ type: LDM3D-SR-B IS
23
+ value: 60.371
24
+ - name: LDM3D-SR-B PSNR
25
+ type: LDM3D-SR-B PSNR
26
+ value: 24.479
27
+ - name: LDM3D-SR-B SSIM
28
+ type: LDM3D-SR-B SSIM
29
+ value: 0.665
30
+ - name: LDM3D-SR-B Depth MARE
31
+ type: LDM3D-SR-B Depth MARE
32
+ value: 0.0537
33
  library_name: diffusers
34
+ pipeline_tag: text-to-3d
35
+ license: creativeml-openrail-m
36
  ---
37
 
38
+ # LDM3D-SR model
39
 
40
+ The LDM3D-VR model suite was proposed in the paper [LDM3D-VR: Latent Diffusion Model for 3D](https://arxiv.org/pdf/2311.03226.pdf), authored by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, and Vasudev Lal.
41
 
42
+ LDM3D-VR was accepted to the [NeurIPS 2023 Workshop on Diffusion Models](https://neurips.cc/virtual/2023/workshop/66539).
43
 
44
+ This checkpoint provides the upscaler, LDM3D-SR.
45
 
46
+ ## Model details
47
+ Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
 
48
 
49
  ![LDM3D-SR overview](ldm3d-sr-overview.png)
50
  <font size="2">LDM3D-SR overview </font>
51
 
52
 
53
+ ## Usage
54
 
55
+ The model can be run with the [🤗 Diffusers library](https://github.com/huggingface/diffusers):
56
 
57
  ```python
58
  from PIL import Image
 
85
  upscaled_depth.save(f"upscaled_lemons_depth.png")
86
  ```
87
 
88
+ This is the result:
 
89
 
90
  Output of ldm3d-4c | Upscaled output
91
  :-------------------------:|:-------------------------:
92
  ![ldm3d_rgb_results](lemons_ldm3d_rgb.jpg) | ![ldm3d_sr_rgb_results](upscaled_lemons_rgb.png)
93
  ![ldm3d_depth_results](lemons_ldm3d_depth.png) | ![ldm3d_sr_depth_results](upscaled_lemons_depth.png)
94
 
95
+ ## Training data
96
+
97
+ The LDM3D model was fine-tuned on a dataset constructed from a subset of LAION-400M, a large-scale dataset of over 400 million image-caption pairs. Fine-tuning LDM3D-SR uses additional high-resolution (HR) and low-resolution (LR) sets with 261,045 samples each. The HR samples are a subset of LAION Aesthetics 6+ made of tuples (captions, 512x512 images, and depth maps from DPT-BEiT-L-512). The LR images are generated by applying a lightweight BSR image-degradation method to the HR images.
98
+
99
+ ### Fine-tuning
100
+
101
+ The fine-tuning process comprises two stages. In the first stage, we train an autoencoder to generate a lower-dimensional, perceptually equivalent data representation. Subsequently, we fine-tune the diffusion model using the frozen autoencoder.
102
+
103
+ LDM3D-SR reuses the autoencoder developed for [LDM3D-4c](https://huggingface.co/Intel/ldm3d-4c) to encode low-resolution (LR) images into a 64x64x4 latent space. The diffusion model is a U-Net adapted to take an 8-channel input, which enables conditioning on the LR latent by concatenating it with the high-resolution (HR) latent during training and with noise during inference. Text conditioning uses cross-attention with a CLIP text encoder.
104
+
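The 8-channel conditioning described above can be illustrated with a minimal shape-level sketch. It is not the model's actual code: the helper name `build_unet_input`, the channels-first layout, and the order of the two halves in the concatenation are assumptions made for illustration, and random arrays stand in for real latents.

```python
import numpy as np

# Channels-first view of the 64x64x4 latent (layout is an assumption).
LATENT_SHAPE = (4, 64, 64)


def build_unet_input(noisy_or_hr_latent, lr_latent):
    """Concatenate two 4-channel latents into one 8-channel U-Net input.

    Hypothetical helper: during training the first argument would be the
    (noised) HR latent, during inference it would be pure noise.
    """
    if noisy_or_hr_latent.shape != LATENT_SHAPE or lr_latent.shape != LATENT_SHAPE:
        raise ValueError(f"both latents must have shape {LATENT_SHAPE}")
    # Stack along the channel axis: 4 + 4 = 8 channels.
    return np.concatenate([noisy_or_hr_latent, lr_latent], axis=0)


rng = np.random.default_rng(0)
lr_latent = rng.standard_normal(LATENT_SHAPE).astype(np.float32)  # stand-in for the encoded LR input
noise = rng.standard_normal(LATENT_SHAPE).astype(np.float32)      # inference-time starting noise

unet_input = build_unet_input(noise, lr_latent)
print(unet_input.shape)
```

Concatenating along the channel axis keeps the spatial resolution of the latent unchanged, so only the first convolution of the U-Net needs to accept the wider input.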
105
+ ## Evaluation results
106
+
107
+ The table below shows quantitative results for upscaling from 128 x 128 to 512 x 512, evaluated on 2,243 samples from ImageNet-Val. We explore three methods for generating LR depth maps: performing depth estimation on the LR image (LDM3D-SR-D), using the original HR depth map for LR conditioning (LDM3D-SR-O), and applying bicubic degradation to the depth map (LDM3D-SR-B).
108
+
109
+ |Method |FID ↓ |IS ↑ |PSNR ↑ |SSIM ↑ |Depth MARE ↓ |
110
+ |-------------------|------|-----------|-----------|----------|-------------|
111
+ |Regression, bicubic|24.686|60.135±4.16|26.424±3.98|0.716±0.13|0.0153±0.0189|
112
+ |SDx4[29] |15.865|61.103±3.48|24.528±3.63|0.631±0.15|N/A |
113
+ |LDMx4[30] |15.245|60.060±3.88|25.511±3.94|0.686±0.16|N/A |
114
+ |SD-superres[2] |15.254|59.789±3.53|23.878±3.28|0.642±0.15|N/A |
115
+ |LDM3D-SR-D |15.522|59.736±3.37|24.113±3.54|0.659±0.16|0.0753±0.0734|
116
+ |LDM3D-SR-O |14.793|60.260±3.53|24.498±3.59|0.665±0.16|0.0530±0.0496|
117
+ |LDM3D-SR-B |14.705|60.371±3.56|24.479±3.58|0.665±0.48|0.0537±0.0506|
118
+
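As a rough illustration of the bicubic degradation used for the LDM3D-SR-B variant, the sketch below downsamples a 512 x 512 depth map to 128 x 128 with Pillow. The helper name `bicubic_downsample_depth`, the float32 depth representation, and the constant test map are assumptions; the paper's exact preprocessing may differ.

```python
import numpy as np
from PIL import Image


def bicubic_downsample_depth(depth, size=(128, 128)):
    """Downsample a 2D depth map with bicubic interpolation (hypothetical helper)."""
    # A float32 2D array becomes a mode "F" image, which supports bicubic resizing.
    depth_image = Image.fromarray(np.asarray(depth, dtype=np.float32))
    lr_image = depth_image.resize(size, resample=Image.BICUBIC)  # size is (width, height)
    return np.asarray(lr_image, dtype=np.float32)


# Stand-in HR depth map; a real one would come from a depth estimator.
hr_depth = np.full((512, 512), 0.5, dtype=np.float32)
lr_depth = bicubic_downsample_depth(hr_depth)
print(lr_depth.shape)
```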
119
+ The results above are reported in Table 3 of the [LDM3D-VR paper](https://arxiv.org/pdf/2311.03226.pdf).
120
+
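For readers unfamiliar with two of the metrics in the table, the sketch below shows common textbook definitions of PSNR and of mean absolute relative error (MARE) for depth. These are assumptions about the general form of the metrics; the function names, the `max_value` default, and the `eps` guard are illustrative, and the paper's exact normalization may differ.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def depth_mare(reference_depth, estimated_depth, eps=1e-8):
    """Mean absolute relative error between two depth maps of equal shape."""
    reference_depth = np.asarray(reference_depth, dtype=np.float64)
    estimated_depth = np.asarray(estimated_depth, dtype=np.float64)
    # eps avoids division by zero where the reference depth is 0.
    relative_error = np.abs(estimated_depth - reference_depth) / (np.abs(reference_depth) + eps)
    return float(np.mean(relative_error))


print(psnr(np.zeros((2, 2)), np.full((2, 2), 255.0)))
print(depth_mare([2.0, 4.0], [1.0, 5.0]))
```

Higher PSNR and lower MARE indicate closer agreement with the reference, matching the arrows in the table header.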
121
+ ## Ethical Considerations and Limitations
122
+
123
+ For image generation, the [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion-v1-4#limitations) limitations and biases apply. For depth map generation, a first limitation is that we use DPT-large to produce the ground truth; hence, the limitations and biases of [DPT](https://huggingface.co/Intel/dpt-large) also apply.
124
+
125
+ ## Caveats and Recommendations
126
+
127
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
128
+
129
+ Here are a couple of useful links to learn more about Intel's AI software:
130
+ * [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch)
131
+ * [Intel Neural Compressor](https://github.com/intel/neural-compressor)
132
+
133
+ ## Disclaimer
134
 
135
+ The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
136
 
137
  ### BibTeX entry and citation info
138
+ ```bibtex
139
  @misc{stan2023ldm3dvr,
140
  title={LDM3D-VR: Latent Diffusion Model for 3D VR},
141
  author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
 
143
  eprint={2311.03226},
144
  archivePrefix={arXiv},
145
  primaryClass={cs.CV}
146
+ }
147
+ ```