Question about the VAE's input/output channel dimensions

#3
by terryryu - opened

Hi, thank you for sharing this interesting work!

However, I have a question about the provided VAE's input and output channels.
In ldm3d-4c/vae/config.json, the VAE is configured with four input channels, which differs from the six channels described in the paper (RGB + RGB-like depth map). Ignoring this, I tried encoding my data with the provided VAE anyway, but ran into the following error.

File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 242, in encode
h = self.encoder(x)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/vae.py", line 111, in forward
sample = self.conv_in(sample)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [128, 4, 3, 3], expected input[1, 6, 512, 512] to have 4 channels, but got 6 channels instead

So I converted my depth map to an 8-bit image and merged it with the RGB image, but then I ran into yet another error:

File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 242, in encode
h = self.encoder(x)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/vae.py", line 140, in forward
sample = down_block(sample)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1214, in forward
hidden_states = resnet(hidden_states, temb=None)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/diffusers/models/resnet.py", line 597, in forward
hidden_states = self.norm1(hidden_states)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/home/terryryu/miniconda3/envs/p3dpp/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be a vector of size equal to the number of channels in input, but got weight of shape [128] and input of shape [128, 512, 512]

I guess I have misunderstood something? Also, did you use bicubic upsampling to scale the depth map resolution from 384 x 384 to 512 x 512?

Hi @terryryu !!

In ldm3d-4c/vae/config.json it is written that the VAE has four input channels, which is different from the expected six channels written in the paper (RGB + RGB-like depth map).

Indeed! For this version of ldm3d, which we call "ldm3d-4c", we mapped the depth to a single channel, so the input is a 4-channel tensor (3 for RGB and 1 for depth).
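To make the expected shape concrete, here is a minimal sketch (assuming the Intel/ldm3d-4c checkpoint on the Hub and a dummy RGBD tensor, just to show the dimensions):

import torch
from diffusers import AutoencoderKL

# dummy 4-channel RGBD input of shape (batch, 4, H, W): 3 RGB channels + 1 depth channel
vae = AutoencoderKL.from_pretrained("Intel/ldm3d-4c", subfolder="vae")
x = torch.randn(1, 4, 512, 512)
with torch.no_grad():
    latents = vae.encode(x).latent_dist.mode()
print(latents.shape)  # latents are at 1/8 the spatial resolution, i.e. 64 x 64 here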

Also, did you use bicubic upsampling to scale the depthmap resolution from 384 x 384 to 512 x 512?

We are using dpt-512 with a 512 input size, so the output is automatically at 512 resolution; no upsampling from 384 is needed.
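For reference, a rough sketch of that step, assuming "dpt-512" corresponds to a 512-input DPT checkpoint such as Intel/dpt-beit-large-512 (the checkpoint and file names here are assumptions, not necessarily the exact ones used for LDM3D):

import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

# assumed checkpoint: a DPT variant trained at 512 x 512 input resolution
processor = DPTImageProcessor.from_pretrained("Intel/dpt-beit-large-512")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-beit-large-512")

image = Image.open("rgb_512.png")  # hypothetical 512 x 512 RGB input
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    depth = model(**inputs).predicted_depth
# the predicted depth follows the input resolution, so a 512 input gives a 512 depth map
print(depth.shape)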

As for the error, I am not sure; it is a bit hard to answer from the traceback alone. Do you have a short snippet so I can try to reproduce it?

Best
Estelle

Thank you for your help, @estellea!

I've come up with a minimal example that encodes and decodes the RGBD output of the lemon example.

import torch
import cv2
import numpy as np
from einops import rearrange
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessorLDM3D

def load_images(rgb_path, depth_path):
    rgb_img = cv2.imread(rgb_path) / 255.
    depth_img = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED) # ensures 16-bit is preserved

    if depth_img.dtype != np.uint16:
        raise ValueError("Depth image is not 16-bit!")

    depth_img = depth_img / 65536.


    depth_img_expanded = np.expand_dims(depth_img, axis=-1)
    merged_img = np.concatenate([rgb_img, depth_img_expanded], axis=-1)

    return merged_img

with torch.no_grad():

    test_rgbd = load_images("/home/terryryu/Experiments/LDM3D/lemons_ldm3d_rgb.jpg", "/home/terryryu/Experiments/LDM3D/lemons_ldm3d_depth.png")

    vae = AutoencoderKL.from_pretrained("/home/terryryu/Weights/LDM3D/vae/", local_files_only=True, torch_dtype=torch.float16).cuda()
    vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
    processor = VaeImageProcessorLDM3D(vae_scale_factor=vae_scale_factor)
    
    # add a batch dimension and go channels-first: (h, w, c) -> (1, c, h, w)
    test_rgbd = rearrange(test_rgbd, "h w c -> 1 c h w")
    test_rgbd = torch.cuda.HalfTensor(test_rgbd)  # fp16 tensor on the GPU, matching the fp16 VAE
    latents = vae.encode(test_rgbd).latent_dist.mode()
    image = vae.decode(latents / vae.config.scaling_factor, return_dict=False)[0]
    output_type = "pil"
    do_denormalize = [True] * image.shape[0]
    rgb, depth = processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
    
    rgb[0].save("./minimal_test_rgb.png")
    depth[0].save("./minimal_test_depth.png")

Strangely, the color of the decoded result is broken. Maybe I've made a minor mistake somewhere...?

Attached images: Input RGB (lemons_ldm3d_rgb.jpg), Input Depth (lemons_ldm3d_depth.png), Output RGB (minimal_test_rgb.png), Output Depth (minimal_test_depth.png).

Best,
Ryu

I have the same problem. How can the ldm3d-4c VAE be used to reconstruct the RGB and depth images?

Before encoding, make sure to normalize both the image and the depth to the range of [-1, 1], by adding this to 'load_images':
rgb_img = 2. * rgb_img - 1.
depth_img = 2. * depth_img - 1.
Also, there is no need to divide the latents by any scaling factor during reconstruction, since the latent space is not scaled at this stage. However, when using the diffusion part of the pipeline, make sure to scale the latents before the diffusion and unscale them afterwards, before decoding, to get the desired results.
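To make that convention concrete, a minimal sketch (the checkpoint name is assumed and the UNet denoising loop is elided; the random tensor stands in for a normalized RGBD batch in [-1, 1]):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("Intel/ldm3d-4c", subfolder="vae")  # assumed checkpoint
rgbd = torch.randn(1, 4, 512, 512)  # placeholder for a normalized RGBD batch

with torch.no_grad():
    # scale when entering the diffusion latent space ...
    latents = vae.encode(rgbd).latent_dist.sample() * vae.config.scaling_factor
    # ... (the UNet denoising loop would operate on these scaled latents) ...
    # ... and unscale again before decoding back to RGB + depth
    image = vae.decode(latents / vae.config.scaling_factor, return_dict=False)[0]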
@terryryu, attaching an updated example:

import torch
import cv2
import numpy as np
from einops import rearrange
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessorLDM3D

def load_images(rgb_path, depth_path):
    rgb_img = cv2.imread(rgb_path) / 255.
    rgb_img = 2.*rgb_img - 1.  # normalize RGB to [-1, 1]
    depth_img = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED) # ensures 16-bit is preserved

    if depth_img.dtype != np.uint16:
        raise ValueError("Depth image is not 16-bit!")

    depth_img = depth_img / 65536.
    depth_img = 2.*depth_img - 1.  # normalize depth to [-1, 1]


    depth_img_expanded = np.expand_dims(depth_img, axis=-1)
    merged_img = np.concatenate([rgb_img, depth_img_expanded], axis=-1)

    return merged_img

with torch.no_grad():

    test_rgbd = load_images("/home/terryryu/Experiments/LDM3D/lemons_ldm3d_rgb.jpg", "/home/terryryu/Experiments/LDM3D/lemons_ldm3d_depth.png")

    vae = AutoencoderKL.from_pretrained("/home/terryryu/Weights/LDM3D/vae/", local_files_only=True, torch_dtype=torch.float16).cuda()
    vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
    processor = VaeImageProcessorLDM3D(vae_scale_factor=vae_scale_factor)
    
    test_rgbd = rearrange(test_rgbd, "h w c -> 1 c h w")
    test_rgbd = torch.cuda.HalfTensor(test_rgbd)
    latents = vae.encode(test_rgbd).latent_dist.mode()
    image = vae.decode(latents, return_dict=False)[0]  # no scaling factor needed for plain reconstruction
    output_type = "pil"
    do_denormalize = [True] * image.shape[0]
    rgb, depth = processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
    
    rgb[0].save("./minimal_test_rgb.png")
    depth[0].save("./minimal_test_depth.png")

Best,
Gabi
