can this vae be used in stable video diffusion?

#17
by jiagaoxiang - opened

I am trying to run Stable Video Diffusion with the code below, but I get an all-black video (every frame is black). It seems to be caused by the fp16 format. After swapping in your VAE, I got this error: "RuntimeError: Input type (float) and bias type (c10::Half) should be the same". Any suggestions for how to fix this?

```python
import torch

from diffusers import StableVideoDiffusionPipeline, AutoencoderKL
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```

sdxl-vae-fp16-fix cannot be used with SVD, because SVD uses the Stable Diffusion 1/2 latent space (see code, paper), whereas sdxl-vae-fp16-fix uses the SDXL latent space, and the SD1/2 and SDXL latent spaces are not compatible.
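One concrete symptom of the incompatibility is the different `scaling_factor` each VAE family uses (the values below are copied from the public `config.json` files; this is only an illustration, not the whole story, since the learned latent distributions also differ):

```python
# scaling_factor values from the respective VAE configs.
SD_SCALING_FACTOR = 0.18215    # Stable Diffusion 1.x/2.x VAE (what SVD expects)
SDXL_SCALING_FACTOR = 0.13025  # sdxl-vae / sdxl-vae-fp16-fix

# Even before accounting for the learned differences between the latent
# spaces, decoding an SD latent with the SDXL factor mis-scales it by ~40%.
ratio = SD_SCALING_FACTOR / SDXL_SCALING_FACTOR
print(f"magnitude mismatch: {ratio:.3f}x")  # prints "magnitude mismatch: 1.398x"
```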

Hopefully the stabilityai/stable-video-diffusion-img2vid thread can find a solution to the issue you're encountering.
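In the meantime, a commonly suggested mitigation for black frames with fp16 pipelines (a sketch only, untested here; it needs a CUDA GPU and network access, and behavior may vary by diffusers version) is to keep the UNet in fp16 but upcast just the VAE to float32 for decoding, which avoids the half-precision overflow without swapping latent spaces:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Upcast only the VAE so decoding runs in float32. Depending on your
# diffusers version, you may also need to cast the latents to float32
# manually before decoding if the pipeline does not do it for you.
pipe.vae.to(torch.float32)

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"
).resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

This trades some VRAM for numerical stability; lowering `decode_chunk_size` can offset the extra memory used by the fp32 decode.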

Thank you for the info!
