I2VGen-XL

I2VGen-XL was introduced in I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou.

The abstract from the paper is:

Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video’s details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at this https URL.

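The original codebase can be found at https://github.com/ali-vilab/i2vgen-xl. The model checkpoints are hosted on the Hugging Face Hub under ali-vilab (for example, ali-vilab/i2vgen-xl, which the example below loads).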

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines. To learn more about reducing the memory usage of this pipeline, refer to the “Reduce memory usage” section.

Sample output with I2VGenXL:

[Sample video not shown. Prompt: "masterpiece, bestquality, sunset."; image alt text: "library".]

Notes

  • I2VGenXL always uses a clip_skip value of 1. This means it leverages the penultimate layer representations from the text encoder of CLIP.
  • It can generate videos of quality that is often on par with Stable Video Diffusion (SVD).
  • Unlike SVD, it additionally accepts text prompts as inputs.
  • It can generate higher-resolution videos (the pipeline defaults to 1280×704).
  • When using the DDIMScheduler (the default for this pipeline), fewer than 50 inference steps leads to poor results.
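
The clip_skip note above can be pictured with a small sketch. It assumes the text encoder exposes its per-layer hidden states as a sequence ordered from the first to the last layer; select_hidden_state is a hypothetical helper, not part of Diffusers, and the real pipeline may apply further processing (such as a final layer norm) to the selected state.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def select_hidden_state(hidden_states: Sequence[T], clip_skip: Optional[int] = None) -> T:
    """Pick the layer output used for prompt embeddings.

    hidden_states is assumed to be ordered from first to last layer.
    clip_skip=None (or 0) selects the final layer; clip_skip=1 selects the
    penultimate layer, which is the behaviour the note describes.
    """
    skip = clip_skip or 0
    if skip < 0 or skip >= len(hidden_states):
        raise ValueError("clip_skip must be between 0 and len(hidden_states) - 1")
    return hidden_states[-(skip + 1)]


# Stand-in labels instead of real tensors, to make the indexing visible.
layers = ["embeddings", "layer_1", "layer_2", "layer_3"]
print(select_hidden_state(layers, clip_skip=1))  # layer_2
```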

I2VGenXLPipeline

class diffusers.I2VGenXLPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer image_encoder: CLIPVisionModelWithProjection feature_extractor: CLIPImageProcessor unet: I2VGenXLUNet scheduler: DDIMScheduler )

Parameters

  • vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
  • text_encoder (CLIPTextModel) — Frozen text-encoder (clip-vit-large-patch14).
  • tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
  • unet (I2VGenXLUNet) — An I2VGenXLUNet to denoise the encoded video latents.
  • scheduler (DDIMScheduler) — A scheduler to be used in combination with unet to denoise the encoded video latents.

Pipeline for image-to-video generation as proposed in I2VGenXL.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

( prompt: Union = None image: Union = None height: Optional = 704 width: Optional = 1280 target_fps: Optional = 16 num_frames: int = 16 num_inference_steps: int = 50 guidance_scale: float = 9.0 negative_prompt: Union = None eta: float = 0.0 num_videos_per_prompt: Optional = 1 decode_chunk_size: Optional = 1 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None output_type: Optional = 'pil' return_dict: bool = True cross_attention_kwargs: Optional = None clip_skip: Optional = 1 ) pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide video generation. If not defined, you need to pass prompt_embeds.
  • image (PIL.Image.Image or List[PIL.Image.Image] or torch.FloatTensor) — Image or images to guide video generation. If you provide a tensor, it needs to be compatible with CLIPImageProcessor.
  • height (int, optional, defaults to 704) — The height in pixels of the generated video.
  • width (int, optional, defaults to 1280) — The width in pixels of the generated video.
  • target_fps (int, optional, defaults to 16) — Frames per second. The rate at which the generated frames are exported to a video after generation. This is also used as a “micro-condition” during generation.
  • num_frames (int, optional, defaults to 16) — The number of video frames to generate.
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps.
  • guidance_scale (float, optional, defaults to 9.0) — A higher guidance scale value encourages the model to generate videos closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
  • negative_prompt (str or List[str], optional) — The prompt or prompts describing what to exclude from the generated video. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
  • eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
  • decode_chunk_size (int, optional, defaults to 1) — The number of frames to decode at a time. The higher the chunk size, the higher the temporal consistency between frames, but also the higher the memory consumption. With the default of 1 shown in the signature, frames are decoded one at a time; increase decode_chunk_size when memory allows, and keep it low to reduce memory usage.
  • generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated video frames. Choose between PIL.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return an I2VGenXLPipelineOutput instead of a plain tuple.
  • cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
  • clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
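
The effect of guidance_scale can be sketched with the standard classifier-free guidance combination. This is a minimal illustration on plain Python lists rather than tensors; apply_guidance is a hypothetical helper, and the pipeline's internal implementation may differ in detail.

```python
from typing import List


def apply_guidance(noise_uncond: List[float], noise_cond: List[float], guidance_scale: float) -> List[float]:
    """Combine unconditional and prompt-conditioned noise predictions.

    guided = uncond + guidance_scale * (cond - uncond)
    With guidance_scale == 1 the result equals the conditional prediction,
    so guidance only has an effect for values above 1.
    """
    if len(noise_uncond) != len(noise_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.0, 1.0]
cond = [1.0, 3.0]
print(apply_guidance(uncond, cond, 1.0))  # [1.0, 3.0]
print(apply_guidance(uncond, cond, 9.0))  # [9.0, 19.0]
```

Larger values push the prediction further along the direction from the unconditional to the conditional output, which is why very high scales can cost image quality.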

Returns

pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput or tuple

If return_dict is True, pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for image-to-video generation with I2VGenXLPipeline.
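
The role of decode_chunk_size can be pictured with a small sketch. decode_in_chunks and the stand-in fake_decode are hypothetical and operate on plain lists; the sketch only shows that latent frames are decoded a few at a time and the results concatenated, not how the pipeline's VAE is actually invoked.

```python
from typing import Callable, List, Sequence


def decode_in_chunks(
    latent_frames: Sequence[float],
    decode_fn: Callable[[Sequence[float]], List[float]],
    decode_chunk_size: int = 1,
) -> List[float]:
    """Decode frames in groups of decode_chunk_size and join the results."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")
    decoded: List[float] = []
    for start in range(0, len(latent_frames), decode_chunk_size):
        chunk = latent_frames[start:start + decode_chunk_size]
        decoded.extend(decode_fn(chunk))  # only one chunk is processed at a time
    return decoded


calls = []


def fake_decode(chunk):
    # Stand-in for a VAE decode call; records the chunk sizes it was given.
    calls.append(len(chunk))
    return [x * 2 for x in chunk]


frames = decode_in_chunks(list(range(16)), fake_decode, decode_chunk_size=4)
print(len(frames), calls)  # 16 [4, 4, 4, 4]
```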

Examples:

>>> import torch
>>> from diffusers import I2VGenXLPipeline
>>> from diffusers.utils import export_to_gif, load_image

>>> pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
>>> pipeline.enable_model_cpu_offload()

>>> image_url = "https://github.com/ali-vilab/i2vgen-xl/blob/main/data/test_images/img_0009.png?raw=true"
>>> image = load_image(image_url).convert("RGB")

>>> prompt = "Papers were floating in the air on a table in the library"
>>> negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
>>> generator = torch.manual_seed(8888)

>>> frames = pipeline(
...     prompt=prompt,
...     image=image,
...     num_inference_steps=50,
...     negative_prompt=negative_prompt,
...     guidance_scale=9.0,
...     generator=generator
... ).frames[0]
>>> video_path = export_to_gif(frames, "i2v.gif")

disable_freeu

( )

Disables the FreeU mechanism if enabled.

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.

enable_freeu

( s1: float s2: float b1: float b2: float )

Parameters

  • s1 (float) — Scaling factor for stage 1 to attenuate the contributions of the skip features. This is done to mitigate “oversmoothing effect” in the enhanced denoising process.
  • s2 (float) — Scaling factor for stage 2 to attenuate the contributions of the skip features. This is done to mitigate “oversmoothing effect” in the enhanced denoising process.
  • b1 (float) — Scaling factor for stage 1 to amplify the contributions of backbone features.
  • b2 (float) — Scaling factor for stage 2 to amplify the contributions of backbone features.

Enables the FreeU mechanism as in https://arxiv.org/abs/2309.11497.

The suffixes after the scaling factors represent the stages where they are being applied.

Please refer to the official repository for combinations of the values that are known to work well for different pipelines such as Stable Diffusion v1, v2, and Stable Diffusion XL.
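
A call might look like the sketch below. The numeric values are placeholders chosen only for illustration and are not tuned recommendations for I2VGen-XL. apply_freeu and FakePipeline are hypothetical; only enable_freeu and disable_freeu come from this page, and the stand-in class exists so the sketch runs without model weights.

```python
def apply_freeu(pipeline, s1: float, s2: float, b1: float, b2: float, enabled: bool = True) -> None:
    """Enable FreeU with the given factors, or disable it again."""
    if enabled:
        # s1/s2 attenuate the skip features, b1/b2 amplify the backbone features.
        pipeline.enable_freeu(s1=s1, s2=s2, b1=b1, b2=b2)
    else:
        pipeline.disable_freeu()


class FakePipeline:
    """Stand-in exposing the two FreeU methods documented on this page."""

    def __init__(self):
        self.freeu = None

    def enable_freeu(self, s1, s2, b1, b2):
        self.freeu = {"s1": s1, "s2": s2, "b1": b1, "b2": b2}

    def disable_freeu(self):
        self.freeu = None


pipe = FakePipeline()
apply_freeu(pipe, s1=0.9, s2=0.2, b1=1.2, b2=1.4)  # placeholder values
print(pipe.freeu)
apply_freeu(pipe, s1=0.9, s2=0.2, b1=1.2, b2=1.4, enabled=False)
print(pipe.freeu)  # None
```

With a real I2VGenXLPipeline, the same helper would be called on the loaded pipeline object in place of FakePipeline.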

enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

enable_vae_tiling

( )

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
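
The four VAE toggles documented above can be wrapped in one helper, as in the sketch below. set_vae_memory_options and RecordingPipeline are hypothetical names; the stand-in only records which methods were called so the example runs without loading a model.

```python
def set_vae_memory_options(pipeline, slicing: bool = True, tiling: bool = False) -> None:
    """Turn sliced and tiled VAE decoding on or off on a pipeline."""
    if slicing:
        pipeline.enable_vae_slicing()
    else:
        pipeline.disable_vae_slicing()
    if tiling:
        pipeline.enable_vae_tiling()
    else:
        pipeline.disable_vae_tiling()


class RecordingPipeline:
    """Stand-in that records method calls instead of acting on a VAE."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown method is recorded by name when it is called.
        return lambda *args, **kwargs: self.calls.append(name)


demo = RecordingPipeline()
set_vae_memory_options(demo, slicing=True, tiling=False)
print(demo.calls)  # ['enable_vae_slicing', 'disable_vae_tiling']
```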

encode_prompt

( prompt device num_videos_per_prompt negative_prompt = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None lora_scale: Optional = None clip_skip: Optional = None )

Parameters

  • prompt (str or List[str], optional) — The prompt to be encoded.
  • device (torch.device) — The torch device.
  • num_videos_per_prompt (int) — The number of videos that should be generated per prompt.
  • do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
  • negative_prompt (str or List[str], optional) — The prompt or prompts describing what to exclude from the generated video. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (that is, when guidance_scale is not greater than 1).
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
  • clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.

I2VGenXLPipelineOutput

class diffusers.pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput

( frames: Union )

Parameters

  • frames (List[np.ndarray] or torch.FloatTensor) — List of denoised frames (essentially images) as NumPy arrays of shape (height, width, num_channels) or as a torch tensor. The length of the list denotes the video length (the number of frames).

Output class for image-to-video pipeline.
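
The two return styles can be handled uniformly, as in the sketch below. It assumes, as the `.frames[0]` indexing in the example above suggests, that frames is indexed by video first and by frame second. first_video_frames is a hypothetical helper, and the SimpleNamespace and tuple values are stand-ins for real pipeline results.

```python
from types import SimpleNamespace
from typing import Any, List


def first_video_frames(result: Any) -> List[Any]:
    """Return the frames of the first generated video.

    Handles both return styles: an output object with a .frames attribute
    (return_dict=True) and a plain tuple whose first element holds the
    frames (return_dict=False).
    """
    frames = result[0] if isinstance(result, tuple) else result.frames
    return list(frames[0])  # index 0 selects the first video in the batch


# Stand-ins for real pipeline results: one video made of three frame labels.
as_object = SimpleNamespace(frames=[["f0", "f1", "f2"]])
as_tuple = ([["f0", "f1", "f2"]],)
print(first_video_frames(as_object) == first_video_frames(as_tuple))  # True
```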