Latte

Latte: Latent Diffusion Transformer for Video Generation from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University.

The abstract from the paper is:

We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

Highlights: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks - FaceForensics, SkyTimelapse, UCF101 and Taichi-HD. To prepare and download the datasets for evaluation, please refer to this https URL.

This pipeline was contributed by maxin-cn. The original codebase can be found here. The original weights can be found under hf.co/maxin-cn.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

Inference

Use torch.compile to reduce the inference latency.

First, load the pipeline:

import torch
from diffusers import LattePipeline

pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1", torch_dtype=torch.float16
).to("cuda")

Then change the memory layout of the pipeline's transformer and vae components to torch.channels_last:

pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)

Finally, compile the components and run inference:

pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]

The benchmark results on an 80GB A100 machine are:

Without torch.compile(): Average inference time: 16.246 seconds.
With torch.compile(): Average inference time: 14.573 seconds.
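
As a rough reading of the two figures above, the relative gain can be worked out directly. The snippet below is only arithmetic on the reported averages; the variable names are illustrative and are not part of the library.

```python
# Average inference times reported above (seconds, 80GB A100).
baseline_s = 16.246  # without torch.compile()
compiled_s = 14.573  # with torch.compile()

# Ratio of the two latencies and the relative reduction in percent.
speedup = baseline_s / compiled_s
reduction_pct = (baseline_s - compiled_s) / baseline_s * 100

print(f"speedup: {speedup:.2f}x, latency reduction: {reduction_pct:.1f}%")
```

On these numbers the compiled pipeline is roughly 10% faster per run. The first call after compilation also pays a one-time compilation cost, which an average over warmed-up runs would not show.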

Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
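
To get a feel for the saving, here is a minimal back-of-the-envelope sketch. It only counts the bytes needed to store the weights; the helper name `weight_memory_gib` and the parameter count are hypothetical and chosen for illustration, not taken from the Latte checkpoint.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Hypothetical parameter count, chosen only for illustration.
num_params = 1_000_000_000

fp16_gib = weight_memory_gib(num_params, 16)
int8_gib = weight_memory_gib(num_params, 8)
print(f"fp16: {fp16_gib:.2f} GiB, int8: {int8_gib:.2f} GiB")
```

Real savings are usually somewhat smaller than this 2x estimate, because quantization backends store extra scaling metadata and may keep some layers in higher precision, and activations are not affected.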

Refer to the Quantization overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized LattePipeline for inference with bitsandbytes.

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LatteTransformer3DModel, LattePipeline
from diffusers.utils import export_to_gif
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "maxin-cn/Latte-1",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = LatteTransformer3DModel.from_pretrained(
    "maxin-cn/Latte-1",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A small cactus with a happy face in the Sahara desert."
video = pipeline(prompt).frames[0]
export_to_gif(video, "latte.gif")

LattePipeline

class diffusers.LattePipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel vae: AutoencoderKL transformer: LatteTransformer3DModel scheduler: KarrasDiffusionSchedulers )

Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
text_encoder (T5EncoderModel) — Frozen text-encoder. Latte uses T5, specifically the t5-v1_1-xxl variant.
tokenizer (T5Tokenizer) — Tokenizer of class T5Tokenizer.
transformer (LatteTransformer3DModel) — A text conditioned LatteTransformer3DModel to denoise the encoded video latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents.

Pipeline for text-to-video generation using Latte.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

__call__

( prompt: str | list[str] = None negative_prompt: str = '' num_inference_steps: int = 50 timesteps: list[int] | None = None guidance_scale: float = 7.5 num_images_per_prompt: int = 1 video_length: int = 16 height: int = 512 width: int = 512 eta: float = 0.0 generator: torch._C.Generator | list[torch._C.Generator] | None = None latents: torch.FloatTensor | None = None prompt_embeds: torch.FloatTensor | None = None negative_prompt_embeds: torch.FloatTensor | None = None output_type: str = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: list = ['latents'] clean_caption: bool = True mask_feature: bool = True enable_temporal_attentions: bool = True decode_chunk_size: int = 14 ) → LattePipelineOutput or tuple

Parameters

prompt (str or list[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds instead.
negative_prompt (str or list[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
timesteps (list[int], optional) — Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order.
guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate videos that are closely linked to the text prompt, usually at the expense of lower video quality.
video_length (int, optional, defaults to 16) — The number of video frames that are generated. Defaults to 16 frames, which at 8 frames per second amounts to 2 seconds of video.
num_images_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
height (int, optional, defaults to 512) — The height in pixels of the generated video.
width (int, optional, defaults to 512) — The width in pixels of the generated video.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to schedulers.DDIMScheduler, will be ignored for others.
generator (torch.Generator or list[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For Latte this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a LattePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable[[int, int], None], PipelineCallback, MultiPipelineCallbacks, optional) — A callback function or a list of callback functions to be called at the end of each denoising step.
callback_on_step_end_tensor_inputs (list[str], optional) — A list of tensor inputs that should be passed to the callback function. If not defined, all tensor inputs will be passed.
clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
mask_feature (bool, defaults to True) — If set to True, the text embeddings will be masked.
enable_temporal_attentions (bool, optional, defaults to True) — Whether to enable temporal attentions.
decode_chunk_size (int, optional) — The number of frames to decode at a time. Higher chunk size leads to better temporal consistency at the expense of more memory usage. By default, the decoder decodes all frames at once for maximal quality. For lower memory usage, reduce decode_chunk_size.
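
The role of guidance_scale can be illustrated with the standard classifier-free guidance combination. This is a minimal sketch of the usual formula, not a copy of the pipeline's internal code; `apply_guidance` is a hypothetical helper and the inputs are toy scalars standing in for the model's noise predictions.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Uses only elementwise arithmetic, so it works on floats,
    NumPy arrays, or torch tensors alike.
    """
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)


# Toy scalar values standing in for model outputs (illustrative only).
uncond, text = 0.2, 0.5

# A scale of 1 simply returns the text-conditioned prediction.
print(apply_guidance(uncond, text, 1.0))
# Larger scales extrapolate past it, away from the unconditional prediction.
print(apply_guidance(uncond, text, 7.5))
```

This is why guidance only has an effect for guidance_scale > 1: at exactly 1 the unconditional term cancels out.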

Returns

LattePipelineOutput or tuple

If return_dict is True, LattePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated video frames.

Function invoked when calling the pipeline for generation.
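
To make the decode_chunk_size parameter more concrete, the sketch below shows how a frame count could be split into consecutive chunks for decoding. It assumes frames are decoded in fixed-size groups with a shorter final group; `chunk_ranges` is a hypothetical helper for illustration and is not part of diffusers.

```python
def chunk_ranges(num_frames: int, decode_chunk_size: int) -> list[tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, one per decode call."""
    if decode_chunk_size <= 0:
        raise ValueError("decode_chunk_size must be positive")
    return [
        (start, min(start + decode_chunk_size, num_frames))
        for start in range(0, num_frames, decode_chunk_size)
    ]


# Defaults from the signature above: video_length=16, decode_chunk_size=14.
print(chunk_ranges(16, 14))  # [(0, 14), (14, 16)]
# A smaller chunk size means more decode calls, each holding fewer frames in memory.
print(chunk_ranges(16, 4))   # [(0, 4), (4, 8), (8, 12), (12, 16)]
```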

Examples:

>>> import torch
>>> from diffusers import LattePipeline
>>> from diffusers.utils import export_to_gif

>>> # Load the Latte-1 checkpoint in half precision.
>>> pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16)
>>> # Enable memory optimizations.
>>> pipe.enable_model_cpu_offload()

>>> prompt = "A small cactus with a happy face in the Sahara desert."
>>> videos = pipe(prompt).frames[0]
>>> export_to_gif(videos, "latte.gif")

encode_prompt

( prompt: str | list[str] do_classifier_free_guidance: bool = True negative_prompt: str = '' num_images_per_prompt: int = 1 device: torch.device | None = None prompt_embeds: torch.FloatTensor | None = None negative_prompt_embeds: torch.FloatTensor | None = None clean_caption: bool = False mask_feature: bool = True dtype = None )

Parameters

prompt (str or list[str], optional) — prompt to be encoded
negative_prompt (str or list[str], optional) — The prompt not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For Latte, this should be "".
do_classifier_free_guidance (bool, optional, defaults to True) — whether to use classifier free guidance or not
num_images_per_prompt (int, optional, defaults to 1) — number of videos that should be generated per prompt
device (torch.device, optional) — torch device to place the resulting embeddings on
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For Latte, it should be the embeddings of the "" string.
clean_caption (bool, defaults to False) — If True, the function will preprocess and clean the provided caption before encoding.
mask_feature (bool, defaults to True) — If True, the function will mask the text embeddings.

Encodes the prompt into text encoder hidden states.
