Latte
Latte: Latent Diffusion Transformer for Video Generation from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University.
The abstract from the paper is:
We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.
Highlights: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks - FaceForensics, SkyTimelapse, UCF101 and Taichi-HD. To prepare and download the datasets for evaluation, please refer to this https URL.
This pipeline was contributed by maxin-cn. The original codebase can be found here. The original weights can be found under hf.co/maxin-cn.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
Inference
Use torch.compile to reduce the inference latency.
First, load the pipeline:
import torch
from diffusers import LattePipeline
pipeline = LattePipeline.from_pretrained(
"maxin-cn/Latte-1", torch_dtype=torch.float16
).to("cuda")
Then change the memory layout of the pipeline's transformer and vae components to torch.channels_last:
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
Finally, compile the components and run inference:
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)
video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]
The benchmark results on an 80GB A100 machine are:
Without torch.compile(): Average inference time: 16.246 seconds.
With torch.compile(): Average inference time: 14.573 seconds.
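The averages above can be reproduced with a small timing harness. The following is a minimal sketch, not the script used for the reported benchmark: `average_latency` and `relative_speedup` are hypothetical helper names, and the warmup and run counts are assumptions. The last lines only redo the arithmetic on the two reported numbers.

```python
import time


def average_latency(fn, warmup=1, runs=3):
    """Return the mean wall-clock time of fn() in seconds.

    Warmup calls are executed first and excluded from the average, so that
    one-off costs (for example compilation on the first call) are not counted.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


def relative_speedup(baseline_s, optimized_s):
    """Fraction of the baseline latency saved by the optimized run."""
    return (baseline_s - optimized_s) / baseline_s


# Arithmetic on the two averages reported above (A100 80GB).
saved = relative_speedup(16.246, 14.573)
print(f"Latency reduction with torch.compile: {saved:.1%}")
```

When timing a GPU pipeline, the callable passed as `fn` should finish with `torch.cuda.synchronize()`, because CUDA kernels are launched asynchronously and the timer would otherwise stop too early.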
LattePipeline
class diffusers.LattePipeline(tokenizer: T5Tokenizer, text_encoder: T5EncoderModel, vae: AutoencoderKL, transformer: LatteTransformer3DModel, scheduler: KarrasDiffusionSchedulers)
Parameters
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (T5EncoderModel) — Frozen text encoder. Latte uses T5, specifically the t5-v1_1-xxl variant.
- tokenizer (T5Tokenizer) — Tokenizer of class T5Tokenizer.
- transformer (LatteTransformer3DModel) — A text-conditioned LatteTransformer3DModel to denoise the encoded video latents.
- scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents.
Pipeline for text-to-video generation using Latte.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
__call__
(
    prompt: typing.Union[str, typing.List[str]] = None,
    negative_prompt: str = '',
    num_inference_steps: int = 50,
    timesteps: typing.Optional[typing.List[int]] = None,
    guidance_scale: float = 7.5,
    num_images_per_prompt: int = 1,
    video_length: int = 16,
    height: int = 512,
    width: int = 512,
    eta: float = 0.0,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.FloatTensor] = None,
    prompt_embeds: typing.Optional[torch.FloatTensor] = None,
    negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None,
    output_type: str = 'pil',
    return_dict: bool = True,
    callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None,
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'],
    clean_caption: bool = True,
    mask_feature: bool = True,
    enable_temporal_attentions: bool = True,
    decode_chunk_size: typing.Optional[int] = None
) → LattePipelineOutput or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds instead.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
- timesteps (List[int], optional) — Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order.
- guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate videos that are closely linked to the text prompt, usually at the expense of lower video quality.
- video_length (int, optional, defaults to 16) — The number of video frames that are generated. Defaults to 16 frames, which at 8 frames per second amounts to 2 seconds of video.
- num_images_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
- height (int, optional, defaults to 512) — The height in pixels of the generated video.
- width (int, optional, defaults to 512) — The width in pixels of the generated video.
- eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler; it is ignored for other schedulers.
- generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
- prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For Latte this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a LattePipelineOutput instead of a plain tuple.
- callback_on_step_end (Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks, optional) — A callback function or a list of callback functions to be called at the end of each denoising step.
- callback_on_step_end_tensor_inputs (List[str], optional) — A list of tensor inputs that should be passed to the callback function. If not defined, all tensor inputs will be passed.
- clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
- mask_feature (bool, defaults to True) — If set to True, the text embeddings will be masked.
- enable_temporal_attentions (bool, optional, defaults to True) — Whether to enable temporal attentions.
- decode_chunk_size (int, optional) — The number of frames to decode at a time. A higher chunk size leads to better temporal consistency at the expense of more memory usage. By default, the decoder decodes all frames at once for maximal quality. For lower memory usage, reduce decode_chunk_size.
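As an illustration of callback_on_step_end, here is a minimal sketch of a step callback that records the latents shape at each denoising step. It assumes LattePipeline follows the convention used by other diffusers pipelines, where the callback is invoked as callback(pipeline, step_index, timestep, callback_kwargs) and must return the (possibly modified) callback_kwargs dict. This convention is not spelled out in the parameter list above, so treat it as an assumption. The names `step_log` and `log_latents` are hypothetical.

```python
# Collected (step_index, timestep, latents_shape) tuples, one per denoising step.
step_log = []


def log_latents(pipeline, step_index, timestep, callback_kwargs):
    """Record the shape of the latents after each denoising step.

    Assumption: callback_kwargs holds the tensors named in
    callback_on_step_end_tensor_inputs (here only "latents").
    """
    latents = callback_kwargs["latents"]
    step_log.append((step_index, int(timestep), tuple(latents.shape)))
    # Return the dict unchanged so the pipeline keeps using the same latents.
    return callback_kwargs
```

Under the stated assumption, the callback would be passed as `pipe(prompt, callback_on_step_end=log_latents, callback_on_step_end_tensor_inputs=["latents"])`.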
Returns
LattePipelineOutput or tuple
If return_dict is True, LattePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated video frames.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import LattePipeline
>>> from diffusers.utils import export_to_gif
>>> # You can replace the checkpoint id with "maxin-cn/Latte-1" too.
>>> pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16)
>>> # Enable memory optimizations.
>>> pipe.enable_model_cpu_offload()
>>> prompt = "A small cactus with a happy face in the Sahara desert."
>>> videos = pipe(prompt).frames[0]
>>> export_to_gif(videos, "latte.gif")
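A variant of the example above that spells out the documented call arguments and fixes the random seed is sketched below. The values in `CALL_KWARGS` mirror the defaults in the `__call__` signature, except `decode_chunk_size=8`, which is an arbitrary illustrative choice to lower decoder memory usage. `generate_latte_gif` and `CALL_KWARGS` are hypothetical names, and the function needs `torch`, `diffusers`, and a downloaded checkpoint to run.

```python
# Explicit call arguments; all keys are parameters documented for __call__.
CALL_KWARGS = {
    "negative_prompt": "",        # Latte expects an empty negative prompt
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
    "video_length": 16,           # 16 frames, about 2 seconds at 8 fps
    "height": 512,
    "width": 512,
    "decode_chunk_size": 8,       # assumption: decode 8 frames at a time
}


def generate_latte_gif(prompt, seed=0, output_path="latte_seeded.gif"):
    """Generate one video for `prompt` with a fixed seed and save it as a GIF."""
    # Heavy imports are kept local so the module can be imported without a GPU.
    import torch
    from diffusers import LattePipeline
    from diffusers.utils import export_to_gif

    pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16)
    pipe.enable_model_cpu_offload()

    # A seeded generator makes repeated runs produce the same latents.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    frames = pipe(prompt, generator=generator, **CALL_KWARGS).frames[0]
    export_to_gif(frames, output_path)
    return output_path
```

Calling `generate_latte_gif("A small cactus with a happy face in the Sahara desert.", seed=42)` twice with the same seed should give the same video, since the initial latents are sampled from the supplied generator.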
encode_prompt
(
    prompt: typing.Union[str, typing.List[str]],
    do_classifier_free_guidance: bool = True,
    negative_prompt: str = '',
    num_images_per_prompt: int = 1,
    device: typing.Optional[torch.device] = None,
    prompt_embeds: typing.Optional[torch.FloatTensor] = None,
    negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None,
    clean_caption: bool = False,
    mask_feature: bool = True,
    dtype = None
)
Parameters
- prompt (str or List[str], optional) — The prompt to be encoded.
- negative_prompt (str or List[str], optional) — The prompt not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For Latte, this should be "".
- do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
- num_images_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
- device (torch.device, optional) — The torch device to place the resulting embeddings on.
- prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For Latte, it should be the embeddings of the "" string.
- clean_caption (bool, defaults to False) — If True, the function will preprocess and clean the provided caption before encoding.
- mask_feature (bool, defaults to True) — If True, the function will mask the text embeddings.
Encodes the prompt into text encoder hidden states.