HunyuanVideo
HunyuanVideo by Tencent.
Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at this https URL.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
Recommendations for inference:
- Both text encoders should be in
torch.float16
. - Transformer should be in
torch.bfloat16
. - VAE should be in
torch.float16
. num_frames
should be of the form4 * k + 1
, for example49
or129
.- For smaller resolution images, try lower values of
shift
(between2.0
to5.0
) in the Scheduler. For larger resolution images, try higher values (between7.0
and12.0
). The default value is7.0
for HunyuanVideo. - For more information about supported resolutions and other details, please refer to the original repository here.
HunyuanVideoPipeline
class diffusers.HunyuanVideoPipeline
< source >( text_encoder: LlamaModel tokenizer: LlamaTokenizerFast transformer: HunyuanVideoTransformer3DModel vae: AutoencoderKLHunyuanVideo scheduler: FlowMatchEulerDiscreteScheduler text_encoder_2: CLIPTextModel tokenizer_2: CLIPTokenizer )
Parameters
- text_encoder (
LlamaModel
) — Llava Llama3-8B. - tokenizer_2 (
LlamaTokenizer
) — Tokenizer from Llava Llama3-8B. - transformer (HunyuanVideoTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
- scheduler (FlowMatchEulerDiscreteScheduler) —
A scheduler to be used in combination with
transformer
to denoise the encoded image latents. - vae (AutoencoderKLHunyuanVideo) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
- text_encoder_2 (
CLIPTextModel
) — CLIP, specifically the clip-vit-large-patch14 variant. - tokenizer_2 (
CLIPTokenizer
) — Tokenizer of class CLIPTokenizer.
Pipeline for text-to-video generation using HunyuanVideo.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str]] = None height: int = 720 width: int = 1280 num_frames: int = 129 num_inference_steps: int = 50 sigmas: typing.List[float] = None guidance_scale: float = 6.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] prompt_template: typing.Dict[str, typing.Any] = {'template': '<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>', 'crop_start': 95} max_sequence_length: int = 256 ) → ~HunyuanVideoPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds
. instead. - prompt_2 (
str
orList[str]
, optional) — The prompt or prompts to be sent totokenizer_2
andtext_encoder_2
. If not defined,prompt
is will be used instead. - height (
int
, defaults to720
) — The height in pixels of the generated image. - width (
int
, defaults to1280
) — The width in pixels of the generated image. - num_frames (
int
, defaults to129
) — The number of frames in the generated video. - num_inference_steps (
int
, defaults to50
) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - sigmas (
List[float]
, optional) — Custom sigmas to use for the denoising process with schedulers which support asigmas
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. - guidance_scale (
float
, defaults to6.0
) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. Higher guidance scale encourages to generate images that are closely linked to the textprompt
, usually at the expense of lower image quality. Note that the only available HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and conditional latent is not applied. - num_videos_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - generator (
torch.Generator
orList[torch.Generator]
, optional) — Atorch.Generator
to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied randomgenerator
. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from theprompt
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose betweenPIL.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return aHunyuanVideoPipelineOutput
instead of a plain tuple. - clip_skip (
int
, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. - callback_on_step_end (
Callable
,PipelineCallback
,MultiPipelineCallbacks
, optional) — A function or a subclass ofPipelineCallback
orMultiPipelineCallbacks
that is called at the end of each denoising step during the inference. with the following arguments:callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You will only be able to include variables listed in the._callback_tensor_inputs
attribute of your pipeline class.
Returns
~HunyuanVideoPipelineOutput
or tuple
If return_dict
is True
, HunyuanVideoPipelineOutput
is returned, otherwise a tuple
is returned
where the first element is a list with the generated images and the second element is a list of bool
s
indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
>>> from diffusers.utils import export_to_video
>>> model_id = "tencent/HunyuanVideo"
>>> transformer = HunyuanVideoTransformer3DModel.from_pretrained(
... model_id, subfolder="transformer", torch_dtype=torch.bfloat16
... )
>>> pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)
>>> pipe.vae.enable_tiling()
>>> pipe.to("cuda")
>>> output = pipe(
... prompt="A cat walks on the grass, realistic",
... height=320,
... width=512,
... num_frames=61,
... num_inference_steps=30,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=15)
Disable sliced VAE decoding. If enable_vae_slicing
was previously enabled, this method will go back to
computing decoding in one step.
Disable tiled VAE decoding. If enable_vae_tiling
was previously enabled, this method will go back to
computing decoding in one step.
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
HunyuanVideoPipelineOutput
class diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput
< source >( frames: Tensor )
Parameters
- frames (
torch.Tensor
,np.ndarray
, or List[List[PIL.Image.Image]]) — List of video outputs - It can be a nested list of lengthbatch_size,
with each sub-list containing denoised PIL image sequences of lengthnum_frames.
It can also be a NumPy array or Torch tensor of shape(batch_size, num_frames, channels, height, width)
.
Output class for HunyuanVideo pipelines.