HunyuanVideo-1.5
HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models.
You can find all the original HunyuanVideo-1.5 checkpoints under the Tencent organization.
Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.
The examples below use a checkpoint from hunyuanvideo-community because the weights are stored in a layout compatible with Diffusers.
The examples below demonstrate how to generate a video optimized for memory or inference speed.
Refer to the Reduce memory usage guide for more details about the various memory saving techniques.
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video
pipeline = HunyuanVideo15Pipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v",
torch_dtype=torch.bfloat16,
)
# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()
prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
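For inference speed, one common option is to compile the denoising transformer with torch.compile. The snippet below is a minimal sketch, not taken from the original guide; speedups and the need for tiling or offloading vary by GPU, PyTorch version, and attention backend.

import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideo15Pipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v",
torch_dtype=torch.bfloat16,
).to("cuda")

# Sketch: compile the transformer; the first call is slow while kernels compile,
# subsequent calls run faster
pipeline.transformer = torch.compile(pipeline.transformer)

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)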
Notes

- HunyuanVideo-1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently:
  - H100/H800: `_flash_3_hub` or `_flash_varlen_3`
  - A100/A800/RTX 4090: `flash_hub` or `flash_varlen`
  - Other GPUs: `sage_hub`
Refer to the Attention backends guide for more details about using a different backend.
pipe.transformer.set_attention_backend("flash_hub")  # or your preferred backend
- HunyuanVideo15Pipeline uses a guider and does not take a `guidance_scale` parameter at runtime.
You can check the default guider configuration using pipe.guider:
>>> pipe.guider
ClassifierFreeGuidance {
"_class_name": "ClassifierFreeGuidance",
"_diffusers_version": "0.36.0.dev0",
"enabled": true,
"guidance_rescale": 0.0,
"guidance_scale": 6.0,
"start": 0.0,
"stop": 1.0,
"use_original_formulation": false
}
State:
step: None
num_inference_steps: None
timestep: None
count_prepared: 0
enabled: True
num_conditions: 2

To update the guider configuration, you can run `pipe.guider = pipe.guider.new(...)`:
pipe.guider = pipe.guider.new(guidance_scale=5.0)

Read more on Guider here.
HunyuanVideo15Pipeline
class diffusers.HunyuanVideo15Pipeline
< source >( text_encoder: Qwen2_5_VLTextModel tokenizer: Qwen2Tokenizer transformer: HunyuanVideo15Transformer3DModel vae: AutoencoderKLHunyuanVideo15 scheduler: FlowMatchEulerDiscreteScheduler text_encoder_2: T5EncoderModel tokenizer_2: ByT5Tokenizer guider: ClassifierFreeGuidance )
Parameters
- transformer (HunyuanVideo15Transformer3DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded video latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
- vae (AutoencoderKLHunyuanVideo15) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (`Qwen2_5_VLTextModel`) — Qwen2.5-VL text encoder, specifically the Qwen2.5-VL-7B-Instruct variant.
- tokenizer (`Qwen2Tokenizer`) — Tokenizer of class `Qwen2Tokenizer`.
- text_encoder_2 (`T5EncoderModel`) — `T5EncoderModel` variant.
- tokenizer_2 (`ByT5Tokenizer`) — Tokenizer of class `ByT5Tokenizer`.
- guider (ClassifierFreeGuidance) — `ClassifierFreeGuidance` for classifier-free guidance.
Pipeline for text-to-video generation using HunyuanVideo1.5.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_frames: int = 121 num_inference_steps: int = 50 sigmas: typing.List[float] = None num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None ) → ~HunyuanVideo15PipelineOutput or tuple
Parameters
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide video generation. If not defined, one has to pass `prompt_embeds` instead.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide video generation. If not defined, one has to pass `negative_prompt_embeds` instead.
- height (`int`, optional) — The height in pixels of the generated video.
- width (`int`, optional) — The width in pixels of the generated video.
- num_frames (`int`, defaults to 121) — The number of frames in the generated video.
- num_inference_steps (`int`, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
- sigmas (`List[float]`, optional) — Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used.
- num_videos_per_prompt (`int`, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — A `torch.Generator` to make generation deterministic.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
- prompt_embeds_mask (`torch.Tensor`, optional) — Pre-generated mask for prompt embeddings.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
- negative_prompt_embeds_mask (`torch.Tensor`, optional) — Pre-generated mask for negative prompt embeddings.
- prompt_embeds_2 (`torch.Tensor`, optional) — Pre-generated text embeddings from the second text encoder. Can be used to easily tweak text inputs.
- prompt_embeds_mask_2 (`torch.Tensor`, optional) — Pre-generated mask for prompt embeddings from the second text encoder.
- negative_prompt_embeds_2 (`torch.Tensor`, optional) — Pre-generated negative text embeddings from the second text encoder.
- negative_prompt_embeds_mask_2 (`torch.Tensor`, optional) — Pre-generated mask for negative prompt embeddings from the second text encoder.
- output_type (`str`, optional, defaults to `"np"`) — The output format of the generated video. Choose between "np", "pt", or "latent".
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `HunyuanVideo15PipelineOutput` instead of a plain tuple.
- attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
Returns
~HunyuanVideo15PipelineOutput or tuple
If return_dict is True, HunyuanVideo15PipelineOutput is returned, otherwise a tuple is
returned where the first element is a list with the generated videos.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import HunyuanVideo15Pipeline
>>> from diffusers.utils import export_to_video
>>> model_id = "hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v"
>>> pipe = HunyuanVideo15Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)
>>> pipe.vae.enable_tiling()
>>> pipe.to("cuda")
>>> output = pipe(
... prompt="A cat walks on the grass, realistic",
... num_inference_steps=50,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=15)

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None batch_size: int = 1 num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None )
Parameters
- prompt (`str` or `List[str]`, optional) — prompt to be encoded
- device (`torch.device`, optional) — torch device
- batch_size (`int`) — batch size of prompts, defaults to 1
- num_videos_per_prompt (`int`) — number of videos that should be generated per prompt
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. If not provided, text embeddings will be generated from the `prompt` input argument.
- prompt_embeds_mask (`torch.Tensor`, optional) — Pre-generated text mask. If not provided, the text mask will be generated from the `prompt` input argument.
- prompt_embeds_2 (`torch.Tensor`, optional) — Pre-generated glyph text embeddings from ByT5. If not provided, they will be generated from the `prompt` input argument using `self.tokenizer_2` and `self.text_encoder_2`.
- prompt_embeds_mask_2 (`torch.Tensor`, optional) — Pre-generated glyph text mask from ByT5. If not provided, it will be generated from the `prompt` input argument using `self.tokenizer_2` and `self.text_encoder_2`.
prepare_cond_latents_and_mask
< source >( latents dtype: typing.Optional[torch.dtype] device: typing.Optional[torch.device] ) → tuple
Prepare conditional latents and mask for t2v generation.
HunyuanVideo15ImageToVideoPipeline
class diffusers.HunyuanVideo15ImageToVideoPipeline
< source >( text_encoder: Qwen2_5_VLTextModel tokenizer: Qwen2Tokenizer transformer: HunyuanVideo15Transformer3DModel vae: AutoencoderKLHunyuanVideo15 scheduler: FlowMatchEulerDiscreteScheduler text_encoder_2: T5EncoderModel tokenizer_2: ByT5Tokenizer guider: ClassifierFreeGuidance image_encoder: SiglipVisionModel feature_extractor: SiglipImageProcessor )
Parameters
- transformer (HunyuanVideo15Transformer3DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded video latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
- vae (AutoencoderKLHunyuanVideo15) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (`Qwen2_5_VLTextModel`) — Qwen2.5-VL text encoder, specifically the Qwen2.5-VL-7B-Instruct variant.
- tokenizer (`Qwen2Tokenizer`) — Tokenizer of class `Qwen2Tokenizer`.
- text_encoder_2 (`T5EncoderModel`) — `T5EncoderModel` variant.
- tokenizer_2 (`ByT5Tokenizer`) — Tokenizer of class `ByT5Tokenizer`.
- guider (ClassifierFreeGuidance) — `ClassifierFreeGuidance` for classifier-free guidance.
- image_encoder (`SiglipVisionModel`) — `SiglipVisionModel` variant.
- feature_extractor (`SiglipImageProcessor`) — `SiglipImageProcessor` variant.
Pipeline for image-to-video generation using HunyuanVideo1.5.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( image: Image prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None num_frames: int = 121 num_inference_steps: int = 50 sigmas: typing.List[float] = None num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None ) → ~HunyuanVideo15PipelineOutput or tuple
Parameters
- image (`PIL.Image.Image`) — The input image to condition video generation on.
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide video generation. If not defined, one has to pass `prompt_embeds` instead.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide video generation. If not defined, one has to pass `negative_prompt_embeds` instead.
- num_frames (`int`, defaults to 121) — The number of frames in the generated video.
- num_inference_steps (`int`, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
- sigmas (`List[float]`, optional) — Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used.
- num_videos_per_prompt (`int`, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — A `torch.Generator` to make generation deterministic.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
- prompt_embeds_mask (`torch.Tensor`, optional) — Pre-generated mask for prompt embeddings.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
- negative_prompt_embeds_mask (`torch.Tensor`, optional) — Pre-generated mask for negative prompt embeddings.
- prompt_embeds_2 (`torch.Tensor`, optional) — Pre-generated text embeddings from the second text encoder. Can be used to easily tweak text inputs.
- prompt_embeds_mask_2 (`torch.Tensor`, optional) — Pre-generated mask for prompt embeddings from the second text encoder.
- negative_prompt_embeds_2 (`torch.Tensor`, optional) — Pre-generated negative text embeddings from the second text encoder.
- negative_prompt_embeds_mask_2 (`torch.Tensor`, optional) — Pre-generated mask for negative prompt embeddings from the second text encoder.
- output_type (`str`, optional, defaults to `"np"`) — The output format of the generated video. Choose between "np", "pt", or "latent".
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `HunyuanVideo15PipelineOutput` instead of a plain tuple.
- attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
Returns
~HunyuanVideo15PipelineOutput or tuple
If return_dict is True, HunyuanVideo15PipelineOutput is returned, otherwise a tuple is
returned where the first element is a list with the generated videos.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import HunyuanVideo15ImageToVideoPipeline
>>> from diffusers.utils import export_to_video, load_image
>>> model_id = "hunyuanvideo-community/HunyuanVideo-1.5-480p_i2v"
>>> pipe = HunyuanVideo15ImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
>>> pipe.vae.enable_tiling()
>>> pipe.to("cuda")
>>> image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG")
>>> output = pipe(
... prompt="Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.",
... image=image,
... num_inference_steps=50,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=24)

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None batch_size: int = 1 num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_mask: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_embeds_mask_2: typing.Optional[torch.Tensor] = None )
Parameters
- prompt (`str` or `List[str]`, optional) — prompt to be encoded
- device (`torch.device`, optional) — torch device
- batch_size (`int`) — batch size of prompts, defaults to 1
- num_videos_per_prompt (`int`) — number of videos that should be generated per prompt
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. If not provided, text embeddings will be generated from the `prompt` input argument.
- prompt_embeds_mask (`torch.Tensor`, optional) — Pre-generated text mask. If not provided, the text mask will be generated from the `prompt` input argument.
- prompt_embeds_2 (`torch.Tensor`, optional) — Pre-generated glyph text embeddings from ByT5. If not provided, they will be generated from the `prompt` input argument using `self.tokenizer_2` and `self.text_encoder_2`.
- prompt_embeds_mask_2 (`torch.Tensor`, optional) — Pre-generated glyph text mask from ByT5. If not provided, it will be generated from the `prompt` input argument using `self.tokenizer_2` and `self.text_encoder_2`.
prepare_cond_latents_and_mask
< source >( latents: Tensor image: Image batch_size: int height: int width: int dtype: dtype device: device ) → tuple
Prepare conditional latents and mask for i2v generation.
HunyuanVideo15PipelineOutput
class diffusers.pipelines.hunyuan_video1_5.pipeline_output.HunyuanVideo15PipelineOutput
< source >( frames: Tensor )
Parameters
- frames (`torch.Tensor`, `np.ndarray`, or `List[List[PIL.Image.Image]]`) — List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
Output class for HunyuanVideo1.5 pipelines.
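As a quick illustration (a sketch, not from the original docs) of how the output container is consumed, assuming a `pipe` loaded as in the examples above:

# return_dict=True (default): a HunyuanVideo15PipelineOutput with a .frames field
out = pipe(prompt="A cat walks on the grass, realistic", output_type="pt")
video = out.frames  # torch.Tensor of shape (batch_size, num_frames, channels, height, width)

# return_dict=False: a plain tuple whose first element holds the generated videos
(video,) = pipe(prompt="A cat walks on the grass, realistic", output_type="pt", return_dict=False)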