HunyuanDiT2DModel

A Diffusion Transformer model for 2D data from Hunyuan-DiT.

class diffusers.HunyuanDiT2DModel

(
    num_attention_heads: int = 16,
    attention_head_dim: int = 88,
    in_channels: Optional = None,
    patch_size: Optional = None,
    activation_fn: str = 'gelu-approximate',
    sample_size = 32,
    hidden_size = 1152,
    num_layers: int = 28,
    mlp_ratio: float = 4.0,
    learn_sigma: bool = True,
    cross_attention_dim: int = 1024,
    norm_type: str = 'layer_norm',
    cross_attention_dim_t5: int = 2048,
    pooled_projection_dim: int = 1024,
    text_len: int = 77,
    text_len_t5: int = 256
)

Parameters

num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
in_channels (int, optional) — The number of channels in the input and output (specify if the input is continuous).
patch_size (int, optional) — The size of the patch to use for the input.
activation_fn (str, optional, defaults to "gelu-approximate") — Activation function to use in feed-forward.
sample_size (int, optional, defaults to 32) — The width of the latent images. This is fixed during training since it is used to learn a number of position embeddings.
dropout (float, optional, defaults to 0.0) — The dropout probability to use.
cross_attention_dim (int, optional, defaults to 1024) — The number of dimensions in the CLIP text embedding.
hidden_size (int, optional, defaults to 1152) — The size of the hidden layer in the conditioning embedding layers.
num_layers (int, optional, defaults to 28) — The number of layers of Transformer blocks to use.
mlp_ratio (float, optional, defaults to 4.0) — The ratio of the hidden layer size to the input size.
learn_sigma (bool, optional, defaults to True) — Whether to predict variance.
cross_attention_dim_t5 (int, optional, defaults to 2048) — The number of dimensions in the T5 text embedding.
pooled_projection_dim (int, optional, defaults to 1024) — The size of the pooled projection.
text_len (int, optional, defaults to 77) — The length of the CLIP text embedding.
text_len_t5 (int, optional, defaults to 256) — The length of the T5 text embedding.

HunYuanDiT: Diffusion model with a Transformer backbone.

Inherits from ModelMixin and ConfigMixin to be compatible with the sampler StableDiffusionPipeline of diffusers.
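The constructor signature above gives no default for in_channels or patch_size, so a caller would normally have to supply both. As a minimal sketch, the snippet below collects the documented defaults in a plain dictionary and merges overrides into it. The helper name build_config is hypothetical, and the values 4 and 2 chosen for in_channels and patch_size are illustrative assumptions, not values stated on this page.

```python
# Defaults copied from the constructor signature documented above.
DOCUMENTED_DEFAULTS = {
    "num_attention_heads": 16,
    "attention_head_dim": 88,
    "in_channels": None,  # no default documented: caller must supply it
    "patch_size": None,   # no default documented: caller must supply it
    "activation_fn": "gelu-approximate",
    "sample_size": 32,
    "hidden_size": 1152,
    "num_layers": 28,
    "mlp_ratio": 4.0,
    "learn_sigma": True,
    "cross_attention_dim": 1024,
    "norm_type": "layer_norm",
    "cross_attention_dim_t5": 2048,
    "pooled_projection_dim": 1024,
    "text_len": 77,
    "text_len_t5": 256,
}


def build_config(**overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    config = {**DOCUMENTED_DEFAULTS, **overrides}
    # Any entry still set to None has no documented default.
    missing = [name for name, value in config.items() if value is None]
    if missing:
        raise ValueError(f"no default documented for: {missing}")
    return config


# Assumed values, for illustration only.
config = build_config(in_channels=4, patch_size=2)
print(config["in_channels"], config["patch_size"], config["num_layers"])
```

With diffusers installed, a dictionary like this could be unpacked into the constructor as HunyuanDiT2DModel(**config). Whether a particular combination of sizes produces a working model is not something this page specifies, so treat the values as a starting point rather than a verified configuration.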

forward

(
    hidden_states,
    timestep,
    encoder_hidden_states = None,
    text_embedding_mask = None,
    encoder_hidden_states_t5 = None,
    text_embedding_mask_t5 = None,
    image_meta_size = None,
    style = None,
    image_rotary_emb = None,
    return_dict = True
)

Parameters

hidden_states (torch.Tensor of shape (batch size, dim, height, width)) — The input tensor.
timestep (torch.LongTensor, optional) — Used to indicate the denoising step.
encoder_hidden_states (torch.Tensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for the cross-attention layer. This is the output of BertModel.
text_embedding_mask (torch.Tensor) — An attention mask of shape (batch, key_tokens), applied to encoder_hidden_states. This is the output of BertModel.
encoder_hidden_states_t5 (torch.Tensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for the cross-attention layer. This is the output of the T5 Text Encoder.
text_embedding_mask_t5 (torch.Tensor) — An attention mask of shape (batch, key_tokens), applied to encoder_hidden_states. This is the output of the T5 Text Encoder.
image_meta_size (torch.Tensor) — Conditional embedding indicating the image sizes.
style (torch.Tensor) — Conditional embedding indicating the style.
image_rotary_emb (torch.Tensor) — The image rotary embeddings to apply to the query and key tensors during attention calculation.
return_dict (bool) — Whether to return a dictionary.

The HunyuanDiT2DModel forward method.
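To make the documented tensor shapes concrete, the sketch below computes the shape each text and image input would have for a given batch. It rests on assumptions this page does not state outright: that dim equals in_channels, that sequence len and embed dims for the BertModel output equal text_len and cross_attention_dim, and that the T5 counterparts equal text_len_t5 and cross_attention_dim_t5. The helper name forward_input_shapes and the example sizes are hypothetical.

```python
def forward_input_shapes(
    batch_size,
    in_channels,
    height,
    width,
    text_len=77,
    cross_attention_dim=1024,
    text_len_t5=256,
    cross_attention_dim_t5=2048,
):
    """Hypothetical helper: shapes for the documented forward() tensors.

    Assumes the text sequence lengths and embedding sizes match the
    constructor defaults (text_len, cross_attention_dim, and so on).
    """
    return {
        # (batch size, dim, height, width)
        "hidden_states": (batch_size, in_channels, height, width),
        # (batch size, sequence len, embed dims)
        "encoder_hidden_states": (batch_size, text_len, cross_attention_dim),
        # (batch, key_tokens)
        "text_embedding_mask": (batch_size, text_len),
        # (batch size, sequence len, embed dims)
        "encoder_hidden_states_t5": (batch_size, text_len_t5, cross_attention_dim_t5),
        # (batch, key_tokens)
        "text_embedding_mask_t5": (batch_size, text_len_t5),
    }


# Example sizes are assumptions chosen for illustration.
shapes = forward_input_shapes(batch_size=2, in_channels=4, height=32, width=32)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

With PyTorch available, each shape could be turned into a dummy tensor with torch.randn(*shape) for a quick smoke test. timestep, image_meta_size, style, and image_rotary_emb are left out because this page does not give their shapes.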
