HunyuanDiT2DModel
A Diffusion Transformer model for 2D data from Hunyuan-DiT.
class diffusers.HunyuanDiT2DModel
( num_attention_heads: int = 16, attention_head_dim: int = 88, in_channels: Optional[int] = None, patch_size: Optional[int] = None, activation_fn: str = 'gelu-approximate', sample_size = 32, hidden_size = 1152, num_layers: int = 28, mlp_ratio: float = 4.0, learn_sigma: bool = True, cross_attention_dim: int = 1024, norm_type: str = 'layer_norm', cross_attention_dim_t5: int = 2048, pooled_projection_dim: int = 1024, text_len: int = 77, text_len_t5: int = 256 )
Parameters
- num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
- attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
- in_channels (int, optional) — The number of channels in the input and output (specify if the input is continuous).
- patch_size (int, optional) — The size of the patch to use for the input.
- activation_fn (str, optional, defaults to "gelu-approximate") — Activation function to use in feed-forward.
- sample_size (int, optional, defaults to 32) — The width of the latent images. This is fixed during training since it is used to learn a number of position embeddings.
- dropout (float, optional, defaults to 0.0) — The dropout probability to use.
- cross_attention_dim (int, optional, defaults to 1024) — The dimensionality of the CLIP text embedding.
- hidden_size (int, optional, defaults to 1152) — The size of the hidden layer in the conditioning embedding layers.
- num_layers (int, optional, defaults to 28) — The number of Transformer blocks to use.
- mlp_ratio (float, optional, defaults to 4.0) — The ratio of the hidden layer size to the input size.
- learn_sigma (bool, optional, defaults to True) — Whether to predict variance.
- cross_attention_dim_t5 (int, optional, defaults to 2048) — The dimensionality of the T5 text embedding.
- pooled_projection_dim (int, optional, defaults to 1024) — The size of the pooled projection.
- text_len (int, optional, defaults to 77) — The length of the CLIP text embedding.
- text_len_t5 (int, optional, defaults to 256) — The length of the T5 text embedding.
HunYuanDiT: Diffusion model with a Transformer backbone.
Inherits from ModelMixin and ConfigMixin to be compatible with the sampler StableDiffusionPipeline of diffusers.
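To make the relationship between the constructor arguments more concrete, here is a minimal sketch in plain Python. It does not import diffusers. `HunyuanDiTConfigSketch` is a hypothetical stand-in that only copies the defaults from the signature above. The derived quantities rest on assumptions that this page does not confirm: that the transformer width is `num_attention_heads * attention_head_dim`, that the feed-forward hidden width is that value times `mlp_ratio`, and that predicting variance (`learn_sigma=True`) doubles the output channels, as is common in DiT-style models. The values `in_channels=4` and `patch_size=2` are illustrative, since both default to `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HunyuanDiTConfigSketch:
    """Hypothetical stand-in mirroring some constructor defaults; not the real class."""

    num_attention_heads: int = 16
    attention_head_dim: int = 88
    in_channels: Optional[int] = None
    patch_size: Optional[int] = None
    sample_size: int = 32
    mlp_ratio: float = 4.0
    learn_sigma: bool = True

    def inner_dim(self) -> int:
        # Assumption: transformer width = heads * channels per head.
        return self.num_attention_heads * self.attention_head_dim

    def ff_hidden_dim(self) -> int:
        # mlp_ratio is documented as hidden layer size / input size.
        return int(self.inner_dim() * self.mlp_ratio)

    def out_channels(self) -> int:
        # Assumption: predicting variance doubles the output channels.
        if self.in_channels is None:
            raise ValueError("in_channels must be set")
        return self.in_channels * 2 if self.learn_sigma else self.in_channels

    def num_patches(self) -> int:
        # Assumption: a square latent split into non-overlapping square patches.
        if self.patch_size is None:
            raise ValueError("patch_size must be set")
        per_side = self.sample_size // self.patch_size
        return per_side * per_side


# Illustrative values for the two arguments that default to None.
cfg = HunyuanDiTConfigSketch(in_channels=4, patch_size=2)
print(cfg.inner_dim())      # 1408
print(cfg.ff_hidden_dim())  # 5632
print(cfg.out_channels())   # 8
print(cfg.num_patches())    # 256
```

Because `sample_size` fixes the number of patches, changing it after training would change the number of positions the model has to embed, which is consistent with the note in the parameter list that it is fixed during training.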
forward
( hidden_states, timestep, encoder_hidden_states = None, text_embedding_mask = None, encoder_hidden_states_t5 = None, text_embedding_mask_t5 = None, image_meta_size = None, style = None, image_rotary_emb = None, return_dict = True )
Parameters
- hidden_states (torch.Tensor of shape (batch size, dim, height, width)) — The input tensor.
- timestep (torch.LongTensor, optional) — Used to indicate denoising step.
- encoder_hidden_states (torch.Tensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for cross attention layer. This is the output of BertModel.
- text_embedding_mask (torch.Tensor) — An attention mask of shape (batch, key_tokens) applied to encoder_hidden_states. This is the output of BertModel.
- encoder_hidden_states_t5 (torch.Tensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for cross attention layer. This is the output of T5 Text Encoder.
- text_embedding_mask_t5 (torch.Tensor) — An attention mask of shape (batch, key_tokens) applied to encoder_hidden_states_t5. This is the output of T5 Text Encoder.
- image_meta_size (torch.Tensor) — Conditional embedding indicating the image sizes.
- style (torch.Tensor) — Conditional embedding indicating the style.
- image_rotary_emb (torch.Tensor) — The image rotary embeddings to apply on query and key tensors during attention calculation.
- return_dict (bool) — Whether to return a dictionary.
The HunyuanDiT2DModel forward method.
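The shapes listed for the forward arguments can be sketched as a small lookup, which may help when building dummy inputs. This is plain Python and does not call the model. `forward_input_shapes` is a hypothetical helper, and it assumes that "sequence len" and "embed dims" correspond to `text_len` / `cross_attention_dim` for the BERT branch and `text_len_t5` / `cross_attention_dim_t5` for the T5 branch, and that `timestep` holds one step index per sample. Shapes for `image_meta_size`, `style`, and `image_rotary_emb` are not stated on this page, so they are left out.

```python
from typing import Dict, Tuple


def forward_input_shapes(
    batch_size: int,
    in_channels: int,
    height: int,
    width: int,
    text_len: int = 77,
    cross_attention_dim: int = 1024,
    text_len_t5: int = 256,
    cross_attention_dim_t5: int = 2048,
) -> Dict[str, Tuple[int, ...]]:
    """Hypothetical helper: tensor shapes implied by the parameter descriptions."""
    return {
        # (batch size, dim, height, width)
        "hidden_states": (batch_size, in_channels, height, width),
        # Assumption: one denoising step index per sample.
        "timestep": (batch_size,),
        # (batch size, sequence len, embed dims) for the BERT text branch.
        "encoder_hidden_states": (batch_size, text_len, cross_attention_dim),
        # (batch, key_tokens) mask over the BERT tokens.
        "text_embedding_mask": (batch_size, text_len),
        # (batch size, sequence len, embed dims) for the T5 text branch.
        "encoder_hidden_states_t5": (batch_size, text_len_t5, cross_attention_dim_t5),
        # (batch, key_tokens) mask over the T5 tokens.
        "text_embedding_mask_t5": (batch_size, text_len_t5),
    }


# Illustrative call: 2 samples, 4 latent channels, 32x32 latents.
shapes = forward_input_shapes(batch_size=2, in_channels=4, height=32, width=32)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

Each tuple could be passed to a tensor constructor such as `torch.randn(*shape)` to create placeholder inputs; the masks and `timestep` would need integer or boolean dtypes instead.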