Diffusers documentation

HunyuanDiT2DModel

You are viewing version v0.28.2. A newer version, v0.29.2, is available.

A Diffusion Transformer model for 2D data from Hunyuan-DiT.

class diffusers.HunyuanDiT2DModel

( num_attention_heads: int = 16 attention_head_dim: int = 88 in_channels: Optional = None patch_size: Optional = None activation_fn: str = 'gelu-approximate' sample_size = 32 hidden_size = 1152 num_layers: int = 28 mlp_ratio: float = 4.0 learn_sigma: bool = True cross_attention_dim: int = 1024 norm_type: str = 'layer_norm' cross_attention_dim_t5: int = 2048 pooled_projection_dim: int = 1024 text_len: int = 77 text_len_t5: int = 256 )

Parameters

  • num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
  • attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
  • in_channels (int, optional) — The number of channels in the input and output (specify if the input is continuous).
  • patch_size (int, optional) — The size of the patch to use for the input.
  • activation_fn (str, optional, defaults to "gelu-approximate") — Activation function to use in feed-forward.
  • sample_size (int, optional, defaults to 32) — The width of the latent images. This is fixed during training since it is used to learn a number of position embeddings.
  • dropout (float, optional, defaults to 0.0) — The dropout probability to use. This argument does not appear in the constructor signature above.
  • cross_attention_dim (int, optional, defaults to 1024) — The number of dimensions in the CLIP text embedding.
  • hidden_size (int, optional, defaults to 1152) — The size of the hidden layer in the conditioning embedding layers.
  • num_layers (int, optional, defaults to 28) — The number of layers of Transformer blocks to use.
  • mlp_ratio (float, optional, defaults to 4.0) — The ratio of the hidden layer size to the input size.
  • learn_sigma (bool, optional, defaults to True) — Whether to predict variance.
  • cross_attention_dim_t5 (int, optional, defaults to 2048) — The number of dimensions in the T5 text embedding.
  • pooled_projection_dim (int, optional, defaults to 1024) — The size of the pooled projection.
  • text_len (int, optional, defaults to 77) — The length of the CLIP text embedding.
  • text_len_t5 (int, optional, defaults to 256) — The length of the T5 text embedding.
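
As a rough illustration of how these arguments relate to one another, the sketch below derives a few sizes from the defaults in the signature. It is plain Python and does not import diffusers. `DiTConfigSketch` and `derived_sizes` are hypothetical names, `in_channels=4` and `patch_size=2` are assumed values (the signature defaults are None), and the formulas follow the usual DiT conventions rather than being read from the library source.

```python
from dataclasses import dataclass


@dataclass
class DiTConfigSketch:
    """Hypothetical stand-in for the constructor arguments; not a diffusers class."""
    num_attention_heads: int = 16
    attention_head_dim: int = 88
    in_channels: int = 4   # assumed value; the signature default is None
    patch_size: int = 2    # assumed value; the signature default is None
    sample_size: int = 32
    mlp_ratio: float = 4.0
    learn_sigma: bool = True


def derived_sizes(cfg: DiTConfigSketch) -> dict:
    """Derive sizes under the usual DiT conventions (assumptions, not library facts)."""
    # Total attention width: number of heads times channels per head.
    attention_width = cfg.num_attention_heads * cfg.attention_head_dim
    # A square latent is cut into non-overlapping patch_size x patch_size patches.
    tokens_per_side = cfg.sample_size // cfg.patch_size
    return {
        "attention_width": attention_width,
        # Assumes mlp_ratio scales the attention width in the feed-forward layer.
        "feed_forward_width": int(attention_width * cfg.mlp_ratio),
        "num_tokens": tokens_per_side ** 2,
        # Assumes predicting the variance doubles the output channels, as in DiT.
        "out_channels": cfg.in_channels * 2 if cfg.learn_sigma else cfg.in_channels,
    }


sizes = derived_sizes(DiTConfigSketch())
print(sizes)
```

With the values above, this gives an attention width of 1408, a feed-forward width of 5632, 256 tokens for a 32 x 32 latent, and 8 output channels.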

HunYuanDiT: Diffusion model with a Transformer backbone.

Inherits from ModelMixin and ConfigMixin for compatibility with the StableDiffusionPipeline sampler in diffusers.

forward

( hidden_states timestep encoder_hidden_states = None text_embedding_mask = None encoder_hidden_states_t5 = None text_embedding_mask_t5 = None image_meta_size = None style = None image_rotary_emb = None return_dict = True )

Parameters

  • hidden_states (torch.Tensor of shape (batch size, dim, height, width)) — The input tensor.
  • timestep (torch.LongTensor, optional) — Used to indicate the denoising step.
  • encoder_hidden_states (torch.Tensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for the cross-attention layer. This is the output of BertModel.
  • text_embedding_mask (torch.Tensor) — An attention mask of shape (batch, key_tokens) applied to encoder_hidden_states. This is the output of BertModel.
  • encoder_hidden_states_t5 (torch.Tensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for the cross-attention layer. This is the output of the T5 text encoder.
  • text_embedding_mask_t5 (torch.Tensor) — An attention mask of shape (batch, key_tokens) applied to encoder_hidden_states_t5. This is the output of the T5 text encoder.
  • image_meta_size (torch.Tensor) — Conditional embedding indicating the image sizes.
  • style (torch.Tensor) — Conditional embedding indicating the style.
  • image_rotary_emb (torch.Tensor) — The image rotary embeddings to apply to the query and key tensors during the attention calculation.
  • return_dict (bool) — Whether to return a dictionary.

The HunyuanDiT2DModel forward method.
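
To make the shape conventions above concrete, the following sketch lists the shapes the text-conditioning inputs would take for a given batch. It is plain Python, not a call into diffusers. `forward_input_shapes` is a hypothetical helper, and it assumes that the sequence lengths match text_len and text_len_t5 and that the embedding sizes match cross_attention_dim and cross_attention_dim_t5 from the constructor defaults. Arguments whose shapes are not stated in the parameter list (image_meta_size, style, image_rotary_emb) are left out rather than guessed.

```python
def forward_input_shapes(
    batch: int,
    channels: int,
    height: int,
    width: int,
    text_len: int = 77,
    cross_attention_dim: int = 1024,
    text_len_t5: int = 256,
    cross_attention_dim_t5: int = 2048,
) -> dict:
    """Hypothetical helper: expected input shapes under the stated assumptions."""
    return {
        # (batch size, dim, height, width), as documented for hidden_states.
        "hidden_states": (batch, channels, height, width),
        # (batch size, sequence len, embed dims) for the BertModel output.
        "encoder_hidden_states": (batch, text_len, cross_attention_dim),
        # (batch, key_tokens): one mask entry per text token.
        "text_embedding_mask": (batch, text_len),
        # (batch size, sequence len, embed dims) for the T5 encoder output.
        "encoder_hidden_states_t5": (batch, text_len_t5, cross_attention_dim_t5),
        "text_embedding_mask_t5": (batch, text_len_t5),
    }


# Example: two latents with 4 channels (an assumed value) at 32 x 32.
shapes = forward_input_shapes(batch=2, channels=4, height=32, width=32)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

A table like this can serve as a quick check before calling the model: if a tensor's shape differs from the entry here, the constructor arguments and the text encoders are probably out of sync.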