HunyuanDiT2DModel
A Diffusion Transformer model for 2D data from Hunyuan-DiT.
HunyuanDiT2DModel
class diffusers.HunyuanDiT2DModel
< source >( num_attention_heads: int = 16 attention_head_dim: int = 88 in_channels: typing.Optional[int] = None patch_size: typing.Optional[int] = None activation_fn: str = 'gelu-approximate' sample_size = 32 hidden_size = 1152 num_layers: int = 28 mlp_ratio: float = 4.0 learn_sigma: bool = True cross_attention_dim: int = 1024 norm_type: str = 'layer_norm' cross_attention_dim_t5: int = 2048 pooled_projection_dim: int = 1024 text_len: int = 77 text_len_t5: int = 256 use_style_cond_and_image_meta_size: bool = True )
Parameters
- num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
- attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
- in_channels (int, optional) — The number of channels in the input and output (specify if the input is continuous).
- patch_size (int, optional) — The size of the patch to use for the input.
- activation_fn (str, optional, defaults to "gelu-approximate") — Activation function to use in the feed-forward layers.
- sample_size (int, optional, defaults to 32) — The width of the latent images. This is fixed during training since it is used to learn a number of position embeddings.
- dropout (float, optional, defaults to 0.0) — The dropout probability to use.
- cross_attention_dim (int, optional, defaults to 1024) — The number of dimensions in the CLIP text embedding.
- hidden_size (int, optional, defaults to 1152) — The size of the hidden layer in the conditioning embedding layers.
- num_layers (int, optional, defaults to 28) — The number of Transformer blocks to use.
- mlp_ratio (float, optional, defaults to 4.0) — The ratio of the hidden layer size to the input size.
- learn_sigma (bool, optional, defaults to True) — Whether to predict variance.
- cross_attention_dim_t5 (int, optional, defaults to 2048) — The number of dimensions in the T5 text embedding.
- pooled_projection_dim (int, optional, defaults to 1024) — The size of the pooled projection.
- text_len (int, optional, defaults to 77) — The length of the CLIP text embedding.
- text_len_t5 (int, optional, defaults to 256) — The length of the T5 text embedding.
- use_style_cond_and_image_meta_size (bool, optional, defaults to True) — Whether to use the style condition and image meta size. True for version <= 1.1, False for version >= 1.2.
HunYuanDiT: a diffusion model with a Transformer backbone.
It inherits from ModelMixin and ConfigMixin, making it compatible with Diffusers samplers and pipelines.
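The model is typically loaded from a pretrained checkpoint rather than instantiated from scratch. A minimal loading sketch, assuming the usual Diffusers repository layout with the transformer weights in a "transformer" subfolder:

```python
import torch
from diffusers import HunyuanDiT2DModel

# Load only the transformer from a HunyuanDiT checkpoint. The repo id and
# "transformer" subfolder are assumptions based on the standard Diffusers layout.
transformer = HunyuanDiT2DModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.float16,
)
```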
enable_forward_chunking
< source >( chunk_size: typing.Optional[int] = None dim: int = 0 )
Parameters
- chunk_size (int, optional) — The chunk size of the feed-forward layers. If not specified, the feed-forward layers run individually over each tensor of dim=dim.
- dim (int, optional, defaults to 0) — The dimension over which the feed-forward computation should be chunked. Choose between dim=0 (batch) or dim=1 (sequence length).
Sets the attention processor to use feed-forward chunking.
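A short usage sketch, assuming a checkpoint in the layout shown above; chunking the feed-forward layers along the sequence dimension (dim=1) lowers peak memory at the cost of extra compute:

```python
import torch
from diffusers import HunyuanDiT2DModel

transformer = HunyuanDiT2DModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.float16,
)

# Process the feed-forward layers one chunk at a time along the sequence
# dimension to reduce peak memory usage.
transformer.enable_forward_chunking(chunk_size=1, dim=1)
```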
forward
< source >( hidden_states timestep encoder_hidden_states = None text_embedding_mask = None encoder_hidden_states_t5 = None text_embedding_mask_t5 = None image_meta_size = None style = None image_rotary_emb = None controlnet_block_samples = None return_dict = True )
Parameters
- hidden_states (torch.Tensor of shape (batch size, dim, height, width)) — The input tensor.
- timestep (torch.LongTensor, optional) — Used to indicate the denoising step.
- encoder_hidden_states (torch.Tensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for the cross-attention layer. This is the output of BertModel.
- text_embedding_mask (torch.Tensor) — An attention mask of shape (batch, key_tokens) applied to encoder_hidden_states. This is the output of BertModel.
- encoder_hidden_states_t5 (torch.Tensor of shape (batch size, sequence len, embed dims), optional) — Conditional embeddings for the cross-attention layer. This is the output of the T5 text encoder.
- text_embedding_mask_t5 (torch.Tensor) — An attention mask of shape (batch, key_tokens) applied to encoder_hidden_states_t5. This is the output of the T5 text encoder.
- image_meta_size (torch.Tensor) — Conditional embedding indicating the image sizes.
- style (torch.Tensor) — Conditional embedding indicating the style.
- image_rotary_emb (torch.Tensor) — The image rotary embeddings to apply on query and key tensors during attention calculation.
- return_dict (bool) — Whether to return a dictionary.
The HunyuanDiT2DModel forward method.
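In practice, forward is rarely called directly; a pipeline prepares the latents, text embeddings, attention masks, and rotary embeddings, then invokes the model at every denoising step. A minimal sketch with HunyuanDiTPipeline, assuming the checkpoint id used above:

```python
import torch
from diffusers import HunyuanDiTPipeline

# The pipeline builds encoder_hidden_states, the attention masks, and the
# image rotary embeddings, and calls HunyuanDiT2DModel.forward internally.
pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="An astronaut riding a horse").images[0]
image.save("astronaut.png")
```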
fuse_qkv_projections
Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
> This API is 🧪 experimental.
set_default_attn_processor
Disables custom attention processors and sets the default attention implementation.
unfuse_qkv_projections
Disables the fused QKV projection if enabled.
> This API is 🧪 experimental.
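A brief sketch of toggling the fused projections on a loaded model, assuming the same checkpoint layout as above:

```python
import torch
from diffusers import HunyuanDiT2DModel

transformer = HunyuanDiT2DModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.float16,
)

# Fuse query/key/value projections into a single matmul per attention module.
transformer.fuse_qkv_projections()

# ... run inference ...

# Undo the fusion and restore the default attention processors.
transformer.unfuse_qkv_projections()
transformer.set_default_attn_processor()
```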