Diffusers

You are viewing version v0.30.2. A newer version, v0.35.1, is available.

LuminaNextDiT2DModel

A Next Version of Diffusion Transformer model for 2D data from Lumina-T2X.

LuminaNextDiT2DModel

class diffusers.LuminaNextDiT2DModel

( sample_size: int = 128, patch_size: Optional = 2, in_channels: Optional = 4, hidden_size: Optional = 2304, num_layers: Optional = 32, num_attention_heads: Optional = 32, num_kv_heads: Optional = None, multiple_of: Optional = 256, ffn_dim_multiplier: Optional = None, norm_eps: Optional = 1e-05, learn_sigma: Optional = True, qk_norm: Optional = True, cross_attention_dim: Optional = 2048, scaling_factor: Optional = 1.0 )

Parameters

sample_size (int, defaults to 128) — The width of the latent images. This is fixed during training since it is used to learn a number of position embeddings.
patch_size (int, optional, defaults to 2) — The size of each patch in the image. This parameter defines the resolution of patches fed into the model.
in_channels (int, optional, defaults to 4) — The number of input channels for the model. Typically, this matches the number of channels in the input images.
hidden_size (int, optional, defaults to 2304) — The dimensionality of the hidden layers in the model. This parameter determines the width of the model’s hidden representations.
num_layers (int, optional, defaults to 32) — The number of layers in the model. This defines the depth of the neural network.
num_attention_heads (int, optional, defaults to 32) — The number of attention heads in each attention layer. This parameter specifies how many separate attention mechanisms are used.
num_kv_heads (int, optional, defaults to None) — The number of key-value heads in the attention mechanism, if different from the number of attention heads. If None, it defaults to num_attention_heads.
multiple_of (int, optional, defaults to 256) — A factor that the hidden size should be a multiple of. This can help optimize certain hardware configurations.
ffn_dim_multiplier (float, optional) — A multiplier for the dimensionality of the feed-forward network. If None, it uses a default value based on the model configuration.
norm_eps (float, optional, defaults to 1e-5) — A small value added to the denominator for numerical stability in normalization layers.
learn_sigma (bool, optional, defaults to True) — Whether the model should also learn sigma, the variance of its predictions, rather than treating it as fixed.
qk_norm (bool, optional, defaults to True) — Indicates if the queries and keys in the attention mechanism should be normalized.
cross_attention_dim (int, optional, defaults to 2048) — The dimensionality of the text embeddings. This parameter defines the size of the text representations used in the model.
scaling_factor (float, optional, defaults to 1.0) — A scaling factor applied to certain parameters or layers in the model. This can be used for adjusting the overall scale of the model’s operations.
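
The defaults above imply a few derived quantities that are easy to check by hand. The following is a minimal sketch in plain Python, not the library's implementation. The helper names (`resolve_kv_heads`, `num_patch_tokens`, `round_up_to_multiple`) are hypothetical. The only behavior taken from this page is the `num_kv_heads` fallback. The rest are assumptions: the per-head width as `hidden_size // num_attention_heads` (the usual multi-head attention convention), a square latent split into non-overlapping patches, and rounding up as the way a size is made a multiple of `multiple_of`.

```python
from typing import Optional


def resolve_kv_heads(num_attention_heads: int, num_kv_heads: Optional[int] = None) -> int:
    """Documented fallback: if num_kv_heads is None, use num_attention_heads."""
    return num_attention_heads if num_kv_heads is None else num_kv_heads


def num_patch_tokens(sample_size: int, patch_size: int) -> int:
    """Assumption: a square latent of side sample_size cut into non-overlapping patches."""
    if sample_size % patch_size != 0:
        raise ValueError("sample_size must be divisible by patch_size")
    return (sample_size // patch_size) ** 2


def round_up_to_multiple(value: int, multiple_of: int) -> int:
    """Assumption: sizes are rounded up to the next multiple of multiple_of."""
    return multiple_of * ((value + multiple_of - 1) // multiple_of)


# Values taken from the documented defaults in the signature above.
kv_heads = resolve_kv_heads(num_attention_heads=32, num_kv_heads=None)
head_dim = 2304 // 32  # assumed per-head width: hidden_size // num_attention_heads
tokens = num_patch_tokens(sample_size=128, patch_size=2)
rounded = round_up_to_multiple(6000, 256)  # 6000 is an arbitrary illustrative size

print(kv_heads, head_dim, tokens, rounded)
```

With the documented defaults this prints `32 72 4096 6144`: 32 key-value heads, an assumed head width of 72, and 4096 patch tokens for a 128-wide latent with 2-wide patches.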

LuminaNextDiT: Diffusion model with a Transformer backbone.

Inherits from ModelMixin and ConfigMixin to be compatible with the sampler StableDiffusionPipeline of diffusers.

forward

( hidden_states: Tensor, timestep: Tensor, encoder_hidden_states: Tensor, encoder_mask: Tensor, image_rotary_emb: Tensor, cross_attention_kwargs: Dict = None, return_dict = True )

Parameters

hidden_states (torch.Tensor) — Input tensor of shape (N, C, H, W).
timestep (torch.Tensor) — Tensor of diffusion timesteps of shape (N,).
encoder_hidden_states (torch.Tensor) — Tensor of caption features of shape (N, D).
encoder_mask (torch.Tensor) — Tensor of caption masks of shape (N, L).
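
The documented shapes share the batch dimension N, and C should match `in_channels`. The sketch below is a hypothetical pre-flight check on plain shape tuples. `check_forward_shapes` is not part of diffusers, and the caption length 77 in the usage lines is an arbitrary illustrative value. It only encodes the shapes listed above for `hidden_states`, `timestep`, and `encoder_mask`.

```python
from typing import List, Tuple


def check_forward_shapes(
    hidden_states: Tuple[int, ...],
    timestep: Tuple[int, ...],
    encoder_mask: Tuple[int, ...],
    in_channels: int = 4,
) -> List[str]:
    """Return a list of problems; an empty list means the shapes are consistent."""
    if len(hidden_states) != 4:
        return ["hidden_states must be 4-D: (N, C, H, W)"]

    problems = []
    n, c, _h, _w = hidden_states
    if c != in_channels:
        problems.append(f"hidden_states has C={c}, expected in_channels={in_channels}")
    if timestep != (n,):
        problems.append(f"timestep must have shape ({n},), got {timestep}")
    if len(encoder_mask) != 2 or encoder_mask[0] != n:
        problems.append(f"encoder_mask must have shape ({n}, L), got {encoder_mask}")
    return problems


# Consistent shapes: batch of 2, 4 latent channels, 128x128 latents.
ok = check_forward_shapes((2, 4, 128, 128), (2,), (2, 77))
# Mismatched batch size in timestep.
bad = check_forward_shapes((2, 4, 128, 128), (3,), (2, 77))
print(ok, bad)
```

Working on shape tuples keeps the check independent of any tensor library; with real tensors, one would pass `tuple(t.shape)` for each argument.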

Forward pass of LuminaNextDiT.
