CogVideoXTransformer3DModel
A Diffusion Transformer model for 3D data from CogVideoX was introduced in CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer by Tsinghua University & ZhipuAI.
The model can be loaded with the following code snippet.
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
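The loaded transformer can then be handed to a video pipeline. The following is a minimal sketch continuing from the snippet above; the prompt, frame count, and output path are illustrative assumptions, not part of this page's API reference.

from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Reuse the transformer loaded above; the remaining components come from the same checkpoint.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", transformer=transformer, torch_dtype=torch.float16
).to("cuda")

# Prompt and num_frames are illustrative; 49 frames matches the checkpoint's default setting.
video = pipe("a panda playing a guitar on a mountain top", num_frames=49).frames[0]
export_to_video(video, "output.mp4", fps=8)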
CogVideoXTransformer3DModel
class diffusers.CogVideoXTransformer3DModel
< source >( num_attention_heads: int = 30 attention_head_dim: int = 64 in_channels: int = 16 out_channels: Optional[int] = 16 flip_sin_to_cos: bool = True freq_shift: int = 0 time_embed_dim: int = 512 text_embed_dim: int = 4096 num_layers: int = 30 dropout: float = 0.0 attention_bias: bool = True sample_width: int = 90 sample_height: int = 60 sample_frames: int = 49 patch_size: int = 2 temporal_compression_ratio: int = 4 max_text_seq_length: int = 226 activation_fn: str = 'gelu-approximate' timestep_activation_fn: str = 'silu' norm_elementwise_affine: bool = True norm_eps: float = 1e-05 spatial_interpolation_scale: float = 1.875 temporal_interpolation_scale: float = 1.0 use_rotary_positional_embeddings: bool = False )
Parameters
- num_attention_heads (int, defaults to 30) — The number of heads to use for multi-head attention.
- attention_head_dim (int, defaults to 64) — The number of channels in each head.
- in_channels (int, defaults to 16) — The number of channels in the input.
- out_channels (int, optional, defaults to 16) — The number of channels in the output.
- flip_sin_to_cos (bool, defaults to True) — Whether to flip the sin to cos in the time embedding.
- time_embed_dim (int, defaults to 512) — Output dimension of timestep embeddings.
- text_embed_dim (int, defaults to 4096) — Input dimension of text embeddings from the text encoder.
- num_layers (int, defaults to 30) — The number of layers of Transformer blocks to use.
- dropout (float, defaults to 0.0) — The dropout probability to use.
- attention_bias (bool, defaults to True) — Whether or not to use bias in the attention projection layers.
- sample_width (int, defaults to 90) — The width of the input latents.
- sample_height (int, defaults to 60) — The height of the input latents.
- sample_frames (int, defaults to 49) — The number of frames in the input latents. Note that this parameter was incorrectly initialized to 49 instead of 13 (CogVideoX processes 13 latent frames at once in its default and recommended settings) and cannot be changed to the correct value without breaking backwards compatibility. To create a transformer that processes K latent frames, pass ((K - 1) * temporal_compression_ratio + 1) here (see the sketch after this list).
- patch_size (int, defaults to 2) — The size of the patches to use in the patch embedding layer.
- temporal_compression_ratio (int, defaults to 4) — The compression ratio across the temporal dimension. See the documentation for sample_frames.
- max_text_seq_length (int, defaults to 226) — The maximum sequence length of the input text embeddings.
- activation_fn (str, defaults to "gelu-approximate") — Activation function to use in the feed-forward layers.
- timestep_activation_fn (str, defaults to "silu") — Activation function to use when generating the timestep embeddings.
- norm_elementwise_affine (bool, defaults to True) — Whether or not to use elementwise affine in normalization layers.
- norm_eps (float, defaults to 1e-5) — The epsilon value to use in normalization layers.
- spatial_interpolation_scale (float, defaults to 1.875) — Scaling factor to apply in 3D positional embeddings across spatial dimensions.
- temporal_interpolation_scale (float, defaults to 1.0) — Scaling factor to apply in 3D positional embeddings across temporal dimensions.
A Transformer model for video-like data in CogVideoX.
fuse_qkv_projections
< source >( )
Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
This API is 🧪 experimental.
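A minimal sketch of enabling the fusion before inference, assuming transformer is an instance loaded as in the snippet at the top of this page:

# Experimental: fuse the query/key/value projections of each attention layer.
transformer.fuse_qkv_projections()
# ... run inference with the fused projections ...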
set_attn_processor
< source >( processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]] )
Parameters
- processor (dict of AttentionProcessor or only AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor for all Attention layers. If processor is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.
Sets the attention processor to use to compute attention.
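A minimal sketch of both call styles, assuming a transformer instance as above and CogVideoXAttnProcessor2_0 from diffusers.models.attention_processor (the processor class used by this model by default):

from diffusers.models.attention_processor import CogVideoXAttnProcessor2_0

# Set one processor instance for every Attention layer at once...
transformer.set_attn_processor(CogVideoXAttnProcessor2_0())

# ...or build a dict keyed by processor path to target layers individually.
processors = {name: CogVideoXAttnProcessor2_0() for name in transformer.attn_processors}
transformer.set_attn_processor(processors)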
unfuse_qkv_projections
< source >( )
Disables the fused QKV projection if enabled.
This API is 🧪 experimental.
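To undo the fusion from the earlier sketch (assuming fuse_qkv_projections() was called on the same instance):

transformer.unfuse_qkv_projections()  # restores the original, unfused projection weights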
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
< source >( sample: torch.Tensor )
Parameters
- sample (torch.Tensor of shape (batch_size, num_channels, height, width), or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
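A minimal sketch of a forward pass that produces this output, assuming the reduced-size transformer constructed after the parameter list above (its default config does not use rotary positional embeddings, so none are passed) and dummy tensors whose shapes are illustrative only:

import torch

# (batch, latent_frames, in_channels, sample_height, sample_width)
hidden_states = torch.randn(1, 13, 16, 60, 90)
# (batch, max_text_seq_length, text_embed_dim)
encoder_hidden_states = torch.randn(1, 226, 4096)
timestep = torch.tensor([999])

output = transformer(
    hidden_states=hidden_states,
    encoder_hidden_states=encoder_hidden_states,
    timestep=timestep,
    return_dict=True,
)
print(output.sample.shape)  # denoised latent prediction, same layout as hidden_states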