Diffusers documentation

CogVideoXTransformer3DModel

A Diffusion Transformer model for 3D data from CogVideoX was introduced in CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
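
The loaded transformer can then be passed to CogVideoXPipeline in place of the checkpoint's default component. A minimal sketch:

from diffusers import CogVideoXPipeline

# Reuse the transformer loaded above when building the pipeline.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", transformer=transformer, torch_dtype=torch.float16
).to("cuda")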

CogVideoXTransformer3DModel

class diffusers.CogVideoXTransformer3DModel

( num_attention_heads: int = 30 attention_head_dim: int = 64 in_channels: int = 16 out_channels: Optional[int] = 16 flip_sin_to_cos: bool = True freq_shift: int = 0 time_embed_dim: int = 512 text_embed_dim: int = 4096 num_layers: int = 30 dropout: float = 0.0 attention_bias: bool = True sample_width: int = 90 sample_height: int = 60 sample_frames: int = 49 patch_size: int = 2 temporal_compression_ratio: int = 4 max_text_seq_length: int = 226 activation_fn: str = 'gelu-approximate' timestep_activation_fn: str = 'silu' norm_elementwise_affine: bool = True norm_eps: float = 1e-05 spatial_interpolation_scale: float = 1.875 temporal_interpolation_scale: float = 1.0 use_rotary_positional_embeddings: bool = False )

Parameters

  • num_attention_heads (int, defaults to 30) — The number of heads to use for multi-head attention.
  • attention_head_dim (int, defaults to 64) — The number of channels in each head.
  • in_channels (int, defaults to 16) — The number of channels in the input.
  • out_channels (int, optional, defaults to 16) — The number of channels in the output.
  • flip_sin_to_cos (bool, defaults to True) — Whether to flip the sin to cos in the time embedding.
  • freq_shift (int, defaults to 0) — The frequency shift applied to the sinusoidal time embedding.
  • time_embed_dim (int, defaults to 512) — Output dimension of timestep embeddings.
  • text_embed_dim (int, defaults to 4096) — Input dimension of text embeddings from the text encoder.
  • num_layers (int, defaults to 30) — The number of layers of Transformer blocks to use.
  • dropout (float, defaults to 0.0) — The dropout probability to use.
  • attention_bias (bool, defaults to True) — Whether or not to use bias in the attention projection layers.
  • sample_width (int, defaults to 90) — The width of the input latents.
  • sample_height (int, defaults to 60) — The height of the input latents.
  • sample_frames (int, defaults to 49) — The number of frames in the input latents. Note that this parameter was incorrectly set to 49 instead of 13: CogVideoX processes 13 latent frames at once in its default and recommended settings, but the default cannot be corrected without breaking backwards compatibility. To create a transformer that processes K latent frames, pass ((K - 1) * temporal_compression_ratio + 1), as shown in the sketch below this parameter list.
  • patch_size (int, defaults to 2) — The size of the patches to use in the patch embedding layer.
  • temporal_compression_ratio (int, defaults to 4) — The compression ratio across the temporal dimension. See documentation for sample_frames.
  • max_text_seq_length (int, defaults to 226) — The maximum sequence length of the input text embeddings.
  • activation_fn (str, defaults to "gelu-approximate") — Activation function to use in feed-forward.
  • timestep_activation_fn (str, defaults to "silu") — Activation function to use when generating the timestep embeddings.
  • norm_elementwise_affine (bool, defaults to True) — Whether or not to use elementwise affine in normalization layers.
  • norm_eps (float, defaults to 1e-5) — The epsilon value to use in normalization layers.
  • spatial_interpolation_scale (float, defaults to 1.875) — Scaling factor to apply in 3D positional embeddings across spatial dimensions.
  • temporal_interpolation_scale (float, defaults to 1.0) — Scaling factor to apply in 3D positional embeddings across temporal dimensions.
  • use_rotary_positional_embeddings (bool, defaults to False) — Whether to use rotary positional embeddings instead of sinusoidal 3D positional embeddings.

A Transformer model for video-like data in CogVideoX.
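
The pixel-frame/latent-frame relationship described under sample_frames can be applied directly when instantiating the class. A minimal sketch; the reduced head, layer, width, and height values below are illustrative and do not match the released configuration:

from diffusers import CogVideoXTransformer3DModel

latent_frames = 13  # K latent frames processed at once
temporal_compression_ratio = 4
sample_frames = (latent_frames - 1) * temporal_compression_ratio + 1  # 49, the default

# Randomly initialized toy model configured for `latent_frames` latent frames.
model = CogVideoXTransformer3DModel(
    num_attention_heads=2,
    attention_head_dim=8,
    num_layers=1,
    sample_width=16,
    sample_height=16,
    sample_frames=sample_frames,
    temporal_compression_ratio=temporal_compression_ratio,
)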

fuse_qkv_projections

( )

Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value) are fused. For cross-attention modules, key and value projection matrices are fused.

This API is 🧪 experimental.
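
A minimal sketch of toggling the fused projections around inference, reusing the transformer loaded above; both methods are experimental and may change:

# Fuse Q, K, V projections before running inference.
transformer.fuse_qkv_projections()

# ... run inference with fused projections ...

# Revert to separate projections (see unfuse_qkv_projections below).
transformer.unfuse_qkv_projections()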

set_attn_processor

( processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]] )

Parameters

  • processor (dict of AttentionProcessor or only AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor for all Attention layers.

    If processor is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.

Sets the attention processor to use to compute attention.
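
A minimal sketch of both calling conventions on the transformer loaded above, assuming CogVideoXAttnProcessor2_0 is available in your installed version of Diffusers:

from diffusers.models.attention_processor import CogVideoXAttnProcessor2_0

# Apply one processor class to every attention layer.
transformer.set_attn_processor(CogVideoXAttnProcessor2_0())

# Or pass a dict keyed by the processor path to target layers individually.
processors = {name: CogVideoXAttnProcessor2_0() for name in transformer.attn_processors.keys()}
transformer.set_attn_processor(processors)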

unfuse_qkv_projections

( )

Disables the fused QKV projection if enabled.

This API is 🧪 experimental.

Transformer2DModelOutput

class diffusers.models.modeling_outputs.Transformer2DModelOutput

( sample: torch.Tensor )

Parameters

  • sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.

The output of Transformer2DModel.
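
A minimal sketch of the dataclass itself; the tensor here is random and only illustrates reading the sample field:

import torch
from diffusers.models.modeling_outputs import Transformer2DModelOutput

# Wrap a dummy prediction and read it back through the `sample` field.
output = Transformer2DModelOutput(sample=torch.randn(1, 16, 60, 90))
print(output.sample.shape)

When a model's forward is called with return_dict=False, the same tensor is instead returned as the first element of a plain tuple.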
