CogVideoXTransformer3DModel
A Diffusion Transformer model for 3D data from CogVideoX was introduced in CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer by Tsinghua University & ZhipuAI.
The model can be loaded with the following code snippet.
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
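In a typical workflow the transformer is not called directly but plugged into a pipeline. The snippet below is a minimal sketch, assuming the standard CogVideoXPipeline.from_pretrained interface; it shows how a separately loaded (for example, fine-tuned) transformer can be passed to the pipeline, which loads the remaining components (VAE, text encoder, scheduler) from the base checkpoint.

import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel

# Load the transformer on its own, e.g. a fine-tuned checkpoint.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16
)

# Hand it to the pipeline; the other components come from the base checkpoint.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", transformer=transformer, torch_dtype=torch.float16
).to("cuda")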
CogVideoXTransformer3DModel
class diffusers.CogVideoXTransformer3DModel
( num_attention_heads: int = 30 attention_head_dim: int = 64 in_channels: Optional = 16 out_channels: Optional = 16 flip_sin_to_cos: bool = True freq_shift: int = 0 time_embed_dim: int = 512 text_embed_dim: int = 4096 num_layers: int = 30 dropout: float = 0.0 attention_bias: bool = True sample_width: int = 90 sample_height: int = 60 sample_frames: int = 49 patch_size: int = 2 temporal_compression_ratio: int = 4 max_text_seq_length: int = 226 activation_fn: str = 'gelu-approximate' timestep_activation_fn: str = 'silu' norm_elementwise_affine: bool = True norm_eps: float = 1e-05 spatial_interpolation_scale: float = 1.875 temporal_interpolation_scale: float = 1.0 )
Parameters
- num_attention_heads (int, defaults to 30) — The number of heads to use for multi-head attention.
- attention_head_dim (int, defaults to 64) — The number of channels in each head.
- in_channels (int, optional, defaults to 16) — The number of channels in the input.
- out_channels (int, optional, defaults to 16) — The number of channels in the output.
- flip_sin_to_cos (bool, defaults to True) — Whether to flip the sin to cos in the time embedding.
- freq_shift (int, defaults to 0) — Frequency shift applied in the sinusoidal time embedding.
- time_embed_dim (int, defaults to 512) — Output dimension of the timestep embeddings.
- text_embed_dim (int, defaults to 4096) — Input dimension of the text embeddings coming from the text encoder.
- num_layers (int, defaults to 30) — The number of layers of Transformer blocks to use.
- dropout (float, defaults to 0.0) — The dropout probability to use.
- attention_bias (bool, defaults to True) — Whether or not to use bias in the attention projection layers.
- sample_width (int, defaults to 90) — The width of the input latents.
- sample_height (int, defaults to 60) — The height of the input latents.
- sample_frames (int, defaults to 49) — The number of frames in the input latents.
- patch_size (int, defaults to 2) — The size of the patches to use in the patch embedding layer.
- temporal_compression_ratio (int, defaults to 4) — The compression ratio across the temporal dimension applied by the VAE.
- max_text_seq_length (int, defaults to 226) — The maximum sequence length of the input text embeddings.
- activation_fn (str, defaults to "gelu-approximate") — Activation function to use in feed-forward.
- timestep_activation_fn (str, defaults to "silu") — Activation function to use when generating the timestep embeddings.
- norm_elementwise_affine (bool, defaults to True) — Whether or not to use elementwise affine in normalization layers.
- norm_eps (float, defaults to 1e-5) — The epsilon value to use in normalization layers.
- spatial_interpolation_scale (float, defaults to 1.875) — Scaling factor to apply in 3D positional embeddings across spatial dimensions.
- temporal_interpolation_scale (float, defaults to 1.0) — Scaling factor to apply in 3D positional embeddings across temporal dimensions.
A Transformer model for video-like data in CogVideoX.
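The defaults above fix the latent grid the transformer operates on. The short sketch below is illustrative arithmetic only (the variable names are local placeholders mirroring the config fields, not diffusers API); it shows how the number of video tokens follows from the sample dimensions, the temporal compression ratio of the VAE, and the patch size.

# Illustrative arithmetic only; these are local variables, not diffusers API.
sample_frames = 49
temporal_compression_ratio = 4
sample_height, sample_width = 60, 90
patch_size = 2

# The VAE compresses frames temporally: (49 - 1) / 4 + 1 = 13 latent frames.
latent_frames = (sample_frames - 1) // temporal_compression_ratio + 1

# Each latent frame is split into 2x2 spatial patches: 30 * 45 = 1350 patches.
patches_per_frame = (sample_height // patch_size) * (sample_width // patch_size)

# Total video tokens processed by the transformer: 13 * 1350 = 17550.
video_tokens = latent_frames * patches_per_frame
print(latent_frames, patches_per_frame, video_tokens)  # 13 1350 17550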
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
( sample: torch.Tensor )
Parameters
- sample (torch.Tensor of shape (batch_size, num_channels, height, width), or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
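As a usage sketch, the prediction is read from the output's sample attribute. The tensors latents, prompt_embeds, and timestep are placeholders assumed to be prepared elsewhere (for example inside a sampling loop), and the keyword names are taken from the transformer's forward signature as an assumption.

# Sketch only: `latents`, `prompt_embeds`, and `timestep` are not defined here.
output = transformer(
    hidden_states=latents,                # noisy video latents
    encoder_hidden_states=prompt_embeds,  # text encoder embeddings
    timestep=timestep,
)
noise_pred = output.sample  # the Transformer2DModelOutput field documented above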