Diffusers documentation

TransformerTemporalModel

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v0.31.0).

A Transformer model for video-like data.

TransformerTemporalModel

class diffusers.models.TransformerTemporalModel

(
    num_attention_heads: int = 16,
    attention_head_dim: int = 88,
    in_channels: typing.Optional[int] = None,
    out_channels: typing.Optional[int] = None,
    num_layers: int = 1,
    dropout: float = 0.0,
    norm_num_groups: int = 32,
    cross_attention_dim: typing.Optional[int] = None,
    attention_bias: bool = False,
    sample_size: typing.Optional[int] = None,
    activation_fn: str = 'geglu',
    norm_elementwise_affine: bool = True,
    double_self_attention: bool = True,
    positional_embeddings: typing.Optional[str] = None,
    num_positional_embeddings: typing.Optional[int] = None
)

Parameters

  • num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
  • attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
  • in_channels (int, optional) — The number of channels in the input and output (specify if the input is continuous).
  • num_layers (int, optional, defaults to 1) — The number of layers of Transformer blocks to use.
  • dropout (float, optional, defaults to 0.0) — The dropout probability to use.
  • cross_attention_dim (int, optional) — The number of encoder_hidden_states dimensions to use.
  • attention_bias (bool, optional, defaults to False) — Whether the TransformerBlock attention should contain a bias parameter.
  • sample_size (int, optional) — The width of the latent images (specify if the input is discrete). This is fixed during training since it is used to learn a number of position embeddings.
  • activation_fn (str, optional, defaults to "geglu") — Activation function to use in feed-forward. See diffusers.models.activations.get_activation for supported activation functions.
  • norm_elementwise_affine (bool, optional, defaults to True) — Whether the TransformerBlock should use learnable elementwise affine parameters for normalization.
  • double_self_attention (bool, optional, defaults to True) — Whether each TransformerBlock should contain two self-attention layers.
  • positional_embeddings (str, optional) — The type of positional embeddings to apply to the sequence input before use.
  • num_positional_embeddings (int, optional) — The maximum length of the sequence over which to apply positional embeddings.
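
The two attention parameters combine multiplicatively: num_attention_heads heads with attention_head_dim channels each give the total number of channels the attention layers work on. A minimal sketch of that arithmetic follows; attention_width is a hypothetical helper written for illustration and is not part of diffusers.

```python
def attention_width(num_attention_heads: int = 16, attention_head_dim: int = 88) -> int:
    """Total channels across all heads (hypothetical helper, not a diffusers API)."""
    if num_attention_heads <= 0 or attention_head_dim <= 0:
        raise ValueError("both arguments must be positive")
    # Each head owns attention_head_dim channels, so the widths simply add up.
    return num_attention_heads * attention_head_dim


default_width = attention_width()  # uses the defaults listed above
small_width = attention_width(num_attention_heads=8, attention_head_dim=64)

print(default_width)  # 1408
print(small_width)    # 512
```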

forward

(
    hidden_states: Tensor,
    encoder_hidden_states: typing.Optional[torch.LongTensor] = None,
    timestep: typing.Optional[torch.LongTensor] = None,
    class_labels: LongTensor = None,
    num_frames: int = 1,
    cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
    return_dict: bool = True
) → TransformerTemporalModelOutput or tuple

Parameters

  • hidden_states (torch.LongTensor of shape (batch size, num latent pixels) if discrete, torch.Tensor of shape (batch size, channel, height, width) if continuous) — Input hidden_states.
  • encoder_hidden_states ( torch.LongTensor of shape (batch size, encoder_hidden_states dim), optional) — Conditional embeddings for cross attention layer. If not given, cross-attention defaults to self-attention.
  • timestep ( torch.LongTensor, optional) — Used to indicate denoising step. Optional timestep to be applied as an embedding in AdaLayerNorm.
  • class_labels ( torch.LongTensor of shape (batch size, num classes), optional) — Used to indicate class labels conditioning. Optional class labels to be applied as an embedding in AdaLayerZeroNorm.
  • num_frames (int, optional, defaults to 1) — The number of frames to be processed per batch. This is used to reshape the hidden states.
  • cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a TransformerTemporalModelOutput instead of a plain tuple.
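
The role of num_frames can be sketched with plain shape arithmetic. This assumes the continuous input layout (batch_size * num_frames, channel, height, width) and that attention runs over the frame axis independently at each spatial position; the second point is a reading of the implementation, not something this page states. TemporalLayout and temporal_layout are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Tuple


class TemporalLayout(NamedTuple):
    batch_size: int        # number of videos in the batch
    attention_batch: int   # independent sequences seen by attention
    sequence_length: int   # tokens per sequence, one per frame
    features: int          # channels carried by each token


def temporal_layout(input_shape: Tuple[int, int, int, int], num_frames: int) -> TemporalLayout:
    """Derive the assumed attention layout from a (batch*frames, C, H, W) shape."""
    batch_frames, channels, height, width = input_shape
    if num_frames <= 0 or batch_frames % num_frames != 0:
        raise ValueError("first dimension must be a positive multiple of num_frames")
    batch_size = batch_frames // num_frames
    # Assumption: every spatial position becomes its own sequence over frames.
    return TemporalLayout(batch_size, batch_size * height * width, num_frames, channels)


layout = temporal_layout((8, 320, 32, 32), num_frames=4)
print(layout)
# TemporalLayout(batch_size=2, attention_batch=2048, sequence_length=4, features=320)
```

Under this reading, the first dimension of hidden_states has to be a multiple of num_frames, otherwise the frames cannot be regrouped into whole videos.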

Returns

TransformerTemporalModelOutput or tuple

If return_dict is True, a TransformerTemporalModelOutput is returned; otherwise, a tuple is returned whose first element is the sample tensor.

The TransformerTemporalModel forward method.

TransformerTemporalModelOutput

class diffusers.models.transformers.transformer_temporal.TransformerTemporalModelOutput

( sample: Tensor )

Parameters

  • sample (torch.Tensor of shape (batch_size x num_frames, num_channels, height, width)) — The hidden states output conditioned on encoder_hidden_states input.

The output of TransformerTemporalModel.
