Diffusers

You are viewing version v0.21.0. A newer version, v0.35.1, is available.

Transformer Temporal

A Transformer model for video-like data.

TransformerTemporalModel

class diffusers.models.TransformerTemporalModel

( num_attention_heads: int = 16, attention_head_dim: int = 88, in_channels: typing.Optional[int] = None, out_channels: typing.Optional[int] = None, num_layers: int = 1, dropout: float = 0.0, norm_num_groups: int = 32, cross_attention_dim: typing.Optional[int] = None, attention_bias: bool = False, sample_size: typing.Optional[int] = None, activation_fn: str = 'geglu', norm_elementwise_affine: bool = True, double_self_attention: bool = True )

Parameters

num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
in_channels (int, optional) — The number of channels in the input and output (specify if the input is continuous).
num_layers (int, optional, defaults to 1) — The number of layers of Transformer blocks to use.
dropout (float, optional, defaults to 0.0) — The dropout probability to use.
cross_attention_dim (int, optional) — The number of encoder_hidden_states dimensions to use.
sample_size (int, optional) — The width of the latent images (specify if the input is discrete). This is fixed during training since it is used to learn a number of position embeddings.
activation_fn (str, optional, defaults to "geglu") — Activation function to use in feed-forward.
attention_bias (bool, optional, defaults to False) — Whether the TransformerBlock attention should contain a bias parameter.
double_self_attention (bool, optional, defaults to True) — Whether each TransformerBlock should contain two self-attention layers.

forward

( hidden_states, encoder_hidden_states = None, timestep = None, class_labels = None, num_frames = 1, cross_attention_kwargs = None, return_dict: bool = True ) → TransformerTemporalModelOutput or tuple

Parameters

hidden_states (torch.LongTensor of shape (batch size, num latent pixels) if discrete, torch.FloatTensor of shape (batch size, channel, height, width) if continuous) — Input hidden_states.
encoder_hidden_states (torch.LongTensor of shape (batch size, encoder_hidden_states dim), optional) — Conditional embeddings for cross attention layer. If not given, cross-attention defaults to self-attention.
timestep (torch.long, optional) — Used to indicate denoising step. Optional timestep to be applied as an embedding in AdaLayerNorm.
class_labels (torch.LongTensor of shape (batch size, num classes), optional) — Used to indicate class labels conditioning. Optional class labels to be applied as an embedding in AdaLayerZeroNorm.
num_frames (int, optional, defaults to 1) — The number of frames per video. The leading dimension of hidden_states is batch size × num_frames, and this value is used to split it back into separate videos.
return_dict (bool, optional, defaults to True) — Whether or not to return a TransformerTemporalModelOutput instead of a plain tuple.

Returns

TransformerTemporalModelOutput or tuple

If return_dict is True, a TransformerTemporalModelOutput is returned, otherwise a tuple where the first element is the sample tensor.

The TransformerTemporal forward method.

TransformerTemporalModelOutput

class diffusers.models.transformer_temporal.TransformerTemporalModelOutput

( sample: FloatTensor )

Parameters

sample (torch.FloatTensor of shape (batch_size x num_frames, num_channels, height, width)) — The hidden states output conditioned on encoder_hidden_states input.

The output of TransformerTemporalModel.
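
Because sample keeps batch and frames fused in its first dimension, downstream code often needs to separate them again. The helper below is hypothetical (it is not part of diffusers) and only does the shape bookkeeping, assuming the frames of each video are contiguous along the batch axis.

```python
def split_batch_frames(shape, num_frames):
    """Hypothetical helper: map (batch_size * num_frames, C, H, W)
    to (batch_size, num_frames, C, H, W)."""
    batch_frames, channels, height, width = shape
    if num_frames <= 0 or batch_frames % num_frames != 0:
        raise ValueError(
            f"leading dimension {batch_frames} is not a multiple of num_frames={num_frames}"
        )
    return (batch_frames // num_frames, num_frames, channels, height, width)


print(split_batch_frames((8, 32, 8, 8), num_frames=4))  # (2, 4, 32, 8, 8)
```

With a real tensor, the resulting shape could be passed to reshape, for example sample.reshape(split_batch_frames(tuple(sample.shape), num_frames)).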
