CogView3PlusTransformer2DModel
A Diffusion Transformer model for 2D data from CogView3Plus was introduced in CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion by Tsinghua University & ZhipuAI.
The model can be loaded with the following code snippet.
import torch
from diffusers import CogView3PlusTransformer2DModel
transformer = CogView3PlusTransformer2DModel.from_pretrained("THUDM/CogView3Plus-3b", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
CogView3PlusTransformer2DModel
class diffusers.CogView3PlusTransformer2DModel
( patch_size: int = 2, in_channels: int = 16, num_layers: int = 30, attention_head_dim: int = 40, num_attention_heads: int = 64, out_channels: int = 16, text_embed_dim: int = 4096, time_embed_dim: int = 512, condition_dim: int = 256, pos_embed_max_size: int = 128, sample_size: int = 128 )
Parameters
- patch_size (int, defaults to 2): The size of the patches to use in the patch embedding layer.
- in_channels (int, defaults to 16): The number of channels in the input.
- num_layers (int, defaults to 30): The number of layers of Transformer blocks to use.
- attention_head_dim (int, defaults to 40): The number of channels in each head.
- num_attention_heads (int, defaults to 64): The number of heads to use for multi-head attention.
- out_channels (int, defaults to 16): The number of channels in the output.
- text_embed_dim (int, defaults to 4096): Input dimension of text embeddings from the text encoder.
- time_embed_dim (int, defaults to 512): Output dimension of timestep embeddings.
- condition_dim (int, defaults to 256): The embedding dimension of the input SDXL-style resolution conditions (original_size, target_size, crop_coords).
- pos_embed_max_size (int, defaults to 128): The maximum resolution of the positional embeddings, from which slices of shape H x W are taken and added to input patched latents, where H and W are the latent height and width respectively. A value of 128 means that the maximum supported height and width for image generation are 128 * vae_scale_factor * patch_size => 128 * 8 * 2 => 2048.
- sample_size (int, defaults to 128): The base resolution of input latents. If height/width is not provided during generation, this value is used to determine the resolution as sample_size * vae_scale_factor => 128 * 8 => 1024.
The Transformer model introduced in CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion.
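The resolution arithmetic in the pos_embed_max_size and sample_size descriptions can be sketched in plain Python. This is only an illustration of the formulas quoted above: the helper names are hypothetical and not part of diffusers, vae_scale_factor = 8 is taken from the worked numbers in the parameter descriptions, and the divisibility check in patch_grid is an assumption about what a valid image size looks like.

```python
# Hypothetical helpers (not part of diffusers) that reproduce the arithmetic
# from the parameter descriptions. vae_scale_factor=8 comes from the worked
# numbers in the documentation above.

def default_image_size(sample_size: int = 128, vae_scale_factor: int = 8) -> int:
    """Pixel resolution used when height/width is not provided."""
    return sample_size * vae_scale_factor


def max_image_size(
    pos_embed_max_size: int = 128, vae_scale_factor: int = 8, patch_size: int = 2
) -> int:
    """Largest pixel height/width covered by the positional embeddings."""
    return pos_embed_max_size * vae_scale_factor * patch_size


def patch_grid(
    height: int, width: int, vae_scale_factor: int = 8, patch_size: int = 2
) -> tuple[int, int]:
    """Number of patches along each axis for an image of the given pixel size.

    Assumption: the pixel size must be divisible by vae_scale_factor * patch_size.
    """
    step = vae_scale_factor * patch_size
    if height % step or width % step:
        raise ValueError(f"height and width must be multiples of {step}")
    return height // step, width // step


print(default_image_size())    # 128 * 8
print(max_image_size())        # 128 * 8 * 2
print(patch_grid(1024, 1024))  # patches per axis at the default resolution
```

With the default configuration, the patch grid of a 2048 x 2048 image has exactly pos_embed_max_size entries per axis, which is why 2048 is the stated upper bound.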
forward
( hidden_states: Tensor, encoder_hidden_states: Tensor, timestep: LongTensor, original_size: Tensor, target_size: Tensor, crop_coords: Tensor, return_dict: bool = True ) → torch.Tensor or ~models.transformer_2d.Transformer2DModelOutput
Parameters
- hidden_states (torch.Tensor): Input hidden_states of shape (batch size, channel, height, width).
- encoder_hidden_states (torch.Tensor): Conditional embeddings (embeddings computed from the input conditions such as prompts) of shape (batch_size, sequence_len, text_embed_dim).
- timestep (torch.LongTensor): Used to indicate denoising step.
- original_size (torch.Tensor): CogView3 uses SDXL-like micro-conditioning for original image size as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
- target_size (torch.Tensor): CogView3 uses SDXL-like micro-conditioning for target image size as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
- crop_coords (torch.Tensor): CogView3 uses SDXL-like micro-conditioning for crop coordinates as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
- return_dict (bool, optional, defaults to True): Whether or not to return a ~models.transformer_2d.Transformer2DModelOutput instead of a plain tuple.
Returns
torch.Tensor or ~models.transformer_2d.Transformer2DModelOutput
The denoised latents using provided inputs as conditioning.
The CogView3PlusTransformer2DModel forward method.
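As a quick reference for the forward inputs, the sketch below tabulates the tensor shapes one would prepare for a call. The helper name is hypothetical and not part of diffusers. The shapes of hidden_states and encoder_hidden_states follow the parameter descriptions above; the shapes given for timestep and for the three micro-conditioning tensors (one scalar per sample, and one pair of values per sample) are assumptions based on the SDXL-style conditioning the descriptions refer to, so they should be checked against the installed version.

```python
# Hypothetical helper (not part of diffusers) listing the shapes of the
# tensors passed to forward(). Only the first two entries are stated in the
# documentation; the rest are labeled assumptions.

def forward_input_shapes(
    batch_size: int,
    latent_height: int,
    latent_width: int,
    sequence_len: int,
    in_channels: int = 16,
    text_embed_dim: int = 4096,
) -> dict[str, tuple[int, ...]]:
    return {
        # Documented: (batch size, channel, height, width)
        "hidden_states": (batch_size, in_channels, latent_height, latent_width),
        # Documented: (batch_size, sequence_len, text_embed_dim)
        "encoder_hidden_states": (batch_size, sequence_len, text_embed_dim),
        # Assumption: one denoising step index per sample
        "timestep": (batch_size,),
        # Assumption: SDXL-style (height, width) pair per sample
        "original_size": (batch_size, 2),
        "target_size": (batch_size, 2),
        # Assumption: SDXL-style (top, left) pair per sample
        "crop_coords": (batch_size, 2),
    }


# Example: a batch of two 1024 x 1024 images (128 x 128 latents); the text
# sequence length of 224 is an arbitrary illustrative value.
for name, shape in forward_input_shapes(2, 128, 128, 224).items():
    print(f"{name}: {shape}")
```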
set_attn_processor
( processor: Union )
Parameters
- processor (dict of AttentionProcessor or only AttentionProcessor): The instantiated processor class or a dictionary of processor classes that will be set as the processor for all Attention layers. If processor is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.
Sets the attention processor to use to compute attention.
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
( sample: torch.Tensor )
Parameters
- sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete): The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.