ChronoEditTransformer3DModel
A Diffusion Transformer model for 3D video-like data, introduced in ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, and Huan Ling, from NVIDIA and the University of Toronto.
TL;DR: ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.
The model can be loaded with the following code snippet.

import torch
from diffusers import ChronoEditTransformer3DModel

transformer = ChronoEditTransformer3DModel.from_pretrained("nvidia/ChronoEdit-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)

ChronoEditTransformer3DModel
class diffusers.ChronoEditTransformer3DModel
< source >( patch_size: typing.Tuple[int] = (1, 2, 2) num_attention_heads: int = 40 attention_head_dim: int = 128 in_channels: int = 16 out_channels: int = 16 text_dim: int = 4096 freq_dim: int = 256 ffn_dim: int = 13824 num_layers: int = 40 cross_attn_norm: bool = True qk_norm: typing.Optional[str] = 'rms_norm_across_heads' eps: float = 1e-06 image_dim: typing.Optional[int] = None added_kv_proj_dim: typing.Optional[int] = None rope_max_seq_len: int = 1024 pos_embed_seq_len: typing.Optional[int] = None rope_temporal_skip_len: int = 8 )
Parameters
- patch_size (Tuple[int], defaults to (1, 2, 2)) — 3D patch dimensions for video embedding (t_patch, h_patch, w_patch).
- num_attention_heads (int, defaults to 40) — The number of attention heads.
- attention_head_dim (int, defaults to 128) — The number of channels in each head.
- in_channels (int, defaults to 16) — The number of channels in the input.
- out_channels (int, defaults to 16) — The number of channels in the output.
- text_dim (int, defaults to 4096) — Input dimension for text embeddings.
- freq_dim (int, defaults to 256) — Dimension for sinusoidal time embeddings.
- ffn_dim (int, defaults to 13824) — Intermediate dimension in the feed-forward network.
- num_layers (int, defaults to 40) — The number of transformer blocks to use.
- window_size (Tuple[int], defaults to (-1, -1)) — Window size for local attention (-1 indicates global attention).
- cross_attn_norm (bool, defaults to True) — Enable cross-attention normalization.
- qk_norm (str, optional, defaults to "rms_norm_across_heads") — Enable query/key normalization.
- eps (float, defaults to 1e-6) — Epsilon value for normalization layers.
- add_img_emb (bool, defaults to False) — Whether to use img_emb.
- added_kv_proj_dim (int, optional, defaults to None) — The number of channels to use for the added key and value projections. If None, no projection is used.
A Transformer model for video-like data used in the ChronoEdit model.
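Because the constructor arguments above fully describe the architecture, a tiny randomly initialized configuration is enough to sanity-check input and output shapes without downloading the 14B checkpoint. Below is a minimal sketch, assuming the Wan-style forward signature (hidden_states, timestep, encoder_hidden_states) used by this model family; the small dimension values are illustrative, not the released configuration.

import torch
from diffusers import ChronoEditTransformer3DModel

# Tiny illustrative configuration (assumed values, not the 14B checkpoint)
transformer = ChronoEditTransformer3DModel(
    num_attention_heads=2,
    attention_head_dim=8,
    in_channels=16,
    out_channels=16,
    text_dim=32,
    ffn_dim=64,
    num_layers=2,
)

batch_size, num_frames, height, width = 1, 5, 8, 8  # latent-space sizes
hidden_states = torch.randn(batch_size, 16, num_frames, height, width)
encoder_hidden_states = torch.randn(batch_size, 77, 32)  # (batch, seq_len, text_dim)
timestep = torch.tensor([0])  # one diffusion timestep per batch element

with torch.no_grad():
    output = transformer(
        hidden_states=hidden_states,
        timestep=timestep,
        encoder_hidden_states=encoder_hidden_states,
    )
print(output.sample.shape)  # expected: torch.Size([1, 16, 5, 8, 8])

Note that height and width must be divisible by the spatial patch size (2 here), and the latent channel count must match in_channels.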
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
< source >( sample: torch.Tensor )
Parameters
- sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
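Continuing the sketch above, the prediction is read from the sample field of this output class. Assuming the standard diffusers return_dict convention applies to this forward method, passing return_dict=False returns a plain tuple instead:

latents = output.sample  # (batch, out_channels, num_frames, height, width)

# Assuming the usual diffusers convention, return_dict=False yields a tuple
(latents,) = transformer(
    hidden_states=hidden_states,
    timestep=timestep,
    encoder_hidden_states=encoder_hidden_states,
    return_dict=False,
)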