Diffusers documentation
OvisImageTransformer2DModel
The model can be loaded with the following code snippet.

import torch
from diffusers import OvisImageTransformer2DModel

transformer = OvisImageTransformer2DModel.from_pretrained("AIDC-AI/Ovis-Image-7B", subfolder="transformer", torch_dtype=torch.bfloat16)

OvisImageTransformer2DModel
class diffusers.OvisImageTransformer2DModel

( patch_size: int = 1, in_channels: int = 64, out_channels: typing.Optional[int] = 64, num_layers: int = 6, num_single_layers: int = 27, attention_head_dim: int = 128, num_attention_heads: int = 24, joint_attention_dim: int = 2048, axes_dims_rope: typing.Tuple[int, int, int] = (16, 56, 56) )
Parameters

- patch_size (int, defaults to 1) — Patch size to turn the input data into small patches.
- in_channels (int, defaults to 64) — The number of channels in the input.
- out_channels (int, optional, defaults to 64) — The number of channels in the output. If None, it defaults to in_channels.
- num_layers (int, defaults to 6) — The number of dual-stream DiT blocks to use.
- num_single_layers (int, defaults to 27) — The number of single-stream DiT blocks to use.
- attention_head_dim (int, defaults to 128) — The number of dimensions to use for each attention head.
- num_attention_heads (int, defaults to 24) — The number of attention heads to use.
- joint_attention_dim (int, defaults to 2048) — The number of dimensions to use for the joint attention (embedding/channel dimension of encoder_hidden_states).
- axes_dims_rope (Tuple[int], defaults to (16, 56, 56)) — The dimensions to use for the rotary positional embeddings.
The Transformer model introduced in Ovis-Image.
Reference: https://github.com/AIDC-AI/Ovis-Image
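
For quick experiments without downloading the pretrained weights, the class can also be instantiated directly from its constructor arguments. Below is a minimal sketch with a deliberately tiny configuration; the small values are arbitrary and chosen for speed, not taken from the released checkpoint. Note that in the default configuration the RoPE axis dimensions sum to attention_head_dim (16 + 56 + 56 = 128), so the sketch keeps that invariant.

import torch
from diffusers import OvisImageTransformer2DModel

# Tiny configuration for smoke tests; real checkpoints use the defaults documented above.
transformer = OvisImageTransformer2DModel(
    patch_size=1,
    in_channels=16,
    out_channels=16,
    num_layers=2,
    num_single_layers=4,
    attention_head_dim=32,
    num_attention_heads=4,
    joint_attention_dim=64,
    axes_dims_rope=(8, 12, 12),  # keeps sum(axes_dims_rope) == attention_head_dim
)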
forward

( hidden_states: Tensor, encoder_hidden_states: Tensor = None, timestep: LongTensor = None, img_ids: Tensor = None, txt_ids: Tensor = None, return_dict: bool = True )
Parameters

- hidden_states (torch.Tensor of shape (batch_size, image_sequence_length, in_channels)) — Input hidden_states.
- encoder_hidden_states (torch.Tensor of shape (batch_size, text_sequence_length, joint_attention_dim)) — Conditional embeddings (embeddings computed from input conditions such as prompts) to use.
- timestep (torch.LongTensor) — Used to indicate the denoising step.
- img_ids (torch.Tensor) — The position ids for image tokens.
- txt_ids (torch.Tensor) — The position ids for text tokens.
- return_dict (bool, optional, defaults to True) — Whether or not to return a ~models.transformer_2d.Transformer2DModelOutput instead of a plain tuple.
The OvisImageTransformer2DModel forward method.
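
A minimal forward-pass sketch with random inputs, matching the shapes documented above and reusing the tiny transformer from the instantiation sketch earlier. The 3-column layout of img_ids and txt_ids is an assumption based on the three-axis axes_dims_rope default (the pattern used by other Flux-style transformers in Diffusers), and the timestep value is an arbitrary placeholder.

import torch

batch_size, image_seq_len, text_seq_len = 1, 64, 16

hidden_states = torch.randn(batch_size, image_seq_len, 16)         # (batch, image tokens, in_channels)
encoder_hidden_states = torch.randn(batch_size, text_seq_len, 64)  # (batch, text tokens, joint_attention_dim)
timestep = torch.tensor([500])                                     # one denoising step per batch element (placeholder value)

# Assumption: 3-column position ids, one column per RoPE axis in axes_dims_rope.
img_ids = torch.zeros(image_seq_len, 3)
txt_ids = torch.zeros(text_seq_len, 3)

with torch.no_grad():
    output = transformer(
        hidden_states=hidden_states,
        encoder_hidden_states=encoder_hidden_states,
        timestep=timestep,
        img_ids=img_ids,
        txt_ids=txt_ids,
        return_dict=True,
    )

print(output.sample.shape)  # expected: (batch_size, image_seq_len, out_channels)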