Daankular committed on
Commit 6b03536 · 1 Parent(s): bc24ce2

Patch PSHuman unet: _load_state_dict_into_model moved to model_loading_utils in diffusers>=0.28

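In short, the added module guards the import of diffusers' private `_load_state_dict_into_model` helper, which moved from `diffusers.models.modeling_utils` to `diffusers.models.model_loading_utils` in diffusers>=0.28. A minimal sketch of the compatibility shim used near the top of the patched file:

    # Try the new location first (diffusers>=0.28), then fall back to the
    # old one so the module keeps working on earlier diffusers releases.
    try:
        from diffusers.models.model_loading_utils import _load_state_dict_into_model
    except ImportError:
        from diffusers.models.modeling_utils import _load_state_dict_into_model

Catching ImportError (rather than pinning a version) keeps the patch working even if the helper moves again, as long as one of the two import paths still exists.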
patches/pshuman/mvdiffusion/models_unclip/unet_mv2d_condition.py ADDED
@@ -0,0 +1,1727 @@
1
+ # Copyright 2023 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ from dataclasses import dataclass
15
+ from typing import Any, Dict, List, Optional, Tuple, Union
16
+ import os
17
+
18
+ import torch
19
+ import torch.nn as nn
20
+ import torch.utils.checkpoint
21
+
22
+ from diffusers.configuration_utils import ConfigMixin, register_to_config
23
+ from diffusers.loaders import UNet2DConditionLoadersMixin
24
+ from diffusers.utils import BaseOutput, logging
25
+ from diffusers.models.activations import get_activation
26
+ from diffusers.models.attention_processor import AttentionProcessor, AttnProcessor
27
+ from diffusers.models.embeddings import (
28
+ GaussianFourierProjection,
29
+ ImageHintTimeEmbedding,
30
+ ImageProjection,
31
+ ImageTimeEmbedding,
32
+ TextImageProjection,
33
+ TextImageTimeEmbedding,
34
+ TextTimeEmbedding,
35
+ TimestepEmbedding,
36
+ Timesteps,
37
+ )
38
+ from diffusers.models.modeling_utils import ModelMixin, load_state_dict
39
+ try:
40
+ from diffusers.models.model_loading_utils import _load_state_dict_into_model
41
+ except ImportError:
42
+ from diffusers.models.modeling_utils import _load_state_dict_into_model
43
+ from diffusers.models.unets.unet_2d_blocks import (
44
+ CrossAttnDownBlock2D,
45
+ CrossAttnUpBlock2D,
46
+ DownBlock2D,
47
+ UNetMidBlock2DCrossAttn,
48
+ UNetMidBlock2DSimpleCrossAttn,
49
+ UpBlock2D,
50
+ )
51
+ from diffusers.utils import (
52
+ CONFIG_NAME,
53
+ FLAX_WEIGHTS_NAME,
54
+ SAFETENSORS_WEIGHTS_NAME,
55
+ WEIGHTS_NAME,
56
+ _add_variant,
57
+ _get_model_file,
58
+ deprecate,
59
+ is_torch_version,
60
+ logging,
61
+ )
62
+ from diffusers.utils.import_utils import is_accelerate_available
63
+ from diffusers.utils.hub_utils import HF_HUB_OFFLINE
64
+ from huggingface_hub.constants import HUGGINGFACE_HUB_CACHE
65
+ DIFFUSERS_CACHE = HUGGINGFACE_HUB_CACHE
66
+
67
+ from diffusers import __version__
68
+ from .unet_mv2d_blocks import (
69
+ CrossAttnDownBlockMV2D,
70
+ CrossAttnUpBlockMV2D,
71
+ UNetMidBlockMV2DCrossAttn,
72
+ get_down_block,
73
+ get_up_block,
74
+ )
75
+ from einops import rearrange, repeat
76
+
77
+ from diffusers import __version__
78
+ from mvdiffusion.models_unclip.unet_mv2d_blocks import (
79
+ CrossAttnDownBlockMV2D,
80
+ CrossAttnUpBlockMV2D,
81
+ UNetMidBlockMV2DCrossAttn,
82
+ get_down_block,
83
+ get_up_block,
84
+ )
85
+
86
+
87
+ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
88
+
89
+
90
+ @dataclass
91
+ class UNetMV2DConditionOutput(BaseOutput):
92
+ """
93
+ The output of [`UNet2DConditionModel`].
94
+
95
+ Args:
96
+ sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
97
+ The hidden states output conditioned on `encoder_hidden_states` input. Output of last layer of model.
98
+ """
99
+
100
+ sample: torch.FloatTensor = None
101
+
102
+
103
+ class ResidualBlock(nn.Module):
104
+ def __init__(self, dim):
105
+ super(ResidualBlock, self).__init__()
106
+ self.linear1 = nn.Linear(dim, dim)
107
+ self.activation = nn.SiLU()
108
+ self.linear2 = nn.Linear(dim, dim)
109
+
110
+ def forward(self, x):
111
+ identity = x
112
+ out = self.linear1(x)
113
+ out = self.activation(out)
114
+ out = self.linear2(out)
115
+ out += identity
116
+ out = self.activation(out)
117
+ return out
118
+
119
+ class ResidualLiner(nn.Module):
120
+ def __init__(self, in_features, out_features, dim, act=None, num_block=1):
121
+ super(ResidualLiner, self).__init__()
122
+ self.linear_in = nn.Sequential(nn.Linear(in_features, dim), nn.SiLU())
123
+
124
+ blocks = nn.ModuleList()
125
+ for _ in range(num_block):
126
+ blocks.append(ResidualBlock(dim))
127
+ self.blocks = blocks
128
+
129
+ self.linear_out = nn.Linear(dim, out_features)
130
+ self.act = act
131
+
132
+ def forward(self, x):
133
+ out = self.linear_in(x)
134
+ for block in self.blocks:
135
+ out = block(out)
136
+ out = self.linear_out(out)
137
+ if self.act is not None:
138
+ out = self.act(out)
139
+ return out
140
+
141
+ class BasicConvBlock(nn.Module):
142
+ def __init__(self, in_channels, out_channels, stride=1):
143
+ super(BasicConvBlock, self).__init__()
144
+ self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
145
+ self.norm1 = nn.GroupNorm(num_groups=8, num_channels=out_channels, affine=True)
146
+ self.act = nn.SiLU()
147
+ self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
148
+ self.norm2 = nn.GroupNorm(num_groups=8, num_channels=out_channels, affine=True)
149
+ self.downsample = nn.Sequential()
150
+ if stride != 1 or in_channels != out_channels:
151
+ self.downsample = nn.Sequential(
152
+ nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
153
+ nn.GroupNorm(num_groups=8, num_channels=out_channels, affine=True)
154
+ )
155
+
156
+ def forward(self, x):
157
+ identity = x
158
+ out = self.conv1(x)
159
+ out = self.norm1(out)
160
+ out = self.act(out)
161
+ out = self.conv2(out)
162
+ out = self.norm2(out)
163
+ out += self.downsample(identity)
164
+ out = self.act(out)
165
+ return out
166
+
167
+ class UNetMV2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
168
+ r"""
169
+ A conditional 2D UNet model that takes a noisy sample, conditional state, and a timestep and returns a sample
170
+ shaped output.
171
+
172
+ This model inherits from [`ModelMixin`]. Check the superclass documentation for its generic methods implemented
173
+ for all models (such as downloading or saving).
174
+
175
+ Parameters:
176
+ sample_size (`int` or `Tuple[int, int]`, *optional*, defaults to `None`):
177
+ Height and width of input/output sample.
178
+ in_channels (`int`, *optional*, defaults to 4): Number of channels in the input sample.
179
+ out_channels (`int`, *optional*, defaults to 4): Number of channels in the output.
180
+ center_input_sample (`bool`, *optional*, defaults to `False`): Whether to center the input sample.
181
+ flip_sin_to_cos (`bool`, *optional*, defaults to `False`):
182
+ Whether to flip the sin to cos in the time embedding.
183
+ freq_shift (`int`, *optional*, defaults to 0): The frequency shift to apply to the time embedding.
184
+ down_block_types (`Tuple[str]`, *optional*, defaults to `("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D")`):
185
+ The tuple of downsample blocks to use.
186
+ mid_block_type (`str`, *optional*, defaults to `"UNetMidBlock2DCrossAttn"`):
187
+ Block type for middle of UNet, it can be either `UNetMidBlock2DCrossAttn` or
188
+ `UNetMidBlock2DSimpleCrossAttn`. If `None`, the mid block layer is skipped.
189
+ up_block_types (`Tuple[str]`, *optional*, defaults to `("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D")`):
190
+ The tuple of upsample blocks to use.
191
+ only_cross_attention(`bool` or `Tuple[bool]`, *optional*, default to `False`):
192
+ Whether to include self-attention in the basic transformer blocks, see
193
+ [`~models.attention.BasicTransformerBlock`].
194
+ block_out_channels (`Tuple[int]`, *optional*, defaults to `(320, 640, 1280, 1280)`):
195
+ The tuple of output channels for each block.
196
+ layers_per_block (`int`, *optional*, defaults to 2): The number of layers per block.
197
+ downsample_padding (`int`, *optional*, defaults to 1): The padding to use for the downsampling convolution.
198
+ mid_block_scale_factor (`float`, *optional*, defaults to 1.0): The scale factor to use for the mid block.
199
+ act_fn (`str`, *optional*, defaults to `"silu"`): The activation function to use.
200
+ norm_num_groups (`int`, *optional*, defaults to 32): The number of groups to use for the normalization.
201
+ If `None`, normalization and activation layers are skipped in post-processing.
202
+ norm_eps (`float`, *optional*, defaults to 1e-5): The epsilon to use for the normalization.
203
+ cross_attention_dim (`int` or `Tuple[int]`, *optional*, defaults to 1280):
204
+ The dimension of the cross attention features.
205
+ transformer_layers_per_block (`int` or `Tuple[int]`, *optional*, defaults to 1):
206
+ The number of transformer blocks of type [`~models.attention.BasicTransformerBlock`]. Only relevant for
207
+ [`~models.unet_2d_blocks.CrossAttnDownBlock2D`], [`~models.unet_2d_blocks.CrossAttnUpBlock2D`],
208
+ [`~models.unet_2d_blocks.UNetMidBlock2DCrossAttn`].
209
+ encoder_hid_dim (`int`, *optional*, defaults to None):
210
+ If `encoder_hid_dim_type` is defined, `encoder_hidden_states` will be projected from `encoder_hid_dim`
211
+ dimension to `cross_attention_dim`.
212
+ encoder_hid_dim_type (`str`, *optional*, defaults to `None`):
213
+ If given, the `encoder_hidden_states` and potentially other embeddings are down-projected to text
214
+ embeddings of dimension `cross_attention_dim` according to `encoder_hid_dim_type`.
215
+ attention_head_dim (`int`, *optional*, defaults to 8): The dimension of the attention heads.
216
+ num_attention_heads (`int`, *optional*):
217
+ The number of attention heads. If not defined, defaults to `attention_head_dim`
218
+ resnet_time_scale_shift (`str`, *optional*, defaults to `"default"`): Time scale shift config
219
+ for ResNet blocks (see [`~models.resnet.ResnetBlock2D`]). Choose from `default` or `scale_shift`.
220
+ class_embed_type (`str`, *optional*, defaults to `None`):
221
+ The type of class embedding to use which is ultimately summed with the time embeddings. Choose from `None`,
222
+ `"timestep"`, `"identity"`, `"projection"`, or `"simple_projection"`.
223
+ addition_embed_type (`str`, *optional*, defaults to `None`):
224
+ Configures an optional embedding which will be summed with the time embeddings. Choose from `None` or
225
+ "text". "text" will use the `TextTimeEmbedding` layer.
226
+ addition_time_embed_dim: (`int`, *optional*, defaults to `None`):
227
+ Dimension for the timestep embeddings.
228
+ num_class_embeds (`int`, *optional*, defaults to `None`):
229
+ Input dimension of the learnable embedding matrix to be projected to `time_embed_dim`, when performing
230
+ class conditioning with `class_embed_type` equal to `None`.
231
+ time_embedding_type (`str`, *optional*, defaults to `positional`):
232
+ The type of position embedding to use for timesteps. Choose from `positional` or `fourier`.
233
+ time_embedding_dim (`int`, *optional*, defaults to `None`):
234
+ An optional override for the dimension of the projected time embedding.
235
+ time_embedding_act_fn (`str`, *optional*, defaults to `None`):
236
+ Optional activation function to use only once on the time embeddings before they are passed to the rest of
237
+ the UNet. Choose from `silu`, `mish`, `gelu`, and `swish`.
238
+ timestep_post_act (`str`, *optional*, defaults to `None`):
239
+ The second activation function to use in timestep embedding. Choose from `silu`, `mish` and `gelu`.
240
+ time_cond_proj_dim (`int`, *optional*, defaults to `None`):
241
+ The dimension of `cond_proj` layer in the timestep embedding.
242
+ conv_in_kernel (`int`, *optional*, default to `3`): The kernel size of `conv_in` layer.
243
+ conv_out_kernel (`int`, *optional*, default to `3`): The kernel size of `conv_out` layer.
244
+ projection_class_embeddings_input_dim (`int`, *optional*): The dimension of the `class_labels` input when
245
+ `class_embed_type="projection"`. Required when `class_embed_type="projection"`.
246
+ class_embeddings_concat (`bool`, *optional*, defaults to `False`): Whether to concatenate the time
247
+ embeddings with the class embeddings.
248
+ mid_block_only_cross_attention (`bool`, *optional*, defaults to `None`):
249
+ Whether to use cross attention with the mid block when using the `UNetMidBlock2DSimpleCrossAttn`. If
250
+ `only_cross_attention` is given as a single boolean and `mid_block_only_cross_attention` is `None`, the
251
+ `only_cross_attention` value is used as the value for `mid_block_only_cross_attention`. Default to `False`
252
+ otherwise.
253
+ """
254
+
255
+ _supports_gradient_checkpointing = True
256
+
257
+ @register_to_config
258
+ def __init__(
259
+ self,
260
+ sample_size: Optional[int] = None,
261
+ in_channels: int = 4,
262
+ out_channels: int = 4,
263
+ center_input_sample: bool = False,
264
+ flip_sin_to_cos: bool = True,
265
+ freq_shift: int = 0,
266
+ down_block_types: Tuple[str] = (
267
+ "CrossAttnDownBlockMV2D",
268
+ "CrossAttnDownBlockMV2D",
269
+ "CrossAttnDownBlockMV2D",
270
+ "DownBlock2D",
271
+ ),
272
+ mid_block_type: Optional[str] = "UNetMidBlockMV2DCrossAttn",
273
+ up_block_types: Tuple[str] = ("UpBlock2D", "CrossAttnUpBlockMV2D", "CrossAttnUpBlockMV2D", "CrossAttnUpBlockMV2D"),
274
+ only_cross_attention: Union[bool, Tuple[bool]] = False,
275
+ block_out_channels: Tuple[int] = (320, 640, 1280, 1280),
276
+ layers_per_block: Union[int, Tuple[int]] = 2,
277
+ downsample_padding: int = 1,
278
+ mid_block_scale_factor: float = 1,
279
+ act_fn: str = "silu",
280
+ norm_num_groups: Optional[int] = 32,
281
+ norm_eps: float = 1e-5,
282
+ cross_attention_dim: Union[int, Tuple[int]] = 1280,
283
+ transformer_layers_per_block: Union[int, Tuple[int]] = 1,
284
+ encoder_hid_dim: Optional[int] = None,
285
+ encoder_hid_dim_type: Optional[str] = None,
286
+ attention_head_dim: Union[int, Tuple[int]] = 8,
287
+ num_attention_heads: Optional[Union[int, Tuple[int]]] = None,
288
+ dual_cross_attention: bool = False,
289
+ use_linear_projection: bool = False,
290
+ class_embed_type: Optional[str] = None,
291
+ addition_embed_type: Optional[str] = None,
292
+ addition_time_embed_dim: Optional[int] = None,
293
+ num_class_embeds: Optional[int] = None,
294
+ upcast_attention: bool = False,
295
+ resnet_time_scale_shift: str = "default",
296
+ resnet_skip_time_act: bool = False,
297
+ resnet_out_scale_factor: int = 1.0,
298
+ time_embedding_type: str = "positional",
299
+ time_embedding_dim: Optional[int] = None,
300
+ time_embedding_act_fn: Optional[str] = None,
301
+ timestep_post_act: Optional[str] = None,
302
+ time_cond_proj_dim: Optional[int] = None,
303
+ conv_in_kernel: int = 3,
304
+ conv_out_kernel: int = 3,
305
+ projection_class_embeddings_input_dim: Optional[int] = None,
306
+ projection_camera_embeddings_input_dim: Optional[int] = None,
307
+ class_embeddings_concat: bool = False,
308
+ mid_block_only_cross_attention: Optional[bool] = None,
309
+ cross_attention_norm: Optional[str] = None,
310
+ addition_embed_type_num_heads=64,
311
+ num_views: int = 1,
312
+ cd_attention_last: bool = False,
313
+ cd_attention_mid: bool = False,
314
+ multiview_attention: bool = True,
315
+ sparse_mv_attention: bool = False,
316
+ selfattn_block: str = "custom",
317
+ mvcd_attention: bool = False,
318
+ regress_elevation: bool = False,
319
+ regress_focal_length: bool = False,
320
+ num_regress_blocks: int = 4,
321
+ use_dino: bool = False,
322
+ addition_downsample: bool = False,
323
+ addition_channels: Optional[Tuple[int]] = (1280, 1280, 1280),
324
+ ):
325
+ super().__init__()
326
+
327
+ self.sample_size = sample_size
328
+ self.num_views = num_views
329
+ self.mvcd_attention = mvcd_attention
330
+ if num_attention_heads is not None:
331
+ raise ValueError(
332
+ "At the moment it is not possible to define the number of attention heads via `num_attention_heads` because of a naming issue as described in https://github.com/huggingface/diffusers/issues/2011#issuecomment-1547958131. Passing `num_attention_heads` will only be supported in diffusers v0.19."
333
+ )
334
+
335
+ # If `num_attention_heads` is not defined (which is the case for most models)
336
+ # it will default to `attention_head_dim`. This looks weird upon first reading it and it is.
337
+ # The reason for this behavior is to correct for incorrectly named variables that were introduced
338
+ # when this library was created. The incorrect naming was only discovered much later in https://github.com/huggingface/diffusers/issues/2011#issuecomment-1547958131
339
+ # Changing `attention_head_dim` to `num_attention_heads` for 40,000+ configurations is too backwards breaking
340
+ # which is why we correct for the naming here.
341
+ num_attention_heads = num_attention_heads or attention_head_dim
342
+
343
+ # Check inputs
344
+ if len(down_block_types) != len(up_block_types):
345
+ raise ValueError(
346
+ f"Must provide the same number of `down_block_types` as `up_block_types`. `down_block_types`: {down_block_types}. `up_block_types`: {up_block_types}."
347
+ )
348
+
349
+ if len(block_out_channels) != len(down_block_types):
350
+ raise ValueError(
351
+ f"Must provide the same number of `block_out_channels` as `down_block_types`. `block_out_channels`: {block_out_channels}. `down_block_types`: {down_block_types}."
352
+ )
353
+
354
+ if not isinstance(only_cross_attention, bool) and len(only_cross_attention) != len(down_block_types):
355
+ raise ValueError(
356
+ f"Must provide the same number of `only_cross_attention` as `down_block_types`. `only_cross_attention`: {only_cross_attention}. `down_block_types`: {down_block_types}."
357
+ )
358
+
359
+ if not isinstance(num_attention_heads, int) and len(num_attention_heads) != len(down_block_types):
360
+ raise ValueError(
361
+ f"Must provide the same number of `num_attention_heads` as `down_block_types`. `num_attention_heads`: {num_attention_heads}. `down_block_types`: {down_block_types}."
362
+ )
363
+
364
+ if not isinstance(attention_head_dim, int) and len(attention_head_dim) != len(down_block_types):
365
+ raise ValueError(
366
+ f"Must provide the same number of `attention_head_dim` as `down_block_types`. `attention_head_dim`: {attention_head_dim}. `down_block_types`: {down_block_types}."
367
+ )
368
+
369
+ if isinstance(cross_attention_dim, list) and len(cross_attention_dim) != len(down_block_types):
370
+ raise ValueError(
371
+ f"Must provide the same number of `cross_attention_dim` as `down_block_types`. `cross_attention_dim`: {cross_attention_dim}. `down_block_types`: {down_block_types}."
372
+ )
373
+
374
+ if not isinstance(layers_per_block, int) and len(layers_per_block) != len(down_block_types):
375
+ raise ValueError(
376
+ f"Must provide the same number of `layers_per_block` as `down_block_types`. `layers_per_block`: {layers_per_block}. `down_block_types`: {down_block_types}."
377
+ )
378
+
379
+ # input
380
+ conv_in_padding = (conv_in_kernel - 1) // 2
381
+ self.conv_in = nn.Conv2d(
382
+ in_channels, block_out_channels[0], kernel_size=conv_in_kernel, padding=conv_in_padding
383
+ )
384
+
385
+ # time
386
+ if time_embedding_type == "fourier":
387
+ time_embed_dim = time_embedding_dim or block_out_channels[0] * 2
388
+ if time_embed_dim % 2 != 0:
389
+ raise ValueError(f"`time_embed_dim` should be divisible by 2, but is {time_embed_dim}.")
390
+ self.time_proj = GaussianFourierProjection(
391
+ time_embed_dim // 2, set_W_to_weight=False, log=False, flip_sin_to_cos=flip_sin_to_cos
392
+ )
393
+ timestep_input_dim = time_embed_dim
394
+ elif time_embedding_type == "positional":
395
+ time_embed_dim = time_embedding_dim or block_out_channels[0] * 4
396
+
397
+ self.time_proj = Timesteps(block_out_channels[0], flip_sin_to_cos, freq_shift)
398
+ timestep_input_dim = block_out_channels[0]
399
+ else:
400
+ raise ValueError(
401
+ f"{time_embedding_type} does not exist. Please make sure to use one of `fourier` or `positional`."
402
+ )
403
+
404
+ self.time_embedding = TimestepEmbedding(
405
+ timestep_input_dim,
406
+ time_embed_dim,
407
+ act_fn=act_fn,
408
+ post_act_fn=timestep_post_act,
409
+ cond_proj_dim=time_cond_proj_dim,
410
+ )
411
+
412
+ if encoder_hid_dim_type is None and encoder_hid_dim is not None:
413
+ encoder_hid_dim_type = "text_proj"
414
+ self.register_to_config(encoder_hid_dim_type=encoder_hid_dim_type)
415
+ logger.info("encoder_hid_dim_type defaults to 'text_proj' as `encoder_hid_dim` is defined.")
416
+
417
+ if encoder_hid_dim is None and encoder_hid_dim_type is not None:
418
+ raise ValueError(
419
+ f"`encoder_hid_dim` has to be defined when `encoder_hid_dim_type` is set to {encoder_hid_dim_type}."
420
+ )
421
+
422
+ if encoder_hid_dim_type == "text_proj":
423
+ self.encoder_hid_proj = nn.Linear(encoder_hid_dim, cross_attention_dim)
424
+ elif encoder_hid_dim_type == "text_image_proj":
425
+ # image_embed_dim DOESN'T have to be `cross_attention_dim`. To not clutter the __init__ too much
426
+ # they are set to `cross_attention_dim` here as this is exactly the required dimension for the currently only use
427
+ # case when `addition_embed_type == "text_image_proj"` (Kandinsky 2.1)
428
+ self.encoder_hid_proj = TextImageProjection(
429
+ text_embed_dim=encoder_hid_dim,
430
+ image_embed_dim=cross_attention_dim,
431
+ cross_attention_dim=cross_attention_dim,
432
+ )
433
+ elif encoder_hid_dim_type == "image_proj":
434
+ # Kandinsky 2.2
435
+ self.encoder_hid_proj = ImageProjection(
436
+ image_embed_dim=encoder_hid_dim,
437
+ cross_attention_dim=cross_attention_dim,
438
+ )
439
+ elif encoder_hid_dim_type is not None:
440
+ raise ValueError(
441
+ f"encoder_hid_dim_type: {encoder_hid_dim_type} must be None, 'text_proj' or 'text_image_proj'."
442
+ )
443
+ else:
444
+ self.encoder_hid_proj = None
445
+
446
+ # class embedding
447
+ if class_embed_type is None and num_class_embeds is not None:
448
+ self.class_embedding = nn.Embedding(num_class_embeds, time_embed_dim)
449
+ elif class_embed_type == "timestep":
450
+ self.class_embedding = TimestepEmbedding(timestep_input_dim, time_embed_dim, act_fn=act_fn)
451
+ elif class_embed_type == "identity":
452
+ self.class_embedding = nn.Identity(time_embed_dim, time_embed_dim)
453
+ elif class_embed_type == "projection":
454
+ if projection_class_embeddings_input_dim is None:
455
+ raise ValueError(
456
+ "`class_embed_type`: 'projection' requires `projection_class_embeddings_input_dim` be set"
457
+ )
458
+ # The projection `class_embed_type` is the same as the timestep `class_embed_type` except
459
+ # 1. the `class_labels` inputs are not first converted to sinusoidal embeddings
460
+ # 2. it projects from an arbitrary input dimension.
461
+ #
462
+ # Note that `TimestepEmbedding` is quite general, being mainly linear layers and activations.
463
+ # When used for embedding actual timesteps, the timesteps are first converted to sinusoidal embeddings.
464
+ # As a result, `TimestepEmbedding` can be passed arbitrary vectors.
465
+ self.class_embedding = TimestepEmbedding(projection_class_embeddings_input_dim, time_embed_dim)
466
+ elif class_embed_type == "simple_projection":
467
+ if projection_class_embeddings_input_dim is None:
468
+ raise ValueError(
469
+ "`class_embed_type`: 'simple_projection' requires `projection_class_embeddings_input_dim` be set"
470
+ )
471
+ self.class_embedding = nn.Linear(projection_class_embeddings_input_dim, time_embed_dim)
472
+ else:
473
+ self.class_embedding = None
474
+
475
+ if addition_embed_type == "text":
476
+ if encoder_hid_dim is not None:
477
+ text_time_embedding_from_dim = encoder_hid_dim
478
+ else:
479
+ text_time_embedding_from_dim = cross_attention_dim
480
+
481
+ self.add_embedding = TextTimeEmbedding(
482
+ text_time_embedding_from_dim, time_embed_dim, num_heads=addition_embed_type_num_heads
483
+ )
484
+ elif addition_embed_type == "text_image":
485
+ # text_embed_dim and image_embed_dim DON'T have to be `cross_attention_dim`. To not clutter the __init__ too much
486
+ # they are set to `cross_attention_dim` here as this is exactly the required dimension for the currently only use
487
+ # case when `addition_embed_type == "text_image"` (Kandinsky 2.1)
488
+ self.add_embedding = TextImageTimeEmbedding(
489
+ text_embed_dim=cross_attention_dim, image_embed_dim=cross_attention_dim, time_embed_dim=time_embed_dim
490
+ )
491
+ elif addition_embed_type == "text_time":
492
+ self.add_time_proj = Timesteps(addition_time_embed_dim, flip_sin_to_cos, freq_shift)
493
+ self.add_embedding = TimestepEmbedding(projection_class_embeddings_input_dim, time_embed_dim)
494
+ elif addition_embed_type == "image":
495
+ # Kandinsky 2.2
496
+ self.add_embedding = ImageTimeEmbedding(image_embed_dim=encoder_hid_dim, time_embed_dim=time_embed_dim)
497
+ elif addition_embed_type == "image_hint":
498
+ # Kandinsky 2.2 ControlNet
499
+ self.add_embedding = ImageHintTimeEmbedding(image_embed_dim=encoder_hid_dim, time_embed_dim=time_embed_dim)
500
+ elif addition_embed_type is not None:
501
+ raise ValueError(f"addition_embed_type: {addition_embed_type} must be None, 'text' or 'text_image'.")
502
+
503
+ if time_embedding_act_fn is None:
504
+ self.time_embed_act = None
505
+ else:
506
+ self.time_embed_act = get_activation(time_embedding_act_fn)
507
+
508
+ self.down_blocks = nn.ModuleList([])
509
+ self.up_blocks = nn.ModuleList([])
510
+
511
+ if isinstance(only_cross_attention, bool):
512
+ if mid_block_only_cross_attention is None:
513
+ mid_block_only_cross_attention = only_cross_attention
514
+
515
+ only_cross_attention = [only_cross_attention] * len(down_block_types)
516
+
517
+ if mid_block_only_cross_attention is None:
518
+ mid_block_only_cross_attention = False
519
+
520
+ if isinstance(num_attention_heads, int):
521
+ num_attention_heads = (num_attention_heads,) * len(down_block_types)
522
+
523
+ if isinstance(attention_head_dim, int):
524
+ attention_head_dim = (attention_head_dim,) * len(down_block_types)
525
+
526
+ if isinstance(cross_attention_dim, int):
527
+ cross_attention_dim = (cross_attention_dim,) * len(down_block_types)
528
+
529
+ if isinstance(layers_per_block, int):
530
+ layers_per_block = [layers_per_block] * len(down_block_types)
531
+
532
+ if isinstance(transformer_layers_per_block, int):
533
+ transformer_layers_per_block = [transformer_layers_per_block] * len(down_block_types)
534
+
535
+ if class_embeddings_concat:
536
+ # The time embeddings are concatenated with the class embeddings. The dimension of the
537
+ # time embeddings passed to the down, middle, and up blocks is twice the dimension of the
538
+ # regular time embeddings
539
+ blocks_time_embed_dim = time_embed_dim * 2
540
+ else:
541
+ blocks_time_embed_dim = time_embed_dim
542
+
543
+ # down
544
+ output_channel = block_out_channels[0]
545
+ for i, down_block_type in enumerate(down_block_types):
546
+ input_channel = output_channel
547
+ output_channel = block_out_channels[i]
548
+ is_final_block = i == len(block_out_channels) - 1
549
+
550
+ down_block = get_down_block(
551
+ down_block_type,
552
+ num_layers=layers_per_block[i],
553
+ transformer_layers_per_block=transformer_layers_per_block[i],
554
+ in_channels=input_channel,
555
+ out_channels=output_channel,
556
+ temb_channels=blocks_time_embed_dim,
557
+ add_downsample=not is_final_block,
558
+ resnet_eps=norm_eps,
559
+ resnet_act_fn=act_fn,
560
+ resnet_groups=norm_num_groups,
561
+ cross_attention_dim=cross_attention_dim[i],
562
+ num_attention_heads=num_attention_heads[i],
563
+ downsample_padding=downsample_padding,
564
+ dual_cross_attention=dual_cross_attention,
565
+ use_linear_projection=use_linear_projection,
566
+ only_cross_attention=only_cross_attention[i],
567
+ upcast_attention=upcast_attention,
568
+ resnet_time_scale_shift=resnet_time_scale_shift,
569
+ resnet_skip_time_act=resnet_skip_time_act,
570
+ resnet_out_scale_factor=resnet_out_scale_factor,
571
+ cross_attention_norm=cross_attention_norm,
572
+ attention_head_dim=attention_head_dim[i] if attention_head_dim[i] is not None else output_channel,
573
+ num_views=num_views,
574
+ cd_attention_last=cd_attention_last,
575
+ cd_attention_mid=cd_attention_mid,
576
+ multiview_attention=multiview_attention,
577
+ sparse_mv_attention=sparse_mv_attention,
578
+ selfattn_block=selfattn_block,
579
+ mvcd_attention=mvcd_attention,
580
+ use_dino=use_dino
581
+ )
582
+ self.down_blocks.append(down_block)
583
+
584
+ # mid
585
+ if mid_block_type == "UNetMidBlock2DCrossAttn":
586
+ self.mid_block = UNetMidBlock2DCrossAttn(
587
+ transformer_layers_per_block=transformer_layers_per_block[-1],
588
+ in_channels=block_out_channels[-1],
589
+ temb_channels=blocks_time_embed_dim,
590
+ resnet_eps=norm_eps,
591
+ resnet_act_fn=act_fn,
592
+ output_scale_factor=mid_block_scale_factor,
593
+ resnet_time_scale_shift=resnet_time_scale_shift,
594
+ cross_attention_dim=cross_attention_dim[-1],
595
+ num_attention_heads=num_attention_heads[-1],
596
+ resnet_groups=norm_num_groups,
597
+ dual_cross_attention=dual_cross_attention,
598
+ use_linear_projection=use_linear_projection,
599
+ upcast_attention=upcast_attention,
600
+ )
601
+ # custom MV2D attention block
602
+ elif mid_block_type == "UNetMidBlockMV2DCrossAttn":
603
+ self.mid_block = UNetMidBlockMV2DCrossAttn(
604
+ transformer_layers_per_block=transformer_layers_per_block[-1],
605
+ in_channels=block_out_channels[-1],
606
+ temb_channels=blocks_time_embed_dim,
607
+ resnet_eps=norm_eps,
608
+ resnet_act_fn=act_fn,
609
+ output_scale_factor=mid_block_scale_factor,
610
+ resnet_time_scale_shift=resnet_time_scale_shift,
611
+ cross_attention_dim=cross_attention_dim[-1],
612
+ num_attention_heads=num_attention_heads[-1],
613
+ resnet_groups=norm_num_groups,
614
+ dual_cross_attention=dual_cross_attention,
615
+ use_linear_projection=use_linear_projection,
616
+ upcast_attention=upcast_attention,
617
+ num_views=num_views,
618
+ cd_attention_last=cd_attention_last,
619
+ cd_attention_mid=cd_attention_mid,
620
+ multiview_attention=multiview_attention,
621
+ sparse_mv_attention=sparse_mv_attention,
622
+ selfattn_block=selfattn_block,
623
+ mvcd_attention=mvcd_attention,
624
+ use_dino=use_dino
625
+ )
626
+ elif mid_block_type == "UNetMidBlock2DSimpleCrossAttn":
627
+ self.mid_block = UNetMidBlock2DSimpleCrossAttn(
628
+ in_channels=block_out_channels[-1],
629
+ temb_channels=blocks_time_embed_dim,
630
+ resnet_eps=norm_eps,
631
+ resnet_act_fn=act_fn,
632
+ output_scale_factor=mid_block_scale_factor,
633
+ cross_attention_dim=cross_attention_dim[-1],
634
+ attention_head_dim=attention_head_dim[-1],
635
+ resnet_groups=norm_num_groups,
636
+ resnet_time_scale_shift=resnet_time_scale_shift,
637
+ skip_time_act=resnet_skip_time_act,
638
+ only_cross_attention=mid_block_only_cross_attention,
639
+ cross_attention_norm=cross_attention_norm,
640
+ )
641
+ elif mid_block_type is None:
642
+ self.mid_block = None
643
+ else:
644
+ raise ValueError(f"unknown mid_block_type : {mid_block_type}")
645
+
646
+ self.addition_downsample = addition_downsample
647
+ if self.addition_downsample:
648
+ inc = block_out_channels[-1]
649
+ self.downsample = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
650
+ self.conv_block = nn.ModuleList()
651
+ self.conv_block.append(BasicConvBlock(inc, addition_channels[0], stride=1))
652
+ for dim_ in addition_channels[1:-1]:
653
+ self.conv_block.append(BasicConvBlock(dim_, dim_, stride=1))
654
+ self.conv_block.append(BasicConvBlock(dim_, inc))
655
+ self.addition_conv_out = nn.Conv2d(inc, inc, kernel_size=1, bias=False)
656
+ nn.init.zeros_(self.addition_conv_out.weight.data)
657
+ self.addition_act_out = nn.SiLU()
658
+ self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
659
+
660
+ self.regress_elevation = regress_elevation
661
+ self.regress_focal_length = regress_focal_length
662
+ if regress_elevation or regress_focal_length:
663
+ self.pool = nn.AdaptiveAvgPool2d((1, 1))
664
+ self.camera_embedding = TimestepEmbedding(projection_camera_embeddings_input_dim, time_embed_dim=time_embed_dim)
665
+
666
+ regress_in_dim = block_out_channels[-1]*2 if mvcd_attention else block_out_channels[-1]
667
+
668
+ if regress_elevation:
669
+ self.elevation_regressor = ResidualLiner(regress_in_dim, 1, 1280, act=None, num_block=num_regress_blocks)
670
+ if regress_focal_length:
671
+ self.focal_regressor = ResidualLiner(regress_in_dim, 1, 1280, act=None, num_block=num_regress_blocks)
672
+ '''
673
+ self.regress_elevation = regress_elevation
674
+ self.regress_focal_length = regress_focal_length
675
+ if regress_elevation and (not regress_focal_length):
676
+ print("Regressing elevation")
677
+ cam_dim = 1
678
+ elif regress_focal_length and (not regress_elevation):
679
+ print("Regressing focal length")
680
+ cam_dim = 6
681
+ elif regress_elevation and regress_focal_length:
682
+ print("Regressing both elevation and focal length")
683
+ cam_dim = 7
684
+ else:
685
+ cam_dim = 0
686
+ assert projection_camera_embeddings_input_dim == 2*cam_dim, "projection_camera_embeddings_input_dim should be 2*cam_dim"
687
+ if regress_elevation or regress_focal_length:
688
+ self.elevation_regressor = nn.ModuleList([
689
+ nn.Linear(block_out_channels[-1], 1280),
690
+ nn.SiLU(),
691
+ nn.Linear(1280, 1280),
692
+ nn.SiLU(),
693
+ nn.Linear(1280, cam_dim)
694
+ ])
695
+ self.pool = nn.AdaptiveAvgPool2d((1, 1))
696
+ self.focal_act = nn.Softmax(dim=-1)
697
+ self.camera_embedding = TimestepEmbedding(projection_camera_embeddings_input_dim, time_embed_dim=time_embed_dim)
698
+ '''
699
+
700
+ # count how many layers upsample the images
701
+ self.num_upsamplers = 0
702
+
703
+ # up
704
+ reversed_block_out_channels = list(reversed(block_out_channels))
705
+ reversed_num_attention_heads = list(reversed(num_attention_heads))
706
+ reversed_layers_per_block = list(reversed(layers_per_block))
707
+ reversed_cross_attention_dim = list(reversed(cross_attention_dim))
708
+ reversed_transformer_layers_per_block = list(reversed(transformer_layers_per_block))
709
+ only_cross_attention = list(reversed(only_cross_attention))
710
+
711
+ output_channel = reversed_block_out_channels[0]
712
+ for i, up_block_type in enumerate(up_block_types):
713
+ is_final_block = i == len(block_out_channels) - 1
714
+
715
+ prev_output_channel = output_channel
716
+ output_channel = reversed_block_out_channels[i]
717
+ input_channel = reversed_block_out_channels[min(i + 1, len(block_out_channels) - 1)]
718
+
719
+ # add upsample block for all BUT final layer
720
+ if not is_final_block:
721
+ add_upsample = True
722
+ self.num_upsamplers += 1
723
+ else:
724
+ add_upsample = False
725
+
726
+ up_block = get_up_block(
727
+ up_block_type,
728
+ num_layers=reversed_layers_per_block[i] + 1,
729
+ transformer_layers_per_block=reversed_transformer_layers_per_block[i],
730
+ in_channels=input_channel,
731
+ out_channels=output_channel,
732
+ prev_output_channel=prev_output_channel,
733
+ temb_channels=blocks_time_embed_dim,
734
+ add_upsample=add_upsample,
735
+ resnet_eps=norm_eps,
736
+ resnet_act_fn=act_fn,
737
+ resnet_groups=norm_num_groups,
738
+ cross_attention_dim=reversed_cross_attention_dim[i],
739
+ num_attention_heads=reversed_num_attention_heads[i],
740
+ dual_cross_attention=dual_cross_attention,
741
+ use_linear_projection=use_linear_projection,
742
+ only_cross_attention=only_cross_attention[i],
743
+ upcast_attention=upcast_attention,
744
+ resnet_time_scale_shift=resnet_time_scale_shift,
745
+ resnet_skip_time_act=resnet_skip_time_act,
746
+ resnet_out_scale_factor=resnet_out_scale_factor,
747
+ cross_attention_norm=cross_attention_norm,
748
+ attention_head_dim=attention_head_dim[i] if attention_head_dim[i] is not None else output_channel,
749
+ num_views=num_views,
750
+ cd_attention_last=cd_attention_last,
751
+ cd_attention_mid=cd_attention_mid,
752
+ multiview_attention=multiview_attention,
753
+ sparse_mv_attention=sparse_mv_attention,
754
+ selfattn_block=selfattn_block,
755
+ mvcd_attention=mvcd_attention,
756
+ use_dino=use_dino
757
+ )
758
+ self.up_blocks.append(up_block)
759
+ prev_output_channel = output_channel
760
+
761
+ # out
762
+ if norm_num_groups is not None:
763
+ self.conv_norm_out = nn.GroupNorm(
764
+ num_channels=block_out_channels[0], num_groups=norm_num_groups, eps=norm_eps
765
+ )
766
+
767
+ self.conv_act = get_activation(act_fn)
768
+
769
+ else:
770
+ self.conv_norm_out = None
771
+ self.conv_act = None
772
+
773
+ conv_out_padding = (conv_out_kernel - 1) // 2
774
+ self.conv_out = nn.Conv2d(
775
+ block_out_channels[0], out_channels, kernel_size=conv_out_kernel, padding=conv_out_padding
776
+ )
777
+
778
+ @property
779
+ def attn_processors(self) -> Dict[str, AttentionProcessor]:
780
+ r"""
781
+ Returns:
782
+ `dict` of attention processors: A dictionary containing all attention processors used in the model,
783
+ indexed by its weight name.
784
+ """
785
+ # set recursively
786
+ processors = {}
787
+
788
+ def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: Dict[str, AttentionProcessor]):
789
+ if hasattr(module, "set_processor"):
790
+ processors[f"{name}.processor"] = module.processor
791
+
792
+ for sub_name, child in module.named_children():
793
+ fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
794
+
795
+ return processors
796
+
797
+ for name, module in self.named_children():
798
+ fn_recursive_add_processors(name, module, processors)
799
+
800
+ return processors
801
+
802
+ def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):
803
+ r"""
804
+ Sets the attention processor to use to compute attention.
805
+
806
+ Parameters:
807
+ processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
808
+ The instantiated processor class or a dictionary of processor classes that will be set as the processor
809
+ for **all** `Attention` layers.
810
+
811
+ If `processor` is a dict, the key needs to define the path to the corresponding cross attention
812
+ processor. This is strongly recommended when setting trainable attention processors.
813
+
814
+ """
815
+ count = len(self.attn_processors.keys())
816
+
817
+ if isinstance(processor, dict) and len(processor) != count:
818
+ raise ValueError(
819
+ f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
820
+ f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
821
+ )
822
+
823
+ def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
824
+ if hasattr(module, "set_processor"):
825
+ if not isinstance(processor, dict):
826
+ module.set_processor(processor)
827
+ else:
828
+ module.set_processor(processor.pop(f"{name}.processor"))
829
+
830
+ for sub_name, child in module.named_children():
831
+ fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
832
+
833
+ for name, module in self.named_children():
834
+ fn_recursive_attn_processor(name, module, processor)
835
+
836
+ def set_default_attn_processor(self):
837
+ """
838
+ Disables custom attention processors and sets the default attention implementation.
839
+ """
840
+ self.set_attn_processor(AttnProcessor())
841
+
842
+ def set_attention_slice(self, slice_size):
843
+ r"""
844
+ Enable sliced attention computation.
845
+
846
+ When this option is enabled, the attention module splits the input tensor in slices to compute attention in
847
+ several steps. This is useful for saving some memory in exchange for a small decrease in speed.
848
+
849
+ Args:
850
+ slice_size (`str` or `int` or `list(int)`, *optional*, defaults to `"auto"`):
851
+ When `"auto"`, input to the attention heads is halved, so attention is computed in two steps. If
852
+ `"max"`, maximum amount of memory is saved by running only one slice at a time. If a number is
853
+ provided, uses as many slices as `attention_head_dim // slice_size`. In this case, `attention_head_dim`
854
+ must be a multiple of `slice_size`.
855
+ """
856
+ sliceable_head_dims = []
857
+
858
+ def fn_recursive_retrieve_sliceable_dims(module: torch.nn.Module):
859
+ if hasattr(module, "set_attention_slice"):
860
+ sliceable_head_dims.append(module.sliceable_head_dim)
861
+
862
+ for child in module.children():
863
+ fn_recursive_retrieve_sliceable_dims(child)
864
+
865
+ # retrieve number of attention layers
866
+ for module in self.children():
867
+ fn_recursive_retrieve_sliceable_dims(module)
868
+
869
+ num_sliceable_layers = len(sliceable_head_dims)
870
+
871
+ if slice_size == "auto":
872
+ # half the attention head size is usually a good trade-off between
873
+ # speed and memory
874
+ slice_size = [dim // 2 for dim in sliceable_head_dims]
875
+ elif slice_size == "max":
876
+ # make smallest slice possible
877
+ slice_size = num_sliceable_layers * [1]
878
+
879
+ slice_size = num_sliceable_layers * [slice_size] if not isinstance(slice_size, list) else slice_size
880
+
881
+ if len(slice_size) != len(sliceable_head_dims):
882
+ raise ValueError(
883
+ f"You have provided {len(slice_size)}, but {self.config} has {len(sliceable_head_dims)} different"
884
+ f" attention layers. Make sure to match `len(slice_size)` to be {len(sliceable_head_dims)}."
885
+ )
886
+
887
+ for i in range(len(slice_size)):
888
+ size = slice_size[i]
889
+ dim = sliceable_head_dims[i]
890
+ if size is not None and size > dim:
891
+ raise ValueError(f"size {size} has to be smaller or equal to {dim}.")
892
+
893
+ # Recursively walk through all the children.
894
+ # Any children which exposes the set_attention_slice method
895
+ # gets the message
896
+ def fn_recursive_set_attention_slice(module: torch.nn.Module, slice_size: List[int]):
897
+ if hasattr(module, "set_attention_slice"):
898
+ module.set_attention_slice(slice_size.pop())
899
+
900
+ for child in module.children():
901
+ fn_recursive_set_attention_slice(child, slice_size)
902
+
903
+ reversed_slice_size = list(reversed(slice_size))
904
+ for module in self.children():
905
+ fn_recursive_set_attention_slice(module, reversed_slice_size)
906
+
907
+ def _set_gradient_checkpointing(self, module, value=False):
908
+ if isinstance(module, (CrossAttnDownBlock2D, CrossAttnDownBlockMV2D, DownBlock2D, CrossAttnUpBlock2D, CrossAttnUpBlockMV2D, UpBlock2D)):
909
+ module.gradient_checkpointing = value
910
+
911
+ def forward(
912
+ self,
913
+ sample: torch.FloatTensor,
914
+ timestep: Union[torch.Tensor, float, int],
915
+ encoder_hidden_states: torch.Tensor,
916
+ class_labels: Optional[torch.Tensor] = None,
917
+ timestep_cond: Optional[torch.Tensor] = None,
918
+ attention_mask: Optional[torch.Tensor] = None,
919
+ cross_attention_kwargs: Optional[Dict[str, Any]] = None,
920
+ added_cond_kwargs: Optional[Dict[str, torch.Tensor]] = None,
921
+ down_block_additional_residuals: Optional[Tuple[torch.Tensor]] = None,
922
+ mid_block_additional_residual: Optional[torch.Tensor] = None,
923
+ encoder_attention_mask: Optional[torch.Tensor] = None,
924
+ dino_feature: Optional[torch.Tensor] = None,
925
+ return_dict: bool = True,
926
+ vis_max_min: bool = False,
927
+ ) -> Union[UNetMV2DConditionOutput, Tuple]:
928
+ r"""
929
+ The [`UNet2DConditionModel`] forward method.
930
+
931
+ Args:
932
+ sample (`torch.FloatTensor`):
933
+ The noisy input tensor with the following shape `(batch, channel, height, width)`.
934
+ timestep (`torch.FloatTensor` or `float` or `int`): The number of timesteps to denoise an input.
935
+ encoder_hidden_states (`torch.FloatTensor`):
936
+ The encoder hidden states with shape `(batch, sequence_length, feature_dim)`.
937
+ encoder_attention_mask (`torch.Tensor`):
938
+ A cross-attention mask of shape `(batch, sequence_length)` is applied to `encoder_hidden_states`. If
939
+ `True` the mask is kept, otherwise if `False` it is discarded. Mask will be converted into a bias,
940
+ which adds large negative values to the attention scores corresponding to "discard" tokens.
941
+ return_dict (`bool`, *optional*, defaults to `True`):
942
+ Whether or not to return a [`~models.unet_2d_condition.UNet2DConditionOutput`] instead of a plain
943
+ tuple.
944
+ cross_attention_kwargs (`dict`, *optional*):
945
+ A kwargs dictionary that if specified is passed along to the [`AttnProcessor`].
946
+ added_cond_kwargs: (`dict`, *optional*):
947
+ A kwargs dictionary containing additional embeddings that if specified are added to the embeddings that
948
+ are passed along to the UNet blocks.
949
+
950
+ Returns:
951
+ [`~models.unet_2d_condition.UNet2DConditionOutput`] or `tuple`:
952
+ If `return_dict` is True, an [`~models.unet_2d_condition.UNet2DConditionOutput`] is returned, otherwise
953
+ a `tuple` is returned where the first element is the sample tensor.
954
+ """
955
+ record_max_min = {}
956
+ # By default samples have to be at least a multiple of the overall upsampling factor.
957
+ # The overall upsampling factor is equal to 2 ** (# num of upsampling layers).
958
+ # However, the upsampling interpolation output size can be forced to fit any upsampling size
959
+ # on the fly if necessary.
960
+ default_overall_up_factor = 2**self.num_upsamplers
961
+
962
+ # upsample size should be forwarded when sample is not a multiple of `default_overall_up_factor`
963
+ forward_upsample_size = False
964
+ upsample_size = None
965
+
966
+ if any(s % default_overall_up_factor != 0 for s in sample.shape[-2:]):
967
+ logger.info("Forward upsample size to force interpolation output size.")
968
+ forward_upsample_size = True
969
+
970
+ # ensure attention_mask is a bias, and give it a singleton query_tokens dimension
971
+ # expects mask of shape:
972
+ # [batch, key_tokens]
973
+ # adds singleton query_tokens dimension:
974
+ # [batch, 1, key_tokens]
975
+ # this helps to broadcast it as a bias over attention scores, which will be in one of the following shapes:
976
+ # [batch, heads, query_tokens, key_tokens] (e.g. torch sdp attn)
977
+ # [batch * heads, query_tokens, key_tokens] (e.g. xformers or classic attn)
978
+ if attention_mask is not None:
979
+ # assume that mask is expressed as:
980
+ # (1 = keep, 0 = discard)
981
+ # convert mask into a bias that can be added to attention scores:
982
+ # (keep = +0, discard = -10000.0)
983
+ attention_mask = (1 - attention_mask.to(sample.dtype)) * -10000.0
984
+ attention_mask = attention_mask.unsqueeze(1)
985
+
986
+ # convert encoder_attention_mask to a bias the same way we do for attention_mask
987
+ if encoder_attention_mask is not None:
988
+ encoder_attention_mask = (1 - encoder_attention_mask.to(sample.dtype)) * -10000.0
989
+ encoder_attention_mask = encoder_attention_mask.unsqueeze(1)
990
+
991
+ # 0. center input if necessary
992
+ if self.config.center_input_sample:
993
+ sample = 2 * sample - 1.0
994
+ if vis_max_min: record_max_min["sample"] = (sample.min().detach().float().cpu().numpy().tolist(), sample.max().detach().float().cpu().numpy().tolist())
995
+ # 1. time
996
+ timesteps = timestep
997
+ if not torch.is_tensor(timesteps):
998
+ # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
999
+ # This would be a good case for the `match` statement (Python 3.10+)
1000
+ is_mps = sample.device.type == "mps"
1001
+ if isinstance(timestep, float):
1002
+ dtype = torch.float32 if is_mps else torch.float64
1003
+ else:
1004
+ dtype = torch.int32 if is_mps else torch.int64
1005
+ timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
1006
+ elif len(timesteps.shape) == 0:
1007
+ timesteps = timesteps[None].to(sample.device)
1008
+
1009
+ # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
1010
+ timesteps = timesteps.expand(sample.shape[0])
1011
+
1012
+ t_emb = self.time_proj(timesteps)
1013
+
1014
+ # `Timesteps` does not contain any weights and will always return f32 tensors
1015
+ # but time_embedding might actually be running in fp16. so we need to cast here.
1016
+ # there might be better ways to encapsulate this.
1017
+ t_emb = t_emb.to(dtype=sample.dtype)
1018
+
1019
+ emb = self.time_embedding(t_emb, timestep_cond)
1020
+ aug_emb = None
1021
+ if vis_max_min: record_max_min["t_emb"] = (t_emb.min().detach().float().cpu().numpy().tolist(), t_emb.max().detach().float().cpu().numpy().tolist())
1022
+ if self.class_embedding is not None:
1023
+ if class_labels is None:
1024
+ raise ValueError("class_labels should be provided when num_class_embeds > 0")
1025
+
1026
+ if self.config.class_embed_type == "timestep":
1027
+ class_labels = self.time_proj(class_labels)
1028
+
1029
+ # `Timesteps` does not contain any weights and will always return f32 tensors
1030
+ # there might be better ways to encapsulate this.
1031
+ class_labels = class_labels.to(dtype=sample.dtype)
1032
+
1033
+ class_emb = self.class_embedding(class_labels).to(dtype=sample.dtype)
1034
+ if vis_max_min: record_max_min["class_emb"] = (class_emb.min().detach().float().cpu().numpy().tolist(), class_emb.max().detach().float().cpu().numpy().tolist())
1035
+ if self.config.class_embeddings_concat:
1036
+ emb = torch.cat([emb, class_emb], dim=-1)
1037
+ else:
1038
+ emb = emb + class_emb
1039
+
1040
+ if self.config.addition_embed_type == "text":
1041
+ aug_emb = self.add_embedding(encoder_hidden_states)
1042
+ elif self.config.addition_embed_type == "text_image":
1043
+ # Kandinsky 2.1 - style
1044
+ if "image_embeds" not in added_cond_kwargs:
1045
+ raise ValueError(
1046
+ f"{self.__class__} has the config param `addition_embed_type` set to 'text_image' which requires the keyword argument `image_embeds` to be passed in `added_cond_kwargs`"
1047
+ )
1048
+
1049
+ image_embs = added_cond_kwargs.get("image_embeds")
1050
+ text_embs = added_cond_kwargs.get("text_embeds", encoder_hidden_states)
1051
+ aug_emb = self.add_embedding(text_embs, image_embs)
1052
+ elif self.config.addition_embed_type == "text_time":
1053
+ # SDXL - style
1054
+ if "text_embeds" not in added_cond_kwargs:
1055
+ raise ValueError(
1056
+ f"{self.__class__} has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `text_embeds` to be passed in `added_cond_kwargs`"
1057
+ )
1058
+ text_embeds = added_cond_kwargs.get("text_embeds")
1059
+ if "time_ids" not in added_cond_kwargs:
1060
+ raise ValueError(
1061
+ f"{self.__class__} has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `time_ids` to be passed in `added_cond_kwargs`"
1062
+ )
1063
+ time_ids = added_cond_kwargs.get("time_ids")
1064
+ time_embeds = self.add_time_proj(time_ids.flatten())
1065
+ time_embeds = time_embeds.reshape((text_embeds.shape[0], -1))
1066
+
1067
+ add_embeds = torch.concat([text_embeds, time_embeds], dim=-1)
1068
+ add_embeds = add_embeds.to(emb.dtype)
1069
+ aug_emb = self.add_embedding(add_embeds)
1070
+ elif self.config.addition_embed_type == "image":
1071
+ # Kandinsky 2.2 - style
1072
+ if "image_embeds" not in added_cond_kwargs:
1073
+ raise ValueError(
1074
+ f"{self.__class__} has the config param `addition_embed_type` set to 'image' which requires the keyword argument `image_embeds` to be passed in `added_cond_kwargs`"
1075
+ )
1076
+ image_embs = added_cond_kwargs.get("image_embeds")
1077
+ aug_emb = self.add_embedding(image_embs)
1078
+ elif self.config.addition_embed_type == "image_hint":
1079
+ # Kandinsky 2.2 - style
1080
+ if "image_embeds" not in added_cond_kwargs or "hint" not in added_cond_kwargs:
1081
+ raise ValueError(
1082
+ f"{self.__class__} has the config param `addition_embed_type` set to 'image_hint' which requires the keyword arguments `image_embeds` and `hint` to be passed in `added_cond_kwargs`"
1083
+ )
1084
+ image_embs = added_cond_kwargs.get("image_embeds")
1085
+ hint = added_cond_kwargs.get("hint")
1086
+ aug_emb, hint = self.add_embedding(image_embs, hint)
1087
+ sample = torch.cat([sample, hint], dim=1)
1088
+
1089
+ emb = emb + aug_emb if aug_emb is not None else emb
1090
+ if aug_emb is not None and vis_max_min: record_max_min["aug_emb"] = (aug_emb.min().detach().float().cpu().numpy().tolist(), aug_emb.max().detach().float().cpu().numpy().tolist())
1091
+ emb_pre_act = emb
1092
+ if self.time_embed_act is not None:
1093
+ emb = self.time_embed_act(emb)
1094
+
1095
+ if self.encoder_hid_proj is not None and self.config.encoder_hid_dim_type == "text_proj":
1096
+ encoder_hidden_states = self.encoder_hid_proj(encoder_hidden_states)
1097
+ elif self.encoder_hid_proj is not None and self.config.encoder_hid_dim_type == "text_image_proj":
1098
+ # Kandinsky 2.1 - style
1099
+ if "image_embeds" not in added_cond_kwargs:
1100
+ raise ValueError(
1101
+ f"{self.__class__} has the config param `encoder_hid_dim_type` set to 'text_image_proj' which requires the keyword argument `image_embeds` to be passed in `added_conditions`"
1102
+ )
1103
+
1104
+ image_embeds = added_cond_kwargs.get("image_embeds")
1105
+ encoder_hidden_states = self.encoder_hid_proj(encoder_hidden_states, image_embeds)
1106
+ elif self.encoder_hid_proj is not None and self.config.encoder_hid_dim_type == "image_proj":
1107
+ # Kandinsky 2.2 - style
1108
+ if "image_embeds" not in added_cond_kwargs:
1109
+ raise ValueError(
1110
+ f"{self.__class__} has the config param `encoder_hid_dim_type` set to 'image_proj' which requires the keyword argument `image_embeds` to be passed in `added_conditions`"
1111
+ )
1112
+ image_embeds = added_cond_kwargs.get("image_embeds")
1113
+ encoder_hidden_states = self.encoder_hid_proj(image_embeds)
1114
+ # 2. pre-process
1115
+ sample = self.conv_in(sample)
1116
+ if vis_max_min: record_max_min["conv_in"] = (sample.min().detach().float().cpu().numpy().tolist(), sample.max().detach().float().cpu().numpy().tolist())
1117
+ # 3. down
1118
+
1119
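+ # ControlNet supplies both mid- and down-block residuals; a T2I-Adapter supplies only down-block residuals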
+ is_controlnet = mid_block_additional_residual is not None and down_block_additional_residuals is not None
1120
+ is_adapter = mid_block_additional_residual is None and down_block_additional_residuals is not None
1121
+
1122
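+ # collect the residual from each down block; these feed the skip connections of the up path below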
+ down_block_res_samples = (sample,)
1123
+ for i, downsample_block in enumerate(self.down_blocks):
1124
+ if hasattr(downsample_block, "has_cross_attention") and downsample_block.has_cross_attention:
1125
+ # For t2i-adapter CrossAttnDownBlock2D
1126
+ additional_residuals = {}
1127
+ if is_adapter and len(down_block_additional_residuals) > 0:
1128
+ additional_residuals["additional_residuals"] = down_block_additional_residuals.pop(0)
1129
+
1130
+ sample, res_samples = downsample_block(
1131
+ hidden_states=sample,
1132
+ temb=emb,
1133
+ encoder_hidden_states=encoder_hidden_states,
1134
+ dino_feature=dino_feature,
1135
+ attention_mask=attention_mask,
1136
+ cross_attention_kwargs=cross_attention_kwargs,
1137
+ encoder_attention_mask=encoder_attention_mask,
1138
+ **additional_residuals,
1139
+ )
1140
+ else:
1141
+ sample, res_samples = downsample_block(hidden_states=sample, temb=emb)
1142
+
1143
+ if is_adapter and len(down_block_additional_residuals) > 0:
1144
+ sample += down_block_additional_residuals.pop(0)
1145
+
1146
+ down_block_res_samples += res_samples
1147
+ if vis_max_min: record_max_min[f"down_block_{i}"] = (sample.min().detach().float().cpu().numpy().tolist(), sample.max().detach().float().cpu().numpy().tolist())
1148
+
1149
+ if is_controlnet:
1150
+ new_down_block_res_samples = ()
1151
+
1152
+ for down_block_res_sample, down_block_additional_residual in zip(
1153
+ down_block_res_samples, down_block_additional_residuals
1154
+ ):
1155
+ down_block_res_sample = down_block_res_sample + down_block_additional_residual
1156
+ new_down_block_res_samples = new_down_block_res_samples + (down_block_res_sample,)
1157
+
1158
+ down_block_res_samples = new_down_block_res_samples
1159
+
1160
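+ # optional global branch: downsample the features, run the extra conv stack, upsample back, and add the result to `sample` after the mid block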
+ if self.addition_downsample:
1161
+ global_sample = sample
1162
+ global_sample = self.downsample(global_sample)
1163
+ for layer in self.conv_block:
1164
+ global_sample = layer(global_sample)
1165
+ global_sample = self.addition_act_out(self.addition_conv_out(global_sample))
1166
+ global_sample = self.upsample(global_sample)
1167
+ if vis_max_min: record_max_min["global_sample"] = (global_sample.min().detach().float().cpu().numpy().tolist(), global_sample.max().detach().float().cpu().numpy().tolist())
1168
+ # 4. mid
1169
+ if self.mid_block is not None:
1170
+ sample = self.mid_block(
1171
+ sample,
1172
+ emb,
1173
+ encoder_hidden_states=encoder_hidden_states,
1174
+ dino_feature=dino_feature,
1175
+ attention_mask=attention_mask,
1176
+ cross_attention_kwargs=cross_attention_kwargs,
1177
+ encoder_attention_mask=encoder_attention_mask,
1178
+ )
1179
+ if vis_max_min: record_max_min["mid_block"] = (sample.min().detach().float().cpu().numpy().tolist(), sample.max().detach().float().cpu().numpy().tolist())
1180
+
1181
+ # 4.1 regress elevation and focal length
1182
+ # predict elevation -> embed -> projection -> add to time emb
1183
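+ # the (detached) mid-block features are pooled and fed to small regressors; per-view predictions are
+ # averaged, encoded as (sin, cos), projected by `camera_embedding`, and added to `emb_pre_act`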
+ if self.regress_elevation or self.regress_focal_length:
1184
+ pool_embeds = self.pool(sample.detach()).squeeze(-1).squeeze(-1) # (2B, C)
1185
+ if self.mvcd_attention:
1186
+ pool_embeds_normal, pool_embeds_color = torch.chunk(pool_embeds, 2, dim=0)
1187
+ pool_embeds = torch.cat([pool_embeds_normal, pool_embeds_color], dim=-1) # (B, 2C)
1188
+ pose_pred = []
1189
+ if self.regress_elevation:
1190
+ ele_pred = self.elevation_regressor(pool_embeds)
1191
+ ele_pred = rearrange(ele_pred, '(b v) c -> b v c', v=self.num_views)
1192
+ ele_pred = torch.mean(ele_pred, dim=1)
1193
+ pose_pred.append(ele_pred) # b, c
1194
+ if vis_max_min: record_max_min["ele_pred"] = (ele_pred.min().detach().float().cpu().numpy().tolist(), ele_pred.max().detach().float().cpu().numpy().tolist())
1195
+
1196
+ if self.regress_focal_length:
1197
+ focal_pred = self.focal_regressor(pool_embeds)
1198
+ focal_pred = rearrange(focal_pred, '(b v) c -> b v c', v=self.num_views)
1199
+ focal_pred = torch.mean(focal_pred, dim=1)
1200
+ pose_pred.append(focal_pred)
1201
+ if vis_max_min: record_max_min["focal_pred"] = (focal_pred.min().detach().float().cpu().numpy().tolist(), focal_pred.max().detach().float().cpu().numpy().tolist())
1202
+ pose_pred = torch.cat(pose_pred, dim=-1)
1203
+ # 'e_de_da_sincos', (B, 2)
1204
+ pose_embeds = torch.cat([
1205
+ torch.sin(pose_pred),
1206
+ torch.cos(pose_pred)
1207
+ ], dim=-1)
1208
+ pose_embeds = self.camera_embedding(pose_embeds)
1209
+ pose_embeds = torch.repeat_interleave(pose_embeds, self.num_views, 0)
1210
+ if vis_max_min: record_max_min["pose_embeds"] = (pose_embeds.min().detach().float().cpu().numpy().tolist(), pose_embeds.max().detach().float().cpu().numpy().tolist())
1211
+ if self.mvcd_attention:
1212
+ pose_embeds = torch.cat([pose_embeds,] * 2, dim=0)
1213
+
1214
+ emb = pose_embeds + emb_pre_act
1215
+ if self.time_embed_act is not None:
1216
+ emb = self.time_embed_act(emb)
1217
+
1218
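+ # an earlier single-regressor variant of the pose prediction is kept below, commented out, for reference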
+ '''
1219
+ if self.regress_elevation or self.regress_focal_length:
1220
+ pose_pred = self.pool(sample.detach()).squeeze(-1).squeeze(-1) # (B, C)
1221
+
1222
+ for linear in self.elevation_regressor:
1223
+ pose_pred = linear(pose_pred)
1224
+
1225
+ pose_pred = torch.cat([
1226
+ pose_pred[:, 0:1],
1227
+ self.focal_act(pose_pred[:, 1:])
1228
+ ], dim=-1)
1229
+ # 'e_de_da_sincos', (B, 2)
1230
+ pose_embeds = torch.cat([
1231
+ torch.sin(pose_pred),
1232
+ torch.cos(pose_pred)
1233
+ ], dim=-1)
1234
+ pose_embeds = self.camera_embedding(pose_embeds)
1235
+ emb = pose_embeds + emb_pre_act
1236
+ if self.time_embed_act is not None:
1237
+ emb = self.time_embed_act(emb)
1238
+ '''
1239
+ if is_controlnet:
1240
+ sample = sample + mid_block_additional_residual
1241
+
1242
+ if self.addition_downsample:
1243
+ sample = sample + global_sample
1244
+
1245
+ # 5. up
1246
+ for i, upsample_block in enumerate(self.up_blocks):
1247
+ is_final_block = i == len(self.up_blocks) - 1
1248
+
1249
+ res_samples = down_block_res_samples[-len(upsample_block.resnets) :]
1250
+ down_block_res_samples = down_block_res_samples[: -len(upsample_block.resnets)]
1251
+
1252
+ # if we have not reached the final block and need to forward the
1253
+ # upsample size, we do it here
1254
+ if not is_final_block and forward_upsample_size:
1255
+ upsample_size = down_block_res_samples[-1].shape[2:]
1256
+
1257
+ if hasattr(upsample_block, "has_cross_attention") and upsample_block.has_cross_attention:
1258
+ sample = upsample_block(
1259
+ hidden_states=sample,
1260
+ temb=emb,
1261
+ res_hidden_states_tuple=res_samples,
1262
+ encoder_hidden_states=encoder_hidden_states,
1263
+ dino_feature=dino_feature,
1264
+ cross_attention_kwargs=cross_attention_kwargs,
1265
+ upsample_size=upsample_size,
1266
+ attention_mask=attention_mask,
1267
+ encoder_attention_mask=encoder_attention_mask,
1268
+ )
1269
+ else:
1270
+ sample = upsample_block(
1271
+ hidden_states=sample, temb=emb, res_hidden_states_tuple=res_samples, upsample_size=upsample_size
1272
+ )
1273
+ if vis_max_min: record_max_min[f"upsample_block_{i}"] = (torch.abs(sample.min().detach().float()).cpu().numpy().tolist(), sample.max().detach().float().cpu().numpy().tolist())
1274
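+ # keep the decoder features before post-processing; they are returned alongside the final sample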
+ up_s = sample
1275
+ if torch.isnan(sample).any() or torch.isinf(sample).any():
1276
+ print("NAN in sample, stop training.")
1277
+ exit()
1278
+ # 6. post-process
1279
+ if self.conv_norm_out:
1280
+ sample = self.conv_norm_out(sample)
1281
+ if vis_max_min: record_max_min[f"conv_norm_out"] = (sample.min().detach().float().cpu().numpy().tolist(), sample.max().detach().float().cpu().numpy().tolist())
1282
+ sample = self.conv_act(sample)
1283
+ sample = self.conv_out(sample)
1284
+ if vis_max_min: record_max_min[f"conv_out"] = (sample.min().detach().float().cpu().numpy().tolist(), sample.max().detach().float().cpu().numpy().tolist())
1285
+ if not return_dict:
1286
+ return (sample,)
1287
+ # return (sample, pose_pred)
1288
+ if self.regress_elevation or self.regress_focal_length:
1289
+ return UNetMV2DConditionOutput(sample=sample), pose_pred, record_max_min, up_s
1290
+ else:
1291
+ return UNetMV2DConditionOutput(sample=sample), up_s
1292
+ # return UNetMV2DConditionOutput(sample=sample), up_s, record_max_min
1293
+
1294
+ @classmethod
1295
+ def from_pretrained_2d(
1296
+ cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
1297
+ camera_embedding_type: str, num_views: int, sample_size: int,
1298
+ zero_init_conv_in: bool = True, zero_init_camera_projection: bool = False,
1299
+ projection_camera_embeddings_input_dim: int=2,
1300
+ cd_attention_last: bool = False, num_regress_blocks: int = 4,
1301
+ cd_attention_mid: bool = False, multiview_attention: bool = True,
1302
+ sparse_mv_attention: bool = False, selfattn_block: str = 'custom', mvcd_attention: bool = False,
1303
+ in_channels: int = 8, out_channels: int = 4, unclip: bool = False, regress_elevation: bool = False, regress_focal_length: bool = False,
1304
+ init_mvattn_with_selfattn: bool= False, use_dino: bool = False, addition_downsample: bool = False, use_face_adapter: bool=True,
1305
+ **kwargs
1306
+ ):
1307
+ r"""
1308
+ Instantiate a pretrained PyTorch model from a pretrained model configuration.
1309
+
1310
+ The model is set in evaluation mode - `model.eval()` - by default, and dropout modules are deactivated. To
1311
+ train the model, set it back in training mode with `model.train()`.
1312
+
1313
+ Parameters:
1314
+ pretrained_model_name_or_path (`str` or `os.PathLike`, *optional*):
1315
+ Can be either:
1316
+
1317
+ - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
1318
+ the Hub.
1319
+ - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
1320
+ with [`~ModelMixin.save_pretrained`].
1321
+
1322
+ cache_dir (`Union[str, os.PathLike]`, *optional*):
1323
+ Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
1324
+ is not used.
1325
+ torch_dtype (`str` or `torch.dtype`, *optional*):
1326
+ Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the
1327
+ dtype is automatically derived from the model's weights.
1328
+ force_download (`bool`, *optional*, defaults to `False`):
1329
+ Whether or not to force the (re-)download of the model weights and configuration files, overriding the
1330
+ cached versions if they exist.
1331
+ resume_download (`bool`, *optional*, defaults to `False`):
1332
+ Whether or not to resume downloading the model weights and configuration files. If set to `False`, any
1333
+ incompletely downloaded files are deleted.
1334
+ proxies (`Dict[str, str]`, *optional*):
1335
+ A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
1336
+ 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
1337
+ output_loading_info (`bool`, *optional*, defaults to `False`):
1338
+ Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages.
1339
+ local_files_only (`bool`, *optional*, defaults to `False`):
1340
+ Whether to only load local model weights and configuration files or not. If set to `True`, the model
1341
+ won't be downloaded from the Hub.
1342
+ use_auth_token (`str` or *bool*, *optional*):
1343
+ The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
1344
+ `diffusers-cli login` (stored in `~/.huggingface`) is used.
1345
+ revision (`str`, *optional*, defaults to `"main"`):
1346
+ The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
1347
+ allowed by Git.
1348
+ from_flax (`bool`, *optional*, defaults to `False`):
1349
+ Load the model weights from a Flax checkpoint save file.
1350
+ subfolder (`str`, *optional*, defaults to `""`):
1351
+ The subfolder location of a model file within a larger model repository on the Hub or locally.
1352
+ mirror (`str`, *optional*):
1353
+ Mirror source to resolve accessibility issues if you're downloading a model in China. We do not
1354
+ guarantee the timeliness or safety of the source, and you should refer to the mirror site for more
1355
+ information.
1356
+ device_map (`str` or `Dict[str, Union[int, str, torch.device]]`, *optional*):
1357
+ A map that specifies where each submodule should go. It doesn't need to be defined for each
1358
+ parameter/buffer name; once a given module name is inside, every submodule of it will be sent to the
1359
+ same device.
1360
+
1361
+ Set `device_map="auto"` to have 🤗 Accelerate automatically compute the most optimized `device_map`. For
1362
+ more information about each option see [designing a device
1363
+ map](https://hf.co/docs/accelerate/main/en/usage_guides/big_modeling#designing-a-device-map).
1364
+ max_memory (`Dict`, *optional*):
1365
+ A dictionary device identifier for the maximum memory. Will default to the maximum memory available for
1366
+ each GPU and the available CPU RAM if unset.
1367
+ offload_folder (`str` or `os.PathLike`, *optional*):
1368
+ The path to offload weights if `device_map` contains the value `"disk"`.
1369
+ offload_state_dict (`bool`, *optional*):
1370
+ If `True`, temporarily offloads the CPU state dict to the hard drive to avoid running out of CPU RAM if
1371
+ the weight of the CPU state dict + the biggest shard of the checkpoint does not fit. Defaults to `True`
1372
+ when there is some disk offload.
1373
+ low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
1374
+ Speed up model loading by only loading the pretrained weights and not initializing the weights. This also
1375
+ tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
1376
+ Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
1377
+ argument to `True` will raise an error.
1378
+ variant (`str`, *optional*):
1379
+ Load weights from a specified `variant` filename such as `"fp16"` or `"ema"`. This is ignored when
1380
+ loading `from_flax`.
1381
+ use_safetensors (`bool`, *optional*, defaults to `None`):
1382
+ If set to `None`, the `safetensors` weights are downloaded if they're available **and** if the
1383
+ `safetensors` library is installed. If set to `True`, the model is forcibly loaded from `safetensors`
1384
+ weights. If set to `False`, `safetensors` weights are not loaded.
1385
+
1386
+ <Tip>
1387
+
1388
+ To use private or [gated models](https://huggingface.co/docs/hub/models-gated#gated-models), log-in with
1389
+ `huggingface-cli login`. You can also activate the special
1390
+ ["offline-mode"](https://huggingface.co/diffusers/installation.html#offline-mode) to use this method in a
1391
+ firewalled environment.
1392
+
1393
+ </Tip>
1394
+
1395
+ Example:
1396
+
1397
+ ```py
1398
+ from diffusers import UNet2DConditionModel
1399
+
1400
+ unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
1401
+ ```
1402
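+
+ A minimal sketch of loading the multi-view variant through `from_pretrained_2d` (the class name
+ `UNetMV2DConditionModel` and the argument values below are illustrative, not recommended settings):
+
+ ```py
+ unet = UNetMV2DConditionModel.from_pretrained_2d(
+     "runwayml/stable-diffusion-v1-5", subfolder="unet",
+     camera_embedding_type="e_de_da_sincos", num_views=6, sample_size=32,
+ )
+ ```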
+
1403
+ If you get the error message below, you need to finetune the weights for your downstream task:
1404
+
1405
+ ```bash
1406
+ Some weights of UNet2DConditionModel were not initialized from the model checkpoint at runwayml/stable-diffusion-v1-5 and are newly initialized because the shapes did not match:
1407
+ - conv_in.weight: found shape torch.Size([320, 4, 3, 3]) in the checkpoint and torch.Size([320, 9, 3, 3]) in the model instantiated
1408
+ You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
1409
+ ```
1410
+ """
1411
+ cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE)
1412
+ ignore_mismatched_sizes = kwargs.pop("ignore_mismatched_sizes", False)
1413
+ force_download = kwargs.pop("force_download", False)
1414
+ from_flax = kwargs.pop("from_flax", False)
1415
+ resume_download = kwargs.pop("resume_download", False)
1416
+ proxies = kwargs.pop("proxies", None)
1417
+ output_loading_info = kwargs.pop("output_loading_info", False)
1418
+ local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE)
1419
+ use_auth_token = kwargs.pop("use_auth_token", None)
1420
+ revision = kwargs.pop("revision", None)
1421
+ torch_dtype = kwargs.pop("torch_dtype", None)
1422
+ subfolder = kwargs.pop("subfolder", None)
1423
+ device_map = kwargs.pop("device_map", None)
1424
+ max_memory = kwargs.pop("max_memory", None)
1425
+ offload_folder = kwargs.pop("offload_folder", None)
1426
+ offload_state_dict = kwargs.pop("offload_state_dict", False)
1427
+ variant = kwargs.pop("variant", None)
1428
+ use_safetensors = kwargs.pop("use_safetensors", None)
1429
+
1430
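+ # NOTE: passing `use_safetensors=True` explicitly always raises here; leave it unset (None) so the loader tries the safetensors weights first and falls back to the pickle checkpoint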
+ if use_safetensors:
1431
+ raise ValueError(
1432
+ "`use_safetensors`=True but safetensors is not installed. Please install safetensors with `pip install safetensors"
1433
+ )
1434
+
1435
+ allow_pickle = False
1436
+ if use_safetensors is None:
1437
+ use_safetensors = True
1438
+ allow_pickle = True
1439
+
1440
+ if device_map is not None and not is_accelerate_available():
1441
+ raise NotImplementedError(
1442
+ "Loading and dispatching requires `accelerate`. Please make sure to install accelerate or set"
1443
+ " `device_map=None`. You can install accelerate with `pip install accelerate`."
1444
+ )
1445
+
1446
+ # Check if we can handle device_map and dispatching the weights
1447
+ if device_map is not None and not is_torch_version(">=", "1.9.0"):
1448
+ raise NotImplementedError(
1449
+ "Loading and dispatching requires torch >= 1.9.0. Please either update your PyTorch version or set"
1450
+ " `device_map=None`."
1451
+ )
1452
+
1453
+ # Load config if we don't provide a configuration
1454
+ config_path = pretrained_model_name_or_path
1455
+
1456
+ user_agent = {
1457
+ "diffusers": __version__,
1458
+ "file_type": "model",
1459
+ "framework": "pytorch",
1460
+ }
1461
+
1462
+ # load config
1463
+ config, unused_kwargs, commit_hash = cls.load_config(
1464
+ config_path,
1465
+ cache_dir=cache_dir,
1466
+ return_unused_kwargs=True,
1467
+ return_commit_hash=True,
1468
+ force_download=force_download,
1469
+ resume_download=resume_download,
1470
+ proxies=proxies,
1471
+ local_files_only=local_files_only,
1472
+ use_auth_token=use_auth_token,
1473
+ revision=revision,
1474
+ subfolder=subfolder,
1475
+ device_map=device_map,
1476
+ max_memory=max_memory,
1477
+ offload_folder=offload_folder,
1478
+ offload_state_dict=offload_state_dict,
1479
+ user_agent=user_agent,
1480
+ **kwargs,
1481
+ )
1482
+
1483
+ # modify config
1484
+ config["_class_name"] = cls.__name__
1485
+ config['in_channels'] = in_channels
1486
+ config['out_channels'] = out_channels
1487
+ config['sample_size'] = sample_size # training resolution
1488
+ config['num_views'] = num_views
1489
+ config['cd_attention_last'] = cd_attention_last
1490
+ config['cd_attention_mid'] = cd_attention_mid
1491
+ config['multiview_attention'] = multiview_attention
1492
+ config['sparse_mv_attention'] = sparse_mv_attention
1493
+ config['selfattn_block'] = selfattn_block
1494
+ config['mvcd_attention'] = mvcd_attention
1495
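+ # swap the standard 2D attention blocks for their multi-view (MV2D) counterparts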
+ config["down_block_types"] = [
1496
+ "CrossAttnDownBlockMV2D",
1497
+ "CrossAttnDownBlockMV2D",
1498
+ "CrossAttnDownBlockMV2D",
1499
+ "DownBlock2D"
1500
+ ]
1501
+ config['mid_block_type'] = "UNetMidBlockMV2DCrossAttn"
1502
+ config["up_block_types"] = [
1503
+ "UpBlock2D",
1504
+ "CrossAttnUpBlockMV2D",
1505
+ "CrossAttnUpBlockMV2D",
1506
+ "CrossAttnUpBlockMV2D"
1507
+ ]
1508
+
1509
+
1510
+ config['regress_elevation'] = regress_elevation # true
1511
+ config['regress_focal_length'] = regress_focal_length # true
1512
+ config['projection_camera_embeddings_input_dim'] = projection_camera_embeddings_input_dim # 2 for elevation and 10 for focal_length
1513
+ config['use_dino'] = use_dino
1514
+ config['num_regress_blocks'] = num_regress_blocks
1515
+ config['addition_downsample'] = addition_downsample
1516
+ # load model
1517
+ model_file = None
1518
+ if from_flax:
1519
+ raise NotImplementedError
1520
+ else:
1521
+ if use_safetensors:
1522
+ try:
1523
+ model_file = _get_model_file(
1524
+ pretrained_model_name_or_path,
1525
+ weights_name=_add_variant(SAFETENSORS_WEIGHTS_NAME, variant),
1526
+ cache_dir=cache_dir,
1527
+ force_download=force_download,
1528
+ resume_download=resume_download,
1529
+ proxies=proxies,
1530
+ local_files_only=local_files_only,
1531
+ use_auth_token=use_auth_token,
1532
+ revision=revision,
1533
+ subfolder=subfolder,
1534
+ user_agent=user_agent,
1535
+ commit_hash=commit_hash,
1536
+ )
1537
+ except IOError as e:
1538
+ if not allow_pickle:
1539
+ raise e
1540
+ pass
1541
+ if model_file is None:
1542
+ model_file = _get_model_file(
1543
+ pretrained_model_name_or_path,
1544
+ weights_name=_add_variant(WEIGHTS_NAME, variant),
1545
+ cache_dir=cache_dir,
1546
+ force_download=force_download,
1547
+ resume_download=resume_download,
1548
+ proxies=proxies,
1549
+ local_files_only=local_files_only,
1550
+ use_auth_token=use_auth_token,
1551
+ revision=revision,
1552
+ subfolder=subfolder,
1553
+ user_agent=user_agent,
1554
+ commit_hash=commit_hash,
1555
+ )
1556
+
1557
+ model = cls.from_config(config, **unused_kwargs)
1558
+ import copy
1559
+ state_dict_pretrain = load_state_dict(model_file, variant=variant)
1560
+ state_dict = copy.deepcopy(state_dict_pretrain)
1561
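+ # initialize the new multi-view attention layers (attn_mv / norm_mv) from the pretrained self-attention
+ # weights, zeroing the attn_mv output projection weight so the added branch starts close to a no-op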
+ if init_mvattn_with_selfattn:
1562
+ for key in state_dict_pretrain:
1563
+ if 'attn1' in key:
1564
+ key_mv = key.replace('attn1', 'attn_mv')
1565
+ state_dict[key_mv] = state_dict_pretrain[key]
1566
+ if 'to_out.0.weight' in key:
1567
+ nn.init.zeros_(state_dict[key_mv].data)
1568
+ if 'transformer_blocks' in key and 'norm1' in key: # restrict to transformer blocks so the norm1 layers inside resnet blocks are left untouched
1569
+ key_mv = key.replace('norm1', 'norm_mv')
1570
+ state_dict[key_mv] = state_dict_pretrain[key]
1571
+ del state_dict_pretrain
1572
+
1573
+ model._convert_deprecated_attention_blocks(state_dict)
1574
+
1575
+ conv_in_weight = state_dict['conv_in.weight']
1576
+ conv_out_weight = state_dict['conv_out.weight']
1577
+ model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model_2d(
1578
+ model,
1579
+ state_dict,
1580
+ model_file,
1581
+ pretrained_model_name_or_path,
1582
+ ignore_mismatched_sizes=True,
1583
+ )
1584
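+ # if conv_in / conv_out were widened relative to the checkpoint, copy the original SD weights into the
+ # first 4 channels; the extra conv_in channels are optionally zeroed, and the extra conv_out channels
+ # are duplicated when out_channels == 8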
+ if any([key == 'conv_in.weight' for key, _, _ in mismatched_keys]):
1585
+ # initialize from the original SD structure
1586
+ model.conv_in.weight.data[:,:4] = conv_in_weight
1587
+
1588
+ # optionally zero-initialize the newly added input channels
1589
+ if zero_init_conv_in:
1590
+ model.conv_in.weight.data[:,4:] = 0.
1591
+
1592
+ if any([key == 'conv_out.weight' for key, _, _ in mismatched_keys]):
1593
+ # initialize from the original SD structure
1594
+ model.conv_out.weight.data[:,:4] = conv_out_weight
1595
+ if out_channels == 8: # copy for the last 4 channels
1596
+ model.conv_out.weight.data[:, 4:] = conv_out_weight
1597
+
1598
+ if (regress_elevation or regress_focal_length) and zero_init_camera_projection: # true
1599
+ params = [p for p in model.camera_embedding.parameters()]
1600
+ torch.nn.init.zeros_(params[-1].data)
1601
+
1602
+ loading_info = {
1603
+ "missing_keys": missing_keys,
1604
+ "unexpected_keys": unexpected_keys,
1605
+ "mismatched_keys": mismatched_keys,
1606
+ "error_msgs": error_msgs,
1607
+ }
1608
+
1609
+ if torch_dtype is not None and not isinstance(torch_dtype, torch.dtype):
1610
+ raise ValueError(
1611
+ f"{torch_dtype} needs to be of type `torch.dtype`, e.g. `torch.float16`, but is {type(torch_dtype)}."
1612
+ )
1613
+ elif torch_dtype is not None:
1614
+ model = model.to(torch_dtype)
1615
+
1616
+ model.register_to_config(_name_or_path=pretrained_model_name_or_path)
1617
+
1618
+ # Set model in evaluation mode to deactivate DropOut modules by default
1619
+ model.eval()
1620
+ if output_loading_info:
1621
+ return model, loading_info
1622
+ return model
1623
+
1624
+ @classmethod
1625
+ def _load_pretrained_model_2d(
1626
+ cls,
1627
+ model,
1628
+ state_dict,
1629
+ resolved_archive_file,
1630
+ pretrained_model_name_or_path,
1631
+ ignore_mismatched_sizes=False,
1632
+ ):
1633
+ # Retrieve missing & unexpected_keys
1634
+ model_state_dict = model.state_dict()
1635
+ loaded_keys = list(state_dict.keys())
1636
+
1637
+ expected_keys = list(model_state_dict.keys())
1638
+
1639
+ original_loaded_keys = loaded_keys
1640
+
1641
+ missing_keys = list(set(expected_keys) - set(loaded_keys))
1642
+ unexpected_keys = list(set(loaded_keys) - set(expected_keys))
1643
+
1644
+ # Make sure we are able to load base models as well as derived models (with heads)
1645
+ model_to_load = model
1646
+
1647
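+ # size-mismatched tensors are dropped from the checkpoint state dict so the model keeps its fresh
+ # initialization for them; the keys are reported back so the caller can re-initialize them (as done
+ # above for conv_in / conv_out)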
+ def _find_mismatched_keys(
1648
+ state_dict,
1649
+ model_state_dict,
1650
+ loaded_keys,
1651
+ ignore_mismatched_sizes,
1652
+ ):
1653
+ mismatched_keys = []
1654
+ if ignore_mismatched_sizes:
1655
+ for checkpoint_key in loaded_keys:
1656
+ model_key = checkpoint_key
1657
+
1658
+ if (
1659
+ model_key in model_state_dict
1660
+ and state_dict[checkpoint_key].shape != model_state_dict[model_key].shape
1661
+ ):
1662
+ mismatched_keys.append(
1663
+ (checkpoint_key, state_dict[checkpoint_key].shape, model_state_dict[model_key].shape)
1664
+ )
1665
+ del state_dict[checkpoint_key]
1666
+ return mismatched_keys
1667
+
1668
+ if state_dict is not None:
1669
+ # Whole checkpoint
1670
+ mismatched_keys = _find_mismatched_keys(
1671
+ state_dict,
1672
+ model_state_dict,
1673
+ original_loaded_keys,
1674
+ ignore_mismatched_sizes,
1675
+ )
1676
+ error_msgs = _load_state_dict_into_model(model_to_load, state_dict)
1677
+
1678
+ if len(error_msgs) > 0:
1679
+ error_msg = "\n\t".join(error_msgs)
1680
+ if "size mismatch" in error_msg:
1681
+ error_msg += (
1682
+ "\n\tYou may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method."
1683
+ )
1684
+ raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
1685
+
1686
+ if len(unexpected_keys) > 0:
1687
+ logger.warning(
1688
+ f"Some weights of the model checkpoint at {pretrained_model_name_or_path} were not used when"
1689
+ f" initializing {model.__class__.__name__}: {unexpected_keys}\n- This IS expected if you are"
1690
+ f" initializing {model.__class__.__name__} from the checkpoint of a model trained on another task"
1691
+ " or with another architecture (e.g. initializing a BertForSequenceClassification model from a"
1692
+ " BertForPreTraining model).\n- This IS NOT expected if you are initializing"
1693
+ f" {model.__class__.__name__} from the checkpoint of a model that you expect to be exactly"
1694
+ " identical (initializing a BertForSequenceClassification model from a"
1695
+ " BertForSequenceClassification model)."
1696
+ )
1697
+ else:
1698
+ logger.info(f"All model checkpoint weights were used when initializing {model.__class__.__name__}.\n")
1699
+ if len(missing_keys) > 0:
1700
+ logger.warning(
1701
+ f"Some weights of {model.__class__.__name__} were not initialized from the model checkpoint at"
1702
+ f" {pretrained_model_name_or_path} and are newly initialized: {missing_keys}\nYou should probably"
1703
+ " TRAIN this model on a down-stream task to be able to use it for predictions and inference."
1704
+ )
1705
+ elif len(mismatched_keys) == 0:
1706
+ logger.info(
1707
+ f"All the weights of {model.__class__.__name__} were initialized from the model checkpoint at"
1708
+ f" {pretrained_model_name_or_path}.\nIf your task is similar to the task the model of the"
1709
+ f" checkpoint was trained on, you can already use {model.__class__.__name__} for predictions"
1710
+ " without further training."
1711
+ )
1712
+ if len(mismatched_keys) > 0:
1713
+ mismatched_warning = "\n".join(
1714
+ [
1715
+ f"- {key}: found shape {shape1} in the checkpoint and {shape2} in the model instantiated"
1716
+ for key, shape1, shape2 in mismatched_keys
1717
+ ]
1718
+ )
1719
+ logger.warning(
1720
+ f"Some weights of {model.__class__.__name__} were not initialized from the model checkpoint at"
1721
+ f" {pretrained_model_name_or_path} and are newly initialized because the shapes did not"
1722
+ f" match:\n{mismatched_warning}\nYou should probably TRAIN this model on a down-stream task to be"
1723
+ " able to use it for predictions and inference."
1724
+ )
1725
+
1726
+ return model, missing_keys, unexpected_keys, mismatched_keys, error_msgs
1727
+