"""
This module implements the FaceLocator class, which is a neural network model designed to
locate and extract facial features from input images or tensors. It uses a series of
convolutional layers to progressively downsample and refine the facial feature map.

The FaceLocator class is part of a larger system that may involve facial recognition or
similar tasks where precise location and extraction of facial features are required.

Attributes:
    conditioning_embedding_channels (int): The number of channels in the output embedding.
    conditioning_channels (int): The number of input channels for the conditioning tensor.
    block_out_channels (Tuple[int]): A tuple of integers representing the output channels
        for each block in the model.

The model uses the following components:
- InflatedConv3d: A convolutional layer that inflates the input to increase the depth.
- zero_module: A utility function that may set certain parameters to zero for regularization
    or other purposes.

The forward method of the FaceLocator class takes a conditioning tensor as input and
produces an embedding tensor as output, which can be used for further processing or analysis.
"""

from typing import Tuple

import torch.nn.functional as F
from diffusers.models.modeling_utils import ModelMixin
from torch import nn

from .motion_module import zero_module
from .resnet import InflatedConv3d


class FaceLocator(ModelMixin):
    """
    The FaceLocator class is a neural network model designed to process and extract facial
    features from an input tensor. It consists of a series of convolutional layers that
    progressively downsample the input while increasing the depth of the feature map.

    The model is built using InflatedConv3d layers, which are designed to inflate the
    feature channels, allowing for more complex feature extraction. The final output is a
    conditioning embedding that can be used for various tasks such as facial recognition or
    feature-based image manipulation.

    Parameters:
        conditioning_embedding_channels (int): The number of channels in the output embedding.
        conditioning_channels (int, optional): The number of input channels for the conditioning tensor. Default is 3.
        block_out_channels (Tuple[int], optional): A tuple of integers representing the output channels
            for each block in the model. The default is (16, 32, 64, 128), which defines the
            progression of the network's depth.

    Attributes:
        conv_in (InflatedConv3d): The initial convolutional layer that starts the feature extraction process.
        blocks (ModuleList[InflatedConv3d]): A list of convolutional layers that form the core of the model.
        conv_out (InflatedConv3d): The final convolutional layer that produces the output embedding.

    The forward method applies the convolutional layers to the input conditioning tensor and
    returns the resulting embedding tensor.
    """
    def __init__(
        self,
        conditioning_embedding_channels: int,
        conditioning_channels: int = 3,
        block_out_channels: Tuple[int, ...] = (16, 32, 64, 128),
    ):
        super().__init__()
        self.conv_in = InflatedConv3d(
            conditioning_channels, block_out_channels[0], kernel_size=3, padding=1
        )

        self.blocks = nn.ModuleList([])

        # Each stage pairs a stride-1 conv (refine at the current resolution)
        # with a stride-2 conv (downsample while widening the channels).
        for i in range(len(block_out_channels) - 1):
            channel_in = block_out_channels[i]
            channel_out = block_out_channels[i + 1]
            self.blocks.append(
                InflatedConv3d(channel_in, channel_in, kernel_size=3, padding=1)
            )
            self.blocks.append(
                InflatedConv3d(
                    channel_in, channel_out, kernel_size=3, padding=1, stride=2
                )
            )

        # Zero-initialised output projection, so the conditioning branch
        # contributes nothing at the start of training.
        self.conv_out = zero_module(
            InflatedConv3d(
                block_out_channels[-1],
                conditioning_embedding_channels,
                kernel_size=3,
                padding=1,
            )
        )

    def forward(self, conditioning):
        """
        Forward pass of the FaceLocator model.

        Args:
            conditioning (Tensor): The input conditioning tensor.

        Returns:
            Tensor: The output embedding tensor.
        """
        embedding = self.conv_in(conditioning)
        embedding = F.silu(embedding)

        for block in self.blocks:
            embedding = block(embedding)
            embedding = F.silu(embedding)

        embedding = self.conv_out(embedding)

        return embedding
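# As a sanity check on the shape arithmetic above, the sketch below rebuilds the
# same layer stack with minimal stand-ins for the two relative imports. The real
# InflatedConv3d and zero_module live in .resnet and .motion_module; the stand-ins
# here are assumptions based on the common "inflated 2D conv" pattern and may
# differ from the actual implementations.
#
# ```python
# import torch
# from torch import nn
# import torch.nn.functional as F
#
# class InflatedConv3d(nn.Conv2d):
#     """Stand-in: apply the 2D conv to each frame of a (b, c, f, h, w) tensor."""
#     def forward(self, x):
#         b, c, f, h, w = x.shape
#         x = x.transpose(1, 2).reshape(b * f, c, h, w)   # fold frames into batch
#         x = super().forward(x)
#         return x.reshape(b, f, *x.shape[1:]).transpose(1, 2)
#
# def zero_module(module):
#     """Stand-in: zero-initialise every parameter of the module."""
#     for p in module.parameters():
#         nn.init.zeros_(p)
#     return module
#
# # Rebuild the FaceLocator stack with conditioning_embedding_channels=320
# # and the default block_out_channels.
# block_out = (16, 32, 64, 128)
# conv_in = InflatedConv3d(3, block_out[0], kernel_size=3, padding=1)
# blocks = nn.ModuleList()
# for c_in, c_out in zip(block_out[:-1], block_out[1:]):
#     blocks.append(InflatedConv3d(c_in, c_in, kernel_size=3, padding=1))
#     blocks.append(InflatedConv3d(c_in, c_out, kernel_size=3, padding=1, stride=2))
# conv_out = zero_module(InflatedConv3d(block_out[-1], 320, kernel_size=3, padding=1))
#
# x = torch.randn(1, 3, 2, 64, 64)  # (batch, channels, frames, H, W)
# h = F.silu(conv_in(x))
# for blk in blocks:
#     h = F.silu(blk(h))
# out = conv_out(h)
# # Three stride-2 convs halve H and W three times: 64 -> 8, so out has shape
# # (1, 320, 2, 8, 8); it is all zeros because conv_out is zero-initialised.
# ```
#
# The all-zeros initial output is the point of zero_module: when this embedding is
# added into a host network, training starts from the unconditioned behaviour.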