tuandunghcmut
committed on
Upload model
- README.md +199 -0
- config.json +71 -0
- configuration_solider.py +80 -0
- model.safetensors +3 -0
- modeling_solider.py +1840 -0
README.md
ADDED
@@ -0,0 +1,199 @@
---
library_name: transformers
tags: []
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]
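The "How to Get Started with the Model" section above is still a placeholder. Because config.json (below) maps `AutoConfig`/`AutoModel` to the bundled SOLIDER classes through `auto_map`, loading would follow the usual `trust_remote_code=True` pattern. The snippet below is a minimal sketch, not code from this repo: the repo id is a placeholder, and the forward call is an assumption because `SOLIDERModel.forward` is not shown in this diff.

```python
import torch
from transformers import AutoConfig, AutoModel

# Hypothetical repo id -- replace with the actual Hub path of this upload.
repo_id = "tuandunghcmut/solider-base"

# trust_remote_code=True is required because auto_map in config.json points to
# configuration_solider.SOLIDERConfig and modeling_solider.SOLIDERModel in this repo.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# A 224x224 RGB batch, matching img_size / in_channels in config.json.
pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    # Assumption: the exact forward signature and output structure are defined
    # later in modeling_solider.py and are not visible in this diff.
    outputs = model(pixel_values)
```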
config.json
ADDED
@@ -0,0 +1,71 @@
{
  "act_cfg": {
    "type": "GELU"
  },
  "architectures": [
    "SOLIDERModel"
  ],
  "attn_drop_rate": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_solider.SOLIDERConfig",
    "AutoModel": "modeling_solider.SOLIDERModel"
  },
  "convert_weights": false,
  "depths": [2, 2, 18, 2],
  "drop_path_rate": 0.0,
  "drop_rate": 0.0,
  "embed_dims": 128,
  "frozen_stages": -1,
  "hidden_size": 128,
  "img_size": [224, 224],
  "in_channels": 3,
  "init_cfg": null,
  "mlp_ratio": 4,
  "model_type": "swin_transformer",
  "name": "solider_base",
  "norm_cfg": {
    "type": "LN"
  },
  "num_heads": [4, 8, 16, 32],
  "out_indices": [0, 1, 2, 3],
  "patch_norm": true,
  "patch_size": 4,
  "pretrain_img_size": [224, 224],
  "pretrained": null,
  "qk_scale": null,
  "qkv_bias": true,
  "semantic_weight": 0.2,
  "strides": [4, 2, 2, 2],
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "use_abs_pos_embed": false,
  "vision_width": 1024,
  "window_size": 7,
  "with_cp": false
}
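A note on two derived fields: `hidden_size` simply mirrors `embed_dims` (128), while `vision_width` (1024) is the channel width of the last Swin stage, since each PatchMerging step doubles the channels across the four stages. The sketch below reproduces that geometry with standard Swin arithmetic; it is an illustration, not code from this repo.

```python
# Per-stage geometry implied by config.json for a 224x224 input.
embed_dims = 128        # "embed_dims" / "hidden_size"
img = 224               # "img_size"
strides = [4, 2, 2, 2]  # patch-embed stride, then one downsample per later stage

dim, side = embed_dims, img
for i, s in enumerate(strides):
    side //= s
    if i > 0:
        dim *= 2        # PatchMerging doubles the channel dimension
    print(f"stage {i}: {side}x{side} tokens, {dim} channels")
# stage 0: 56x56 tokens, 128 channels
# stage 1: 28x28 tokens, 256 channels
# stage 2: 14x14 tokens, 512 channels
# stage 3: 7x7 tokens, 1024 channels  -> matches "vision_width": 1024
```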
configuration_solider.py
ADDED
@@ -0,0 +1,80 @@
from transformers.configuration_utils import PretrainedConfig

BACKBONE_NAME2WIDTH = {
    "swin_tiny_patch4_window7_224": 768,
    "swin_small_patch4_window7_224": 768,
    "swin_base_patch4_window7_224": 1024,
    "solider_tiny": 768,
    "solider_small": 768,
    "solider_base": 1024,
}


class SOLIDERConfig(PretrainedConfig):
    model_type = "swin_transformer"

    def __init__(
        self,
        pretrain_img_size=224,
        in_channels=3,
        embed_dims=96,
        patch_size=4,
        window_size=7,
        mlp_ratio=4,
        depths=(2, 2, 6, 2),
        num_heads=(3, 6, 12, 24),
        strides=(4, 2, 2, 2),
        out_indices=(0, 1, 2, 3),
        qkv_bias=True,
        qk_scale=None,
        patch_norm=True,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.0,  # NOTE: modified from the original SOLIDER implementation
        use_abs_pos_embed=False,
        act_cfg=dict(type="GELU"),
        norm_cfg=dict(type="LN"),
        with_cp=False,
        pretrained=None,
        convert_weights=False,
        frozen_stages=-1,
        init_cfg=None,
        semantic_weight=0.2,  # NOTE: modified from the original SOLIDER implementation
        name="solider_small",
        **kwargs,
    ):
        self.pretrain_img_size = pretrain_img_size
        self.in_channels = in_channels
        self.embed_dims = embed_dims
        self.patch_size = patch_size
        self.window_size = window_size
        self.mlp_ratio = mlp_ratio
        self.depths = depths
        self.num_heads = num_heads
        self.strides = strides
        self.out_indices = out_indices
        self.qkv_bias = qkv_bias
        self.qk_scale = qk_scale
        self.patch_norm = patch_norm
        self.drop_rate = drop_rate
        self.attn_drop_rate = attn_drop_rate
        self.drop_path_rate = drop_path_rate
        self.use_abs_pos_embed = use_abs_pos_embed
        self.act_cfg = act_cfg
        self.norm_cfg = norm_cfg
        self.with_cp = with_cp
        self.pretrained = pretrained
        self.convert_weights = convert_weights
        self.frozen_stages = frozen_stages
        self.init_cfg = init_cfg
        self.semantic_weight = semantic_weight

        # NOTE: the attributes below are informational only;
        # they have no effect on model building.
        self.img_size = pretrain_img_size
        assert name in BACKBONE_NAME2WIDTH
        self.name = name
        self.vision_width = BACKBONE_NAME2WIDTH[self.name]
        self.hidden_size = self.embed_dims

        super().__init__(**kwargs)
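The defaults above describe the small variant; the shipped config.json overrides them to `solider_base`. A minimal construction sketch, assuming configuration_solider.py has been downloaded next to the script (inside the repo it is loaded through `auto_map` instead):

```python
from configuration_solider import SOLIDERConfig

# Reproduce the "solider_base" settings from config.json above.
config = SOLIDERConfig(
    embed_dims=128,
    depths=(2, 2, 18, 2),
    num_heads=(4, 8, 16, 32),
    window_size=7,
    semantic_weight=0.2,
    name="solider_base",
)

assert config.hidden_size == 128    # mirrors embed_dims
assert config.vision_width == 1024  # looked up from BACKBONE_NAME2WIDTH
print(config.model_type)            # "swin_transformer"
```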
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4f69b0364dd102a368bcf73928e645edfe6e723acb1195f18b6ab55dec4f918d
size 347551320
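model.safetensors is stored through Git LFS, so the commit only carries this pointer (content hash and size). A minimal sketch for checking a locally downloaded copy against the pointer; the local path is an assumption:

```python
import hashlib
import os

# Hypothetical local path to the downloaded weights file.
path = "model.safetensors"

# Values from the LFS pointer above.
expected_sha256 = "4f69b0364dd102a368bcf73928e645edfe6e723acb1195f18b6ab55dec4f918d"
expected_size = 347551320

h = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert os.path.getsize(path) == expected_size
assert h.hexdigest() == expected_sha256
print("model.safetensors matches the LFS pointer")
```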
modeling_solider.py
ADDED
@@ -0,0 +1,1840 @@
1 |
+
import os
|
2 |
+
import warnings
|
3 |
+
from collections import OrderedDict
|
4 |
+
from copy import deepcopy
|
5 |
+
import logging
|
6 |
+
import math
|
7 |
+
from typing import Sequence
|
8 |
+
import torch
|
9 |
+
import torch.nn as nn
|
10 |
+
import torch.nn.functional as F
|
11 |
+
import torch.utils.checkpoint as cp
|
12 |
+
import numpy as np
|
13 |
+
import cv2
|
14 |
+
from dataclasses import dataclass
|
15 |
+
from transformers import PreTrainedModel, PretrainedConfig
|
16 |
+
from transformers import PretrainedConfig
|
17 |
+
|
18 |
+
# from .lavis_base_model import BaseEncoder
|
19 |
+
# from lavis.common.registry import registry
|
20 |
+
|
21 |
+
from torch.nn import Module as BaseModule
|
22 |
+
from torch.nn import ModuleList
|
23 |
+
from torch.nn import Sequential
|
24 |
+
from torch.nn import Linear
|
25 |
+
from torch import Tensor
|
26 |
+
from itertools import repeat
|
27 |
+
import collections.abc
|
28 |
+
|
29 |
+
from .configuration_solider import SOLIDERConfig, BACKBONE_NAME2WIDTH
|
30 |
+
def _ntuple(n):
|
31 |
+
def parse(x):
|
32 |
+
if isinstance(x, collections.abc.Iterable):
|
33 |
+
return x
|
34 |
+
return tuple(repeat(x, n))
|
35 |
+
|
36 |
+
return parse
|
37 |
+
|
38 |
+
|
39 |
+
to_2tuple = _ntuple(2)
|
40 |
+
|
41 |
+
|
42 |
+
def trunc_normal_init(
|
43 |
+
module: nn.Module,
|
44 |
+
mean: float = 0,
|
45 |
+
std: float = 1,
|
46 |
+
a: float = -2,
|
47 |
+
b: float = 2,
|
48 |
+
bias: float = 0,
|
49 |
+
) -> None:
|
50 |
+
if hasattr(module, "weight") and module.weight is not None:
|
51 |
+
# trunc_normal_(module.weight, mean, std, a, b) # type: ignore
|
52 |
+
_no_grad_trunc_normal_(module.weight, mean, std, a, b) # type: ignore
|
53 |
+
if hasattr(module, "bias") and module.bias is not None:
|
54 |
+
nn.init.constant_(module.bias, bias) # type: ignore
|
55 |
+
|
56 |
+
|
57 |
+
def _no_grad_trunc_normal_(
|
58 |
+
tensor: Tensor, mean: float, std: float, a: float, b: float
|
59 |
+
) -> Tensor:
|
60 |
+
# Method based on
|
61 |
+
# https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
|
62 |
+
# Modified from
|
63 |
+
# https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py
|
64 |
+
def norm_cdf(x):
|
65 |
+
# Computes standard normal cumulative distribution function
|
66 |
+
return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0
|
67 |
+
|
68 |
+
if (mean < a - 2 * std) or (mean > b + 2 * std):
|
69 |
+
warnings.warn(
|
70 |
+
"mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
|
71 |
+
"The distribution of values may be incorrect.",
|
72 |
+
stacklevel=2,
|
73 |
+
)
|
74 |
+
|
75 |
+
with torch.no_grad():
|
76 |
+
# Values are generated by using a truncated uniform distribution and
|
77 |
+
# then using the inverse CDF for the normal distribution.
|
78 |
+
# Get upper and lower cdf values
|
79 |
+
lower = norm_cdf((a - mean) / std)
|
80 |
+
upper = norm_cdf((b - mean) / std)
|
81 |
+
|
82 |
+
# Uniformly fill tensor with values from [lower, upper], then translate
|
83 |
+
# to [2lower-1, 2upper-1].
|
84 |
+
tensor.uniform_(2 * lower - 1, 2 * upper - 1)
|
85 |
+
|
86 |
+
# Use inverse cdf transform for normal distribution to get truncated
|
87 |
+
# standard normal
|
88 |
+
tensor.erfinv_()
|
89 |
+
|
90 |
+
# Transform to proper mean, std
|
91 |
+
tensor.mul_(std * math.sqrt(2.0))
|
92 |
+
tensor.add_(mean)
|
93 |
+
|
94 |
+
# Clamp to ensure it's in the proper range
|
95 |
+
tensor.clamp_(min=a, max=b)
|
96 |
+
return tensor
|
97 |
+
|
98 |
+
|
99 |
+
def trunc_normal_(
|
100 |
+
tensor: Tensor, mean: float = 0.0, std: float = 1.0, a: float = -2.0, b: float = 2.0
|
101 |
+
) -> Tensor:
|
102 |
+
r"""Fills the input Tensor with values drawn from a truncated
|
103 |
+
normal distribution. The values are effectively drawn from the
|
104 |
+
normal distribution :math:`\mathcal{N}(\text{mean}, \text{std}^2)`
|
105 |
+
with values outside :math:`[a, b]` redrawn until they are within
|
106 |
+
the bounds. The method used for generating the random values works
|
107 |
+
best when :math:`a \leq \text{mean} \leq b`.
|
108 |
+
|
109 |
+
Modified from
|
110 |
+
https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py
|
111 |
+
|
112 |
+
Args:
|
113 |
+
tensor (``torch.Tensor``): an n-dimensional `torch.Tensor`.
|
114 |
+
mean (float): the mean of the normal distribution.
|
115 |
+
std (float): the standard deviation of the normal distribution.
|
116 |
+
a (float): the minimum cutoff value.
|
117 |
+
b (float): the maximum cutoff value.
|
118 |
+
"""
|
119 |
+
return _no_grad_trunc_normal_(tensor, mean, std, a, b)
|
120 |
+
|
121 |
+
|
122 |
+
def constant_init(module, val, bias=0):
|
123 |
+
if hasattr(module, "weight") and module.weight is not None:
|
124 |
+
nn.init.constant_(module.weight, val)
|
125 |
+
if hasattr(module, "bias") and module.bias is not None:
|
126 |
+
nn.init.constant_(module.bias, bias)
|
127 |
+
|
128 |
+
|
129 |
+
def build_norm_layer(norm_cfg, embed_dims):
|
130 |
+
assert norm_cfg["type"] == "LN"
|
131 |
+
norm_layer = nn.LayerNorm(embed_dims)
|
132 |
+
return norm_cfg["type"], norm_layer
|
133 |
+
|
134 |
+
|
135 |
+
class GELU(nn.Module):
|
136 |
+
r"""Applies the Gaussian Error Linear Units function:
|
137 |
+
|
138 |
+
.. math::
|
139 |
+
\text{GELU}(x) = x * \Phi(x)
|
140 |
+
where :math:`\Phi(x)` is the Cumulative Distribution Function for
|
141 |
+
Gaussian Distribution.
|
142 |
+
|
143 |
+
Shape:
|
144 |
+
- Input: :math:`(N, *)` where `*` means, any number of additional
|
145 |
+
dimensions
|
146 |
+
- Output: :math:`(N, *)`, same shape as the input
|
147 |
+
|
148 |
+
.. image:: scripts/activation_images/GELU.png
|
149 |
+
|
150 |
+
Examples::
|
151 |
+
|
152 |
+
>>> m = nn.GELU()
|
153 |
+
>>> input = torch.randn(2)
|
154 |
+
>>> output = m(input)
|
155 |
+
"""
|
156 |
+
|
157 |
+
def forward(self, input):
|
158 |
+
return F.gelu(input)
|
159 |
+
|
160 |
+
|
161 |
+
def build_activation_layer(act_cfg):
|
162 |
+
if act_cfg["type"] == "ReLU":
|
163 |
+
act_layer = nn.ReLU(inplace=act_cfg["inplace"])
|
164 |
+
elif act_cfg["type"] == "GELU":
|
165 |
+
act_layer = GELU()
|
166 |
+
return act_layer
|
167 |
+
|
168 |
+
|
169 |
+
def build_conv_layer(
|
170 |
+
conv_cfg, in_channels, out_channels, kernel_size, stride, padding, dilation, bias
|
171 |
+
):
|
172 |
+
conv_layer = nn.Conv2d(
|
173 |
+
in_channels=in_channels,
|
174 |
+
out_channels=out_channels,
|
175 |
+
kernel_size=kernel_size,
|
176 |
+
stride=stride,
|
177 |
+
padding=padding,
|
178 |
+
dilation=dilation,
|
179 |
+
bias=bias,
|
180 |
+
)
|
181 |
+
return conv_layer
|
182 |
+
|
183 |
+
|
184 |
+
def drop_path(x, drop_prob=0.0, training=False):
|
185 |
+
"""Drop paths (Stochastic Depth) per sample (when applied in main path of
|
186 |
+
residual blocks).
|
187 |
+
|
188 |
+
We follow the implementation
|
189 |
+
https://github.com/rwightman/pytorch-image-models/blob/a2727c1bf78ba0d7b5727f5f95e37fb7f8866b1f/timm/models/layers/drop.py # noqa: E501
|
190 |
+
"""
|
191 |
+
if drop_prob == 0.0 or not training:
|
192 |
+
return x
|
193 |
+
keep_prob = 1 - drop_prob
|
194 |
+
# handle tensors with different dimensions, not just 4D tensors.
|
195 |
+
shape = (x.shape[0],) + (1,) * (x.ndim - 1)
|
196 |
+
random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
|
197 |
+
output = x.div(keep_prob) * random_tensor.floor()
|
198 |
+
return output
|
199 |
+
|
200 |
+
|
201 |
+
class DropPath(nn.Module):
|
202 |
+
"""Drop paths (Stochastic Depth) per sample (when applied in main path of
|
203 |
+
residual blocks).
|
204 |
+
|
205 |
+
We follow the implementation
|
206 |
+
https://github.com/rwightman/pytorch-image-models/blob/a2727c1bf78ba0d7b5727f5f95e37fb7f8866b1f/timm/models/layers/drop.py # noqa: E501
|
207 |
+
|
208 |
+
Args:
|
209 |
+
drop_prob (float): Probability of the path to be zeroed. Default: 0.1
|
210 |
+
"""
|
211 |
+
|
212 |
+
def __init__(self, drop_prob=0.1):
|
213 |
+
super(DropPath, self).__init__()
|
214 |
+
self.drop_prob = drop_prob
|
215 |
+
|
216 |
+
def forward(self, x):
|
217 |
+
return drop_path(x, self.drop_prob, self.training)
|
218 |
+
|
219 |
+
|
220 |
+
def build_dropout(drop_cfg):
|
221 |
+
drop_layer = DropPath(drop_cfg["drop_prob"])
|
222 |
+
return drop_layer
|
223 |
+
|
224 |
+
|
225 |
+
class FFN(BaseModule):
|
226 |
+
def __init__(
|
227 |
+
self,
|
228 |
+
embed_dims=256,
|
229 |
+
feedforward_channels=1024,
|
230 |
+
num_fcs=2,
|
231 |
+
act_cfg=dict(type="ReLU", inplace=True),
|
232 |
+
ffn_drop=0.0,
|
233 |
+
dropout_layer=None,
|
234 |
+
add_identity=True,
|
235 |
+
init_cfg=None,
|
236 |
+
**kwargs,
|
237 |
+
):
|
238 |
+
super(FFN, self).__init__()
|
239 |
+
assert num_fcs >= 2, "num_fcs should be no less " f"than 2. got {num_fcs}."
|
240 |
+
self.embed_dims = embed_dims
|
241 |
+
self.feedforward_channels = feedforward_channels
|
242 |
+
self.num_fcs = num_fcs
|
243 |
+
self.act_cfg = act_cfg
|
244 |
+
self.activate = build_activation_layer(act_cfg)
|
245 |
+
|
246 |
+
layers = []
|
247 |
+
in_channels = embed_dims
|
248 |
+
for _ in range(num_fcs - 1):
|
249 |
+
layers.append(
|
250 |
+
Sequential(
|
251 |
+
Linear(in_channels, feedforward_channels),
|
252 |
+
self.activate,
|
253 |
+
nn.Dropout(ffn_drop),
|
254 |
+
)
|
255 |
+
)
|
256 |
+
in_channels = feedforward_channels
|
257 |
+
layers.append(Linear(feedforward_channels, embed_dims))
|
258 |
+
layers.append(nn.Dropout(ffn_drop))
|
259 |
+
self.layers = Sequential(*layers)
|
260 |
+
self.dropout_layer = (
|
261 |
+
build_dropout(dropout_layer) if dropout_layer else torch.nn.Identity()
|
262 |
+
)
|
263 |
+
self.add_identity = add_identity
|
264 |
+
|
265 |
+
def forward(self, x, identity=None):
|
266 |
+
"""Forward function for `FFN`.
|
267 |
+
|
268 |
+
The function would add x to the output tensor if residue is None.
|
269 |
+
"""
|
270 |
+
out = self.layers(x)
|
271 |
+
if not self.add_identity:
|
272 |
+
return self.dropout_layer(out)
|
273 |
+
if identity is None:
|
274 |
+
identity = x
|
275 |
+
return identity + self.dropout_layer(out)
|
276 |
+
|
277 |
+
|
278 |
+
def swin_converter(ckpt):
|
279 |
+
new_ckpt = OrderedDict()
|
280 |
+
|
281 |
+
def correct_unfold_reduction_order(x):
|
282 |
+
out_channel, in_channel = x.shape
|
283 |
+
x = x.reshape(out_channel, 4, in_channel // 4)
|
284 |
+
x = x[:, [0, 2, 1, 3], :].transpose(1, 2).reshape(out_channel, in_channel)
|
285 |
+
return x
|
286 |
+
|
287 |
+
def correct_unfold_norm_order(x):
|
288 |
+
in_channel = x.shape[0]
|
289 |
+
x = x.reshape(4, in_channel // 4)
|
290 |
+
x = x[[0, 2, 1, 3], :].transpose(0, 1).reshape(in_channel)
|
291 |
+
return x
|
292 |
+
|
293 |
+
for k, v in ckpt.items():
|
294 |
+
if k.startswith("head"):
|
295 |
+
continue
|
296 |
+
elif k.startswith("layers"):
|
297 |
+
new_v = v
|
298 |
+
if "attn." in k:
|
299 |
+
new_k = k.replace("attn.", "attn.w_msa.")
|
300 |
+
elif "mlp." in k:
|
301 |
+
if "mlp.fc1." in k:
|
302 |
+
new_k = k.replace("mlp.fc1.", "ffn.layers.0.0.")
|
303 |
+
elif "mlp.fc2." in k:
|
304 |
+
new_k = k.replace("mlp.fc2.", "ffn.layers.1.")
|
305 |
+
else:
|
306 |
+
new_k = k.replace("mlp.", "ffn.")
|
307 |
+
elif "downsample" in k:
|
308 |
+
new_k = k
|
309 |
+
if "reduction." in k:
|
310 |
+
new_v = correct_unfold_reduction_order(v)
|
311 |
+
elif "norm." in k:
|
312 |
+
new_v = correct_unfold_norm_order(v)
|
313 |
+
else:
|
314 |
+
new_k = k
|
315 |
+
new_k = new_k.replace("layers", "stages", 1)
|
316 |
+
elif k.startswith("patch_embed"):
|
317 |
+
new_v = v
|
318 |
+
if "proj" in k:
|
319 |
+
new_k = k.replace("proj", "projection")
|
320 |
+
else:
|
321 |
+
new_k = k
|
322 |
+
else:
|
323 |
+
new_v = v
|
324 |
+
new_k = k
|
325 |
+
|
326 |
+
new_ckpt["backbone." + new_k] = new_v
|
327 |
+
|
328 |
+
return new_ckpt
|
329 |
+
|
330 |
+
|
331 |
+
class AdaptivePadding(nn.Module):
|
332 |
+
"""Applies padding to input (if needed) so that input can get fully covered
|
333 |
+
by filter you specified. It support two modes "same" and "corner". The
|
334 |
+
"same" mode is same with "SAME" padding mode in TensorFlow, pad zero around
|
335 |
+
input. The "corner" mode would pad zero to bottom right.
|
336 |
+
Args:
|
337 |
+
kernel_size (int | tuple): Size of the kernel:
|
338 |
+
stride (int | tuple): Stride of the filter. Default: 1:
|
339 |
+
dilation (int | tuple): Spacing between kernel elements.
|
340 |
+
Default: 1
|
341 |
+
padding (str): Support "same" and "corner", "corner" mode
|
342 |
+
would pad zero to bottom right, and "same" mode would
|
343 |
+
pad zero around input. Default: "corner".
|
344 |
+
Example:
|
345 |
+
>>> kernel_size = 16
|
346 |
+
>>> stride = 16
|
347 |
+
>>> dilation = 1
|
348 |
+
>>> input = torch.rand(1, 1, 15, 17)
|
349 |
+
>>> adap_pad = AdaptivePadding(
|
350 |
+
>>> kernel_size=kernel_size,
|
351 |
+
>>> stride=stride,
|
352 |
+
>>> dilation=dilation,
|
353 |
+
>>> padding="corner")
|
354 |
+
>>> out = adap_pad(input)
|
355 |
+
>>> assert (out.shape[2], out.shape[3]) == (16, 32)
|
356 |
+
>>> input = torch.rand(1, 1, 16, 17)
|
357 |
+
>>> out = adap_pad(input)
|
358 |
+
>>> assert (out.shape[2], out.shape[3]) == (16, 32)
|
359 |
+
"""
|
360 |
+
|
361 |
+
def __init__(self, kernel_size=1, stride=1, dilation=1, padding="corner"):
|
362 |
+
super(AdaptivePadding, self).__init__()
|
363 |
+
|
364 |
+
assert padding in ("same", "corner")
|
365 |
+
|
366 |
+
kernel_size = to_2tuple(kernel_size)
|
367 |
+
stride = to_2tuple(stride)
|
368 |
+
padding = to_2tuple(padding)
|
369 |
+
dilation = to_2tuple(dilation)
|
370 |
+
|
371 |
+
self.padding = padding
|
372 |
+
self.kernel_size = kernel_size
|
373 |
+
self.stride = stride
|
374 |
+
self.dilation = dilation
|
375 |
+
|
376 |
+
def get_pad_shape(self, input_shape):
|
377 |
+
input_h, input_w = input_shape
|
378 |
+
kernel_h, kernel_w = self.kernel_size
|
379 |
+
stride_h, stride_w = self.stride
|
380 |
+
output_h = math.ceil(input_h / stride_h)
|
381 |
+
output_w = math.ceil(input_w / stride_w)
|
382 |
+
pad_h = max(
|
383 |
+
(output_h - 1) * stride_h + (kernel_h - 1) * self.dilation[0] + 1 - input_h,
|
384 |
+
0,
|
385 |
+
)
|
386 |
+
pad_w = max(
|
387 |
+
(output_w - 1) * stride_w + (kernel_w - 1) * self.dilation[1] + 1 - input_w,
|
388 |
+
0,
|
389 |
+
)
|
390 |
+
return pad_h, pad_w
|
391 |
+
|
392 |
+
def forward(self, x):
|
393 |
+
B, C, h, w = x.shape
|
394 |
+
|
395 |
+
pad_h, pad_w = self.get_pad_shape((h, w))
|
396 |
+
|
397 |
+
if pad_h > 0 or pad_w > 0:
|
398 |
+
if self.padding == "corner":
|
399 |
+
return F.pad(x, [0, pad_w, 0, pad_h]).view(
|
400 |
+
B, C, h + pad_h, w + pad_w
|
401 |
+
), (
|
402 |
+
h + pad_h,
|
403 |
+
w + pad_w,
|
404 |
+
)
|
405 |
+
elif self.padding == "same":
|
406 |
+
return F.pad(
|
407 |
+
x, [pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2]
|
408 |
+
).view(B, C, h + pad_h, w + pad_w), (
|
409 |
+
h + pad_h,
|
410 |
+
w + pad_w,
|
411 |
+
)
|
412 |
+
return x, (h, w)
|
413 |
+
|
414 |
+
|
415 |
+
class PatchEmbed(BaseModule):
|
416 |
+
"""Image to Patch Embedding.
|
417 |
+
We use a conv layer to implement PatchEmbed.
|
418 |
+
Args:
|
419 |
+
in_channels (int): The num of input channels. Default: 3
|
420 |
+
embed_dims (int): The dimensions of embedding. Default: 768
|
421 |
+
conv_type (str): The config dict for embedding
|
422 |
+
conv layer type selection. Default: "Conv2d.
|
423 |
+
kernel_size (int): The kernel_size of embedding conv. Default: 16.
|
424 |
+
stride (int): The slide stride of embedding conv.
|
425 |
+
Default: None (Would be set as `kernel_size`).
|
426 |
+
padding (int | tuple | string ): The padding length of
|
427 |
+
embedding conv. When it is a string, it means the mode
|
428 |
+
of adaptive padding, support "same" and "corner" now.
|
429 |
+
Default: "corner".
|
430 |
+
dilation (int): The dilation rate of embedding conv. Default: 1.
|
431 |
+
bias (bool): Bias of embed conv. Default: True.
|
432 |
+
norm_cfg (dict, optional): Config dict for normalization layer.
|
433 |
+
Default: None.
|
434 |
+
input_size (int | tuple | None): The size of input, which will be
|
435 |
+
used to calculate the out size. Only work when `dynamic_size`
|
436 |
+
is False. Default: None.
|
437 |
+
init_cfg (`mmcv.ConfigDict`, optional): The Config for initialization.
|
438 |
+
Default: None.
|
439 |
+
"""
|
440 |
+
|
441 |
+
def __init__(
|
442 |
+
self,
|
443 |
+
in_channels=3,
|
444 |
+
embed_dims=768,
|
445 |
+
conv_type="Conv2d",
|
446 |
+
kernel_size=16,
|
447 |
+
stride=16,
|
448 |
+
padding="corner",
|
449 |
+
dilation=1,
|
450 |
+
bias=True,
|
451 |
+
norm_cfg=None,
|
452 |
+
input_size=None,
|
453 |
+
init_cfg=None,
|
454 |
+
):
|
455 |
+
super(PatchEmbed, self).__init__()
|
456 |
+
|
457 |
+
self.embed_dims = embed_dims
|
458 |
+
if stride is None:
|
459 |
+
stride = kernel_size
|
460 |
+
|
461 |
+
kernel_size = to_2tuple(kernel_size)
|
462 |
+
stride = to_2tuple(stride)
|
463 |
+
dilation = to_2tuple(dilation)
|
464 |
+
|
465 |
+
if isinstance(padding, str):
|
466 |
+
self.adap_padding = AdaptivePadding(
|
467 |
+
kernel_size=kernel_size,
|
468 |
+
stride=stride,
|
469 |
+
dilation=dilation,
|
470 |
+
padding=padding,
|
471 |
+
)
|
472 |
+
# disable the padding of conv
|
473 |
+
padding = 0
|
474 |
+
else:
|
475 |
+
self.adap_padding = None
|
476 |
+
padding = to_2tuple(padding)
|
477 |
+
|
478 |
+
self.projection = build_conv_layer(
|
479 |
+
dict(type=conv_type),
|
480 |
+
in_channels=in_channels,
|
481 |
+
out_channels=embed_dims,
|
482 |
+
kernel_size=kernel_size,
|
483 |
+
stride=stride,
|
484 |
+
padding=padding,
|
485 |
+
dilation=dilation,
|
486 |
+
bias=bias,
|
487 |
+
)
|
488 |
+
|
489 |
+
if norm_cfg is not None:
|
490 |
+
self.norm = build_norm_layer(norm_cfg, embed_dims)[1]
|
491 |
+
else:
|
492 |
+
self.norm = None
|
493 |
+
|
494 |
+
if input_size:
|
495 |
+
input_size = to_2tuple(input_size)
|
496 |
+
# `init_out_size` would be used outside to
|
497 |
+
# calculate the num_patches
|
498 |
+
# when `use_abs_pos_embed` outside
|
499 |
+
self.init_input_size = input_size
|
500 |
+
if self.adap_padding:
|
501 |
+
pad_h, pad_w = self.adap_padding.get_pad_shape(input_size)
|
502 |
+
input_h, input_w = input_size
|
503 |
+
input_h = input_h + pad_h
|
504 |
+
input_w = input_w + pad_w
|
505 |
+
input_size = (input_h, input_w)
|
506 |
+
|
507 |
+
# https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
|
508 |
+
h_out = (
|
509 |
+
input_size[0] + 2 * padding[0] - dilation[0] * (kernel_size[0] - 1) - 1
|
510 |
+
) // stride[0] + 1
|
511 |
+
w_out = (
|
512 |
+
input_size[1] + 2 * padding[1] - dilation[1] * (kernel_size[1] - 1) - 1
|
513 |
+
) // stride[1] + 1
|
514 |
+
self.init_out_size = (h_out, w_out)
|
515 |
+
else:
|
516 |
+
self.init_input_size = None
|
517 |
+
self.init_out_size = None
|
518 |
+
|
519 |
+
def forward(self, x):
|
520 |
+
"""
|
521 |
+
Args:
|
522 |
+
x (Tensor): Has shape (B, C, H, W). In most case, C is 3.
|
523 |
+
Returns:
|
524 |
+
tuple: Contains merged results and its spatial shape.
|
525 |
+
- x (Tensor): Has shape (B, out_h * out_w, embed_dims)
|
526 |
+
- out_size (tuple[int]): Spatial shape of x, arrange as
|
527 |
+
(out_h, out_w).
|
528 |
+
"""
|
529 |
+
|
530 |
+
if self.adap_padding:
|
531 |
+
x, _ = self.adap_padding(x)
|
532 |
+
|
533 |
+
x = self.projection(x)
|
534 |
+
|
535 |
+
B, C, out_h, out_w = x.shape
|
536 |
+
|
537 |
+
x = x.view(B, C, out_h * out_w).transpose(1, 2)
|
538 |
+
|
539 |
+
if self.norm is not None:
|
540 |
+
x = self.norm(x)
|
541 |
+
return x, (out_h, out_w)
|
542 |
+
|
543 |
+
|
544 |
+
class PatchMerging(BaseModule):
|
545 |
+
"""Merge patch feature map.
|
546 |
+
This layer groups feature map by kernel_size, and applies norm and linear
|
547 |
+
layers to the grouped feature map. Our implementation uses `nn.Unfold` to
|
548 |
+
merge patch, which is about 25% faster than original implementation.
|
549 |
+
Instead, we need to modify pretrained models for compatibility.
|
550 |
+
Args:
|
551 |
+
in_channels (int): The num of input channels.
|
552 |
+
to gets fully covered by filter and stride you specified..
|
553 |
+
Default: True.
|
554 |
+
out_channels (int): The num of output channels.
|
555 |
+
kernel_size (int | tuple, optional): the kernel size in the unfold
|
556 |
+
layer. Defaults to 2.
|
557 |
+
stride (int | tuple, optional): the stride of the sliding blocks in the
|
558 |
+
unfold layer. Default: None. (Would be set as `kernel_size`)
|
559 |
+
padding (int | tuple | string ): The padding length of
|
560 |
+
embedding conv. When it is a string, it means the mode
|
561 |
+
of adaptive padding, support "same" and "corner" now.
|
562 |
+
Default: "corner".
|
563 |
+
dilation (int | tuple, optional): dilation parameter in the unfold
|
564 |
+
layer. Default: 1.
|
565 |
+
bias (bool, optional): Whether to add bias in linear layer or not.
|
566 |
+
Defaults: False.
|
567 |
+
norm_cfg (dict, optional): Config dict for normalization layer.
|
568 |
+
Default: dict(type='LN').
|
569 |
+
init_cfg (dict, optional): The extra config for initialization.
|
570 |
+
Default: None.
|
571 |
+
"""
|
572 |
+
|
573 |
+
def __init__(
|
574 |
+
self,
|
575 |
+
in_channels,
|
576 |
+
out_channels,
|
577 |
+
kernel_size=2,
|
578 |
+
stride=None,
|
579 |
+
padding="corner",
|
580 |
+
dilation=1,
|
581 |
+
bias=False,
|
582 |
+
norm_cfg=dict(type="LN"),
|
583 |
+
init_cfg=None,
|
584 |
+
):
|
585 |
+
super().__init__()
|
586 |
+
self.in_channels = in_channels
|
587 |
+
self.out_channels = out_channels
|
588 |
+
if stride:
|
589 |
+
stride = stride
|
590 |
+
else:
|
591 |
+
stride = kernel_size
|
592 |
+
|
593 |
+
kernel_size = to_2tuple(kernel_size)
|
594 |
+
stride = to_2tuple(stride)
|
595 |
+
dilation = to_2tuple(dilation)
|
596 |
+
|
597 |
+
if isinstance(padding, str):
|
598 |
+
self.adap_padding = AdaptivePadding(
|
599 |
+
kernel_size=kernel_size,
|
600 |
+
stride=stride,
|
601 |
+
dilation=dilation,
|
602 |
+
padding=padding,
|
603 |
+
)
|
604 |
+
# disable the padding of unfold
|
605 |
+
padding = 0
|
606 |
+
else:
|
607 |
+
self.adap_padding = None
|
608 |
+
|
609 |
+
padding = to_2tuple(padding)
|
610 |
+
self.sampler = nn.Unfold(
|
611 |
+
kernel_size=kernel_size, dilation=dilation, padding=padding, stride=stride
|
612 |
+
)
|
613 |
+
|
614 |
+
sample_dim = kernel_size[0] * kernel_size[1] * in_channels
|
615 |
+
|
616 |
+
if norm_cfg is not None:
|
617 |
+
self.norm = build_norm_layer(norm_cfg, sample_dim)[1]
|
618 |
+
else:
|
619 |
+
self.norm = None
|
620 |
+
|
621 |
+
self.reduction = nn.Linear(sample_dim, out_channels, bias=bias)
|
622 |
+
|
623 |
+
def forward(self, x, input_size):
|
624 |
+
"""
|
625 |
+
Args:
|
626 |
+
x (Tensor): Has shape (B, H*W, C_in).
|
627 |
+
input_size (tuple[int]): The spatial shape of x, arrange as (H, W).
|
628 |
+
Default: None.
|
629 |
+
Returns:
|
630 |
+
tuple: Contains merged results and its spatial shape.
|
631 |
+
- x (Tensor): Has shape (B, Merged_H * Merged_W, C_out)
|
632 |
+
- out_size (tuple[int]): Spatial shape of x, arrange as
|
633 |
+
(Merged_H, Merged_W).
|
634 |
+
"""
|
635 |
+
B, L, C = x.shape
|
636 |
+
assert isinstance(input_size, Sequence), (
|
637 |
+
f"Expect " f"input_size is " f"`Sequence` " f"but get {input_size}"
|
638 |
+
)
|
639 |
+
|
640 |
+
H, W = input_size
|
641 |
+
assert L == H * W, "input feature has wrong size"
|
642 |
+
|
643 |
+
x = x.view(B, H, W, C).permute([0, 3, 1, 2]) # B, C, H, W
|
644 |
+
# Use nn.Unfold to merge patch. About 25% faster than original method,
|
645 |
+
# but need to modify pretrained model for compatibility
|
646 |
+
|
647 |
+
if self.adap_padding:
|
648 |
+
x, (H, W) = self.adap_padding(x)
|
649 |
+
|
650 |
+
x = self.sampler(x)
|
651 |
+
# if kernel_size=2 and stride=2, x should has shape (B, 4*C, H/2*W/2)
|
652 |
+
|
653 |
+
out_h = (
|
654 |
+
H
|
655 |
+
+ 2 * self.sampler.padding[0]
|
656 |
+
- self.sampler.dilation[0] * (self.sampler.kernel_size[0] - 1)
|
657 |
+
- 1
|
658 |
+
) // self.sampler.stride[0] + 1
|
659 |
+
out_w = (
|
660 |
+
W
|
661 |
+
+ 2 * self.sampler.padding[1]
|
662 |
+
- self.sampler.dilation[1] * (self.sampler.kernel_size[1] - 1)
|
663 |
+
- 1
|
664 |
+
) // self.sampler.stride[1] + 1
|
665 |
+
|
666 |
+
x = x.view(B, C * H * W // (out_h * out_w), out_h * out_w)
|
667 |
+
|
668 |
+
output_size = (out_h, out_w)
|
669 |
+
x = x.transpose(1, 2) # B, H/2*W/2, 4*C
|
670 |
+
x = self.norm(x) if self.norm else x
|
671 |
+
x = self.reduction(x)
|
672 |
+
return x, output_size
|
673 |
+
|
674 |
+
|
675 |
+
class WindowMSA(BaseModule):
|
676 |
+
"""Window based multi-head self-attention (W-MSA) module with relative
|
677 |
+
position bias.
|
678 |
+
Args:
|
679 |
+
embed_dims (int): Number of input channels.
|
680 |
+
num_heads (int): Number of attention heads.
|
681 |
+
window_size (tuple[int]): The height and width of the window.
|
682 |
+
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
|
683 |
+
Default: True.
|
684 |
+
qk_scale (float | None, optional): Override default qk scale of
|
685 |
+
head_dim ** -0.5 if set. Default: None.
|
686 |
+
attn_drop_rate (float, optional): Dropout ratio of attention weight.
|
687 |
+
Default: 0.0
|
688 |
+
proj_drop_rate (float, optional): Dropout ratio of output. Default: 0.
|
689 |
+
init_cfg (dict | None, optional): The Config for initialization.
|
690 |
+
Default: None.
|
691 |
+
"""
|
692 |
+
|
693 |
+
def __init__(
|
694 |
+
self,
|
695 |
+
embed_dims,
|
696 |
+
num_heads,
|
697 |
+
window_size,
|
698 |
+
qkv_bias=True,
|
699 |
+
qk_scale=None,
|
700 |
+
attn_drop_rate=0.0,
|
701 |
+
proj_drop_rate=0.0,
|
702 |
+
init_cfg=None,
|
703 |
+
):
|
704 |
+
super().__init__()
|
705 |
+
self.embed_dims = embed_dims
|
706 |
+
self.window_size = window_size # Wh, Ww
|
707 |
+
self.num_heads = num_heads
|
708 |
+
head_embed_dims = embed_dims // num_heads
|
709 |
+
self.scale = qk_scale or head_embed_dims**-0.5
|
710 |
+
self.init_cfg = init_cfg
|
711 |
+
|
712 |
+
# define a parameter table of relative position bias
|
713 |
+
self.relative_position_bias_table = nn.Parameter(
|
714 |
+
torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads)
|
715 |
+
) # 2*Wh-1 * 2*Ww-1, nH
|
716 |
+
|
717 |
+
# About 2x faster than original impl
|
718 |
+
Wh, Ww = self.window_size
|
719 |
+
rel_index_coords = self.double_step_seq(2 * Ww - 1, Wh, 1, Ww)
|
720 |
+
rel_position_index = rel_index_coords + rel_index_coords.T
|
721 |
+
rel_position_index = rel_position_index.flip(1).contiguous()
|
722 |
+
self.register_buffer("relative_position_index", rel_position_index)
|
723 |
+
|
724 |
+
self.qkv = nn.Linear(embed_dims, embed_dims * 3, bias=qkv_bias)
|
725 |
+
self.attn_drop = nn.Dropout(attn_drop_rate)
|
726 |
+
self.proj = nn.Linear(embed_dims, embed_dims)
|
727 |
+
self.proj_drop = nn.Dropout(proj_drop_rate)
|
728 |
+
|
729 |
+
self.softmax = nn.Softmax(dim=-1)
|
730 |
+
|
731 |
+
def init_weights(self):
|
732 |
+
trunc_normal_(self.relative_position_bias_table, std=0.02)
|
733 |
+
|
734 |
+
def forward(self, x, mask, N, C, nW):
|
735 |
+
"""
|
736 |
+
Args:
|
737 |
+
x (tensor): input features with shape of (nW*B, N, C)
|
738 |
+
mask (tensor | None, Optional): mask with shape of (nW,
|
739 |
+
Wh*Ww, Wh*Ww), value should be between (-inf, 0].
|
740 |
+
"""
|
741 |
+
nWB = x.shape[0]
|
742 |
+
|
743 |
+
qkv = (
|
744 |
+
self.qkv(x)
|
745 |
+
.reshape(x.shape[0], N, 3, self.num_heads, C // self.num_heads)
|
746 |
+
.permute(2, 0, 3, 1, 4)
|
747 |
+
)
|
748 |
+
# make torchscript happy (cannot use tensor as tuple)
|
749 |
+
q, k, v = qkv[0], qkv[1], qkv[2]
|
750 |
+
|
751 |
+
q = q * self.scale
|
752 |
+
attn = q @ k.transpose(-2, -1)
|
753 |
+
|
754 |
+
relative_position_bias = self.relative_position_bias_table[
|
755 |
+
self.relative_position_index.view(
|
756 |
+
(
|
757 |
+
self.window_size[0]
|
758 |
+
* self.window_size[1]
|
759 |
+
* self.window_size[0]
|
760 |
+
* self.window_size[1],
|
761 |
+
)
|
762 |
+
)
|
763 |
+
].view(
|
764 |
+
self.window_size[0] * self.window_size[1],
|
765 |
+
self.window_size[0] * self.window_size[1],
|
766 |
+
self.num_heads,
|
767 |
+
) # Wh*Ww,Wh*Ww,nH
|
768 |
+
|
769 |
+
relative_position_bias = relative_position_bias.permute(
|
770 |
+
2, 0, 1
|
771 |
+
).contiguous() # nH, Wh*Ww, Wh*Ww
|
772 |
+
attn = attn + relative_position_bias.unsqueeze(0)
|
773 |
+
|
774 |
+
if mask is not None:
|
775 |
+
nW = mask.shape[0]
|
776 |
+
attn = attn.view(nWB // nW, nW, self.num_heads, N, N) + mask.unsqueeze(
|
777 |
+
1
|
778 |
+
).unsqueeze(0)
|
779 |
+
attn = attn.view(nWB, self.num_heads, N, N)
|
780 |
+
attn = self.softmax(attn)
|
781 |
+
|
782 |
+
attn = self.attn_drop(attn)
|
783 |
+
|
784 |
+
x = (attn @ v).transpose(1, 2).reshape(nWB, N, C)
|
785 |
+
x = self.proj(x)
|
786 |
+
x = self.proj_drop(x)
|
787 |
+
return x
|
788 |
+
|
789 |
+
@staticmethod
|
790 |
+
def double_step_seq(step1, len1, step2, len2):
|
791 |
+
seq1 = torch.arange(0, step1 * len1, step1)
|
792 |
+
seq2 = torch.arange(0, step2 * len2, step2)
|
793 |
+
return (seq1[:, None] + seq2[None, :]).reshape(1, -1)
|
794 |
+
|
795 |
+
|
796 |
+
class ShiftWindowMSA(BaseModule):
|
797 |
+
"""Shifted Window Multihead Self-Attention Module.
|
798 |
+
Args:
|
799 |
+
embed_dims (int): Number of input channels.
|
800 |
+
num_heads (int): Number of attention heads.
|
801 |
+
window_size (int): The height and width of the window.
|
802 |
+
shift_size (int, optional): The shift step of each window towards
|
803 |
+
right-bottom. If zero, act as regular window-msa. Defaults to 0.
|
804 |
+
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
|
805 |
+
Default: True
|
806 |
+
qk_scale (float | None, optional): Override default qk scale of
|
807 |
+
head_dim ** -0.5 if set. Defaults: None.
|
808 |
+
attn_drop_rate (float, optional): Dropout ratio of attention weight.
|
809 |
+
Defaults: 0.
|
810 |
+
proj_drop_rate (float, optional): Dropout ratio of output.
|
811 |
+
Defaults: 0.
|
812 |
+
dropout_layer (dict, optional): The dropout_layer used before output.
|
813 |
+
Defaults: dict(type='DropPath', drop_prob=0.).
|
814 |
+
init_cfg (dict, optional): The extra config for initialization.
|
815 |
+
Default: None.
|
816 |
+
"""
|
817 |
+
|
818 |
+
def __init__(
|
819 |
+
self,
|
820 |
+
embed_dims,
|
821 |
+
num_heads,
|
822 |
+
window_size,
|
823 |
+
shift_size=0,
|
824 |
+
qkv_bias=True,
|
825 |
+
qk_scale=None,
|
826 |
+
attn_drop_rate=0,
|
827 |
+
proj_drop_rate=0,
|
828 |
+
dropout_layer=dict(type="DropPath", drop_prob=0.0),
|
829 |
+
init_cfg=None,
|
830 |
+
):
|
831 |
+
super().__init__()
|
832 |
+
|
833 |
+
self.window_size = window_size
|
834 |
+
self.shift_size = shift_size
|
835 |
+
|
836 |
+
self.h_slices = (
|
837 |
+
slice(0, -self.window_size),
|
838 |
+
slice(-self.window_size, -self.shift_size),
|
839 |
+
slice(-self.shift_size, None),
|
840 |
+
)
|
841 |
+
self.w_slices = (
|
842 |
+
slice(0, -self.window_size),
|
843 |
+
slice(-self.window_size, -self.shift_size),
|
844 |
+
slice(-self.shift_size, None),
|
845 |
+
)
|
846 |
+
|
847 |
+
assert 0 <= self.shift_size < self.window_size
|
848 |
+
|
849 |
+
self.w_msa = WindowMSA(
|
850 |
+
embed_dims=embed_dims,
|
851 |
+
num_heads=num_heads,
|
852 |
+
window_size=to_2tuple(window_size),
|
853 |
+
qkv_bias=qkv_bias,
|
854 |
+
qk_scale=qk_scale,
|
855 |
+
attn_drop_rate=attn_drop_rate,
|
856 |
+
proj_drop_rate=proj_drop_rate,
|
857 |
+
init_cfg=None,
|
858 |
+
)
|
859 |
+
|
860 |
+
self.drop = build_dropout(dropout_layer)
|
861 |
+
|
862 |
+
def forward(self, query, hw_shape):
|
863 |
+
B, L, C = query.shape
|
864 |
+
H, W = hw_shape
|
865 |
+
assert L == H * W, "input feature has wrong size"
|
866 |
+
query = query.view(-1, H, W, C)
|
867 |
+
|
868 |
+
# pad feature maps to multiples of window size
|
869 |
+
pad_r = (self.window_size - W % self.window_size) % self.window_size
|
870 |
+
pad_b = (self.window_size - H % self.window_size) % self.window_size
|
871 |
+
|
872 |
+
query = F.pad(query, (0, 0, 0, pad_r, 0, pad_b))
|
873 |
+
|
874 |
+
H_pad = H + pad_b
|
875 |
+
W_pad = W + pad_r
|
876 |
+
|
877 |
+
N = self.window_size**2
|
878 |
+
nW = H_pad * W_pad // N
|
879 |
+
|
880 |
+
# cyclic shift
|
881 |
+
if self.shift_size > 0:
|
882 |
+
shifted_query = torch.roll(
|
883 |
+
query, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2)
|
884 |
+
)
|
885 |
+
|
886 |
+
# calculate attention mask for SW-MSA
|
887 |
+
img_mask = torch.zeros((1, H_pad, W_pad, 1), device=query.device)
|
888 |
+
cnt = 0
|
889 |
+
for h in self.h_slices:
|
890 |
+
for w in self.w_slices:
|
891 |
+
img_mask[:, h, w, :] = cnt
|
892 |
+
cnt += 1
|
893 |
+
|
894 |
+
# nW, window_size, window_size, 1
|
895 |
+
mask_windows = self.window_partition(img_mask, H_pad, W_pad, 1, nW)
|
896 |
+
mask_windows = mask_windows.view(nW, N)
|
897 |
+
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
|
898 |
+
attn_mask = attn_mask.masked_fill(
|
899 |
+
attn_mask != 0, float(-100.0)
|
900 |
+
).masked_fill(attn_mask == 0, float(0.0))
|
901 |
+
else:
|
902 |
+
shifted_query = query
|
903 |
+
attn_mask = None
|
904 |
+
|
905 |
+
# nW*B, window_size, window_size, C
|
906 |
+
query_windows = self.window_partition(shifted_query, H_pad, W_pad, C, nW)
|
907 |
+
|
908 |
+
# nW*B, window_size*window_size, C
|
909 |
+
query_windows = query_windows.view(-1, N, C)
|
910 |
+
|
911 |
+
# W-MSA/SW-MSA (nW*B, window_size*window_size, C)
|
912 |
+
attn_windows = self.w_msa(query_windows, attn_mask, N, C, nW)
|
913 |
+
|
914 |
+
# merge windows
|
915 |
+
attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
|
916 |
+
|
917 |
+
# B H' W' C
|
918 |
+
shifted_x = self.window_reverse(attn_windows, H_pad, W_pad, C, nW)
|
919 |
+
# reverse cyclic shift
|
920 |
+
if self.shift_size > 0:
|
921 |
+
x = torch.roll(
|
922 |
+
shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2)
|
923 |
+
)
|
924 |
+
else:
|
925 |
+
x = shifted_x
|
926 |
+
|
927 |
+
if pad_r > 0 or pad_b:
|
928 |
+
x = x[:, :H, :W, :].contiguous()
|
929 |
+
|
930 |
+
x = x.view(-1, H * W, C)
|
931 |
+
|
932 |
+
x = self.drop(x)
|
933 |
+
return x
|
934 |
+
|
935 |
+
def window_reverse(self, windows, H, W, C, nW):
|
936 |
+
"""
|
937 |
+
Args:
|
938 |
+
windows: (nW*B, window_size, window_size, C)
|
939 |
+
H (int): Height of image
|
940 |
+
W (int): Width of image
|
941 |
+
Returns:
|
942 |
+
x: (B, H, W, C)
|
943 |
+
"""
|
944 |
+
window_size = self.window_size
|
945 |
+
x = windows.view(
|
946 |
+
-1, H // window_size, W // window_size, window_size, window_size, C
|
947 |
+
)
|
948 |
+
x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, H, W, C)
|
949 |
+
return x
|
950 |
+
|
951 |
+
def window_partition(self, x, H, W, C, nW):
|
952 |
+
"""
|
953 |
+
Args:
|
954 |
+
x: (B, H, W, C)
|
955 |
+
Returns:
|
956 |
+
windows: (nW*B, window_size, window_size, C)
|
957 |
+
"""
|
958 |
+
window_size = self.window_size
|
959 |
+
x = x.view(
|
960 |
+
-1,
|
961 |
+
H // window_size,
|
962 |
+
window_size,
|
963 |
+
W // window_size,
|
964 |
+
window_size,
|
965 |
+
C,
|
966 |
+
)
|
967 |
+
windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
|
968 |
+
windows = windows.view(-1, window_size, window_size, C)
|
969 |
+
return windows
|
970 |
+
|
971 |
+

class SwinBlock(BaseModule):
    """Swin Transformer block.

    Args:
        embed_dims (int): The feature dimension.
        num_heads (int): Parallel attention heads.
        feedforward_channels (int): The hidden dimension for FFNs.
        window_size (int, optional): The local window scale. Default: 7.
        shift (bool, optional): whether to shift window or not. Default: False.
        qkv_bias (bool, optional): enable bias for qkv if True. Default: True.
        qk_scale (float | None, optional): Override default qk scale of
            head_dim ** -0.5 if set. Default: None.
        drop_rate (float, optional): Dropout rate. Default: 0.
        attn_drop_rate (float, optional): Attention dropout rate. Default: 0.
        drop_path_rate (float, optional): Stochastic depth rate. Default: 0.
        act_cfg (dict, optional): The config dict of activation function.
            Default: dict(type='GELU').
        norm_cfg (dict, optional): The config dict of normalization.
            Default: dict(type='LN').
        with_cp (bool, optional): Use checkpoint or not. Using checkpoint
            will save some memory while slowing down the training speed.
            Default: False.
        init_cfg (dict | list | None, optional): The init config.
            Default: None.
    """

    def __init__(
        self,
        embed_dims,
        num_heads,
        feedforward_channels,
        window_size=7,
        shift=False,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.0,
        act_cfg=dict(type="GELU"),
        norm_cfg=dict(type="LN"),
        with_cp=False,
        init_cfg=None,
    ):
        super(SwinBlock, self).__init__()

        self.init_cfg = init_cfg
        self.with_cp = with_cp

        self.norm1 = build_norm_layer(norm_cfg, embed_dims)[1]
        self.attn = ShiftWindowMSA(
            embed_dims=embed_dims,
            num_heads=num_heads,
            window_size=window_size,
            shift_size=window_size // 2 if shift else 0,
            qkv_bias=qkv_bias,
            qk_scale=qk_scale,
            attn_drop_rate=attn_drop_rate,
            proj_drop_rate=drop_rate,
            dropout_layer=dict(type="DropPath", drop_prob=drop_path_rate),
            init_cfg=None,
        )

        self.norm2 = build_norm_layer(norm_cfg, embed_dims)[1]
        self.ffn = FFN(
            embed_dims=embed_dims,
            feedforward_channels=feedforward_channels,
            num_fcs=2,
            ffn_drop=drop_rate,
            dropout_layer=dict(type="DropPath", drop_prob=drop_path_rate),
            act_cfg=act_cfg,
            add_identity=True,
            init_cfg=None,
        )

    def forward(self, x, hw_shape):
        def _inner_forward(x):
            identity = x
            x = self.norm1(x)
            x = self.attn(x, hw_shape)

            x = x + identity

            identity = x
            x = self.norm2(x)
            x = self.ffn(x, identity=identity)

            return x

        if self.with_cp and x.requires_grad:
            x = cp.checkpoint(_inner_forward, x)
        else:
            x = _inner_forward(x)

        return x
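
# Each SwinBlock above applies a pre-norm residual pair, x = x + W-MSA(LN(x)) followed
# by x = x + FFN(LN(x)); gradient checkpointing (with_cp) trades compute for memory.
# SwinBlockSequence below stacks `depth` such blocks, alternating regular and shifted
# windows, and optionally downsamples at the end of the stage.
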

class SwinBlockSequence(BaseModule):
    """Implements one stage in Swin Transformer.

    Args:
        embed_dims (int): The feature dimension.
        num_heads (int): Parallel attention heads.
        feedforward_channels (int): The hidden dimension for FFNs.
        depth (int): The number of blocks in this stage.
        window_size (int, optional): The local window scale. Default: 7.
        qkv_bias (bool, optional): enable bias for qkv if True. Default: True.
        qk_scale (float | None, optional): Override default qk scale of
            head_dim ** -0.5 if set. Default: None.
        drop_rate (float, optional): Dropout rate. Default: 0.
        attn_drop_rate (float, optional): Attention dropout rate. Default: 0.
        drop_path_rate (float | list[float], optional): Stochastic depth
            rate. Default: 0.
        downsample (BaseModule | None, optional): The downsample operation
            module. Default: None.
        act_cfg (dict, optional): The config dict of activation function.
            Default: dict(type='GELU').
        norm_cfg (dict, optional): The config dict of normalization.
            Default: dict(type='LN').
        with_cp (bool, optional): Use checkpoint or not. Using checkpoint
            will save some memory while slowing down the training speed.
            Default: False.
        init_cfg (dict | list | None, optional): The init config.
            Default: None.
    """

    def __init__(
        self,
        embed_dims,
        num_heads,
        feedforward_channels,
        depth,
        window_size=7,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.0,
        downsample=None,
        act_cfg=dict(type="GELU"),
        norm_cfg=dict(type="LN"),
        with_cp=False,
        init_cfg=None,
    ):
        super().__init__()

        if isinstance(drop_path_rate, list):
            drop_path_rates = drop_path_rate
            assert len(drop_path_rates) == depth
        else:
            drop_path_rates = [deepcopy(drop_path_rate) for _ in range(depth)]

        self.blocks = ModuleList()
        for i in range(depth):
            block = SwinBlock(
                embed_dims=embed_dims,
                num_heads=num_heads,
                feedforward_channels=feedforward_channels,
                window_size=window_size,
                shift=False if i % 2 == 0 else True,
                qkv_bias=qkv_bias,
                qk_scale=qk_scale,
                drop_rate=drop_rate,
                attn_drop_rate=attn_drop_rate,
                drop_path_rate=drop_path_rates[i],
                act_cfg=act_cfg,
                norm_cfg=norm_cfg,
                with_cp=with_cp,
                init_cfg=None,
            )
            self.blocks.append(block)

        self.downsample = downsample

    def forward(self, x, hw_shape):
        for block in self.blocks:
            x = block(x, hw_shape)

        if self.downsample:
            x_down, down_hw_shape = self.downsample(x, hw_shape)
            return x_down, down_hw_shape, x, hw_shape
        else:
            return x, hw_shape, x, hw_shape
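
# SwinBlockSequence.forward returns both the downsampled tensor (fed to the next
# stage) and the pre-downsample output, so SwinTransformer can normalise and expose
# per-stage features while continuing the hierarchy.
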

class SwinTransformer(BaseModule):
    """Swin Transformer.

    A PyTorch implementation of `Swin Transformer:
    Hierarchical Vision Transformer using Shifted Windows` -
    https://arxiv.org/abs/2103.14030

    Inspiration from
    https://github.com/microsoft/Swin-Transformer

    Args:
        pretrain_img_size (int | tuple[int]): The size of input image when
            pretrain. Defaults: 224.
        in_channels (int): The num of input channels.
            Defaults: 3.
        embed_dims (int): The feature dimension. Default: 96.
        patch_size (int | tuple[int]): Patch size. Default: 4.
        window_size (int): Window size. Default: 7.
        mlp_ratio (int): Ratio of mlp hidden dim to embedding dim.
            Default: 4.
        depths (tuple[int]): Depths of each Swin Transformer stage.
            Default: (2, 2, 6, 2).
        num_heads (tuple[int]): Parallel attention heads of each Swin
            Transformer stage. Default: (3, 6, 12, 24).
        strides (tuple[int]): The patch merging or patch embedding stride of
            each Swin Transformer stage. (In swin, we set kernel size equal to
            stride.) Default: (4, 2, 2, 2).
        out_indices (tuple[int]): Output from which stages.
            Default: (0, 1, 2, 3).
        qkv_bias (bool, optional): If True, add a learnable bias to query, key,
            value. Default: True.
        qk_scale (float | None, optional): Override default qk scale of
            head_dim ** -0.5 if set. Default: None.
        patch_norm (bool): If add a norm layer for patch embed and patch
            merging. Default: True.
        drop_rate (float): Dropout rate. Defaults: 0.
        attn_drop_rate (float): Attention dropout rate. Default: 0.
        drop_path_rate (float): Stochastic depth rate. Defaults: 0.1.
        use_abs_pos_embed (bool): If True, add absolute position embedding to
            the patch embedding. Defaults: False.
        act_cfg (dict): Config dict for activation layer.
            Default: dict(type='GELU').
        norm_cfg (dict): Config dict for normalization layer at
            output of backbone. Defaults: dict(type='LN').
        with_cp (bool, optional): Use checkpoint or not. Using checkpoint
            will save some memory while slowing down the training speed.
            Default: False.
        pretrained (str, optional): model pretrained path. Default: None.
        convert_weights (bool): The flag indicates whether the
            pre-trained model is from the original repo. We may need
            to convert some keys to make it compatible.
            Default: False.
        frozen_stages (int): Stages to be frozen (stop grad and set eval mode).
            -1 means not freezing any parameters.
        init_cfg (dict, optional): The Config for initialization.
            Defaults to None.
    """

    def __init__(
        self,
        pretrain_img_size=224,
        in_channels=3,
        embed_dims=96,
        patch_size=4,
        window_size=7,
        mlp_ratio=4,
        depths=(2, 2, 6, 2),
        num_heads=(3, 6, 12, 24),
        strides=(4, 2, 2, 2),
        out_indices=(0, 1, 2, 3),
        qkv_bias=True,
        qk_scale=None,
        patch_norm=True,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.1,
        use_abs_pos_embed=False,
        act_cfg=dict(type="GELU"),
        norm_cfg=dict(type="LN"),
        with_cp=False,
        pretrained=None,
        convert_weights=False,
        frozen_stages=-1,
        init_cfg=None,
        semantic_weight=0.0,
    ):
        self.convert_weights = convert_weights
        self.frozen_stages = frozen_stages
        if isinstance(pretrain_img_size, int):
            pretrain_img_size = to_2tuple(pretrain_img_size)
        elif isinstance(pretrain_img_size, tuple):
            if len(pretrain_img_size) == 1:
                pretrain_img_size = to_2tuple(pretrain_img_size[0])
            assert len(pretrain_img_size) == 2, (
                f"The size of image should have length 1 or 2, "
                f"but got {len(pretrain_img_size)}"
            )

        assert not (
            init_cfg and pretrained
        ), "init_cfg and pretrained cannot be specified at the same time"
        if isinstance(pretrained, str):
            warnings.warn(
                "DeprecationWarning: pretrained is deprecated, "
                'please use "init_cfg" instead'
            )
            self.init_cfg = dict(type="Pretrained", checkpoint=pretrained)
        elif pretrained is None:
            self.init_cfg = init_cfg
        else:
            raise TypeError("pretrained must be a str or None")

        super(SwinTransformer, self).__init__()

        num_layers = len(depths)
        self.out_indices = out_indices
        self.use_abs_pos_embed = use_abs_pos_embed

        assert strides[0] == patch_size, "Use non-overlapping patch embed."

        self.patch_embed = PatchEmbed(
            in_channels=in_channels,
            embed_dims=embed_dims,
            conv_type="Conv2d",
            kernel_size=patch_size,
            stride=strides[0],
            norm_cfg=norm_cfg if patch_norm else None,
            init_cfg=None,
        )

        if self.use_abs_pos_embed:
            patch_row = pretrain_img_size[0] // patch_size
            patch_col = pretrain_img_size[1] // patch_size
            num_patches = patch_row * patch_col
            self.absolute_pos_embed = nn.Parameter(
                torch.zeros((1, num_patches, embed_dims))
            )

        self.drop_after_pos = nn.Dropout(p=drop_rate)

        # set stochastic depth decay rule
        total_depth = sum(depths)
        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, total_depth)]

        self.stages = ModuleList()
        in_channels = embed_dims
        for i in range(num_layers):
            if i < num_layers - 1:
                downsample = PatchMerging(
                    in_channels=in_channels,
                    out_channels=2 * in_channels,
                    stride=strides[i + 1],
                    norm_cfg=norm_cfg if patch_norm else None,
                    init_cfg=None,
                )
            else:
                downsample = None

            stage = SwinBlockSequence(
                embed_dims=in_channels,
                num_heads=num_heads[i],
                feedforward_channels=mlp_ratio * in_channels,
                depth=depths[i],
                window_size=window_size,
                qkv_bias=qkv_bias,
                qk_scale=qk_scale,
                drop_rate=drop_rate,
                attn_drop_rate=attn_drop_rate,
                drop_path_rate=dpr[sum(depths[:i]) : sum(depths[: i + 1])],
                downsample=downsample,
                act_cfg=act_cfg,
                norm_cfg=norm_cfg,
                with_cp=with_cp,
                init_cfg=None,
            )
            self.stages.append(stage)
            if downsample:
                in_channels = downsample.out_channels

        self.num_features = [int(embed_dims * 2**i) for i in range(num_layers)]
        # Add a norm layer for each output
        for i in out_indices:
            layer = build_norm_layer(norm_cfg, self.num_features[i])[1]
            layer_name = f"norm{i}"
            self.add_module(layer_name, layer)

        # self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.avgpool = nn.AdaptiveAvgPool1d(1)

        # semantic embedding
        self.semantic_weight = semantic_weight
        if self.semantic_weight >= 0:
            self.semantic_embed_w = ModuleList()
            self.semantic_embed_b = ModuleList()
            for i in range(len(depths)):
                if i >= len(depths) - 1:
                    i = len(depths) - 2
                semantic_embed_w = nn.Linear(2, self.num_features[i + 1])
                semantic_embed_b = nn.Linear(2, self.num_features[i + 1])
                # TODO: Test with semantic embed unfreeze
                for param in semantic_embed_w.parameters():
                    param.requires_grad = False
                for param in semantic_embed_b.parameters():
                    param.requires_grad = False
                trunc_normal_init(semantic_embed_w, std=0.02, bias=0.0)
                trunc_normal_init(semantic_embed_b, std=0.02, bias=0.0)
                self.semantic_embed_w.append(semantic_embed_w)
                self.semantic_embed_b.append(semantic_embed_b)
            self.softplus = nn.Softplus()

    def train(self, mode=True):
        """Convert the model into training mode while keeping layers frozen."""
        super(SwinTransformer, self).train(mode)
        self._freeze_stages()

    def _freeze_stages(self):
        if self.frozen_stages >= 0:
            self.patch_embed.eval()
            for param in self.patch_embed.parameters():
                param.requires_grad = False
            if self.use_abs_pos_embed:
                self.absolute_pos_embed.requires_grad = False
            self.drop_after_pos.eval()

        for i in range(1, self.frozen_stages + 1):
            if (i - 1) in self.out_indices:
                norm_layer = getattr(self, f"norm{i-1}")
                norm_layer.eval()
                for param in norm_layer.parameters():
                    param.requires_grad = False

            m = self.stages[i - 1]
            m.eval()
            for param in m.parameters():
                param.requires_grad = False

    def init_weights(self, pretrained=None):
        logger = logging.getLogger("loading parameters.")
        if pretrained is None:
            logger.warning(
                f"No pre-trained weights for "
                f"{self.__class__.__name__}, "
                f"training starts from scratch"
            )
            if self.use_abs_pos_embed:
                trunc_normal_(self.absolute_pos_embed, std=0.02)
            for m in self.modules():
                if isinstance(m, nn.Linear):
                    trunc_normal_init(m, std=0.02, bias=0.0)
                elif isinstance(m, nn.LayerNorm):
                    constant_init(m.bias, 0)
                    constant_init(m.weight, 1.0)
        else:
            ckpt = torch.load(pretrained, map_location="cpu")
            if "teacher" in ckpt:
                ckpt = ckpt["teacher"]

            if "state_dict" in ckpt:
                _state_dict = ckpt["state_dict"]
            elif "model" in ckpt:
                _state_dict = ckpt["model"]
            else:
                _state_dict = ckpt
            if self.convert_weights:
                # support loading weights from the original repo
                _state_dict = swin_converter(_state_dict)

            state_dict = OrderedDict()
            for k, v in _state_dict.items():
                if k.startswith("backbone."):
                    state_dict[k[9:]] = v

            # strip prefix of state_dict
            if list(state_dict.keys())[0].startswith("module."):
                state_dict = {k[7:]: v for k, v in state_dict.items()}

            # reshape absolute position embedding
            if state_dict.get("absolute_pos_embed") is not None:
                absolute_pos_embed = state_dict["absolute_pos_embed"]
                N1, L, C1 = absolute_pos_embed.size()
                N2, C2, H, W = self.absolute_pos_embed.size()
                if N1 != N2 or C1 != C2 or L != H * W:
                    logger.warning("Error in loading absolute_pos_embed, pass")
                else:
                    state_dict["absolute_pos_embed"] = (
                        absolute_pos_embed.view(N2, H, W, C2)
                        .permute(0, 3, 1, 2)
                        .contiguous()
                    )

            # interpolate position bias table if needed
            relative_position_bias_table_keys = [
                k for k in state_dict.keys() if "relative_position_bias_table" in k
            ]
            for table_key in relative_position_bias_table_keys:
                table_pretrained = state_dict[table_key]
                table_current = self.state_dict()[table_key]
                L1, nH1 = table_pretrained.size()
                L2, nH2 = table_current.size()
                if nH1 != nH2:
                    logger.warning(f"Error in loading {table_key}, pass")
                elif L1 != L2:
                    S1 = int(L1**0.5)
                    S2 = int(L2**0.5)
                    table_pretrained_resized = F.interpolate(
                        table_pretrained.permute(1, 0).reshape(1, nH1, S1, S1),
                        size=(S2, S2),
                        mode="bicubic",
                    )
                    state_dict[table_key] = (
                        table_pretrained_resized.view(nH2, L2)
                        .permute(1, 0)
                        .contiguous()
                    )

            res = self.load_state_dict(state_dict, False)
            print("unloaded parameters:", res)

    def forward(self, x, semantic_weight=None):
        if self.semantic_weight >= 0 and semantic_weight is None:
            w = torch.ones(x.shape[0], 1) * self.semantic_weight
            w = torch.cat([w, 1 - w], axis=-1)
            semantic_weight = w.to(x.device)

        x, hw_shape = self.patch_embed(x)

        if self.use_abs_pos_embed:
            x = x + self.absolute_pos_embed
        x = self.drop_after_pos(x)

        outs = []
        for i, stage in enumerate(self.stages):
            x, hw_shape, out, out_hw_shape = stage(x, hw_shape)
            if self.semantic_weight >= 0:
                sw = self.semantic_embed_w[i](semantic_weight).unsqueeze(1)
                sb = self.semantic_embed_b[i](semantic_weight).unsqueeze(1)
                x = x * self.softplus(sw) + sb
            if i in self.out_indices:
                norm_layer = getattr(self, f"norm{i}")
                out = norm_layer(out)
                # out = (
                #     out.view(-1, out_hw_shape[0], out_hw_shape[1], self.num_features[i])
                #     .permute(0, 3, 1, 2)
                #     .contiguous()
                # )
                outs.append(out)

        x = outs[-1]

        x_cls = self.avgpool(x.transpose(1, 2))  # B C 1

        x = torch.cat([x_cls.transpose(1, 2), x], dim=1)

        return x
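
# SwinTransformer.forward returns a (B, 1 + H/32 * W/32, C_last) token sequence: the
# final-stage tokens with a global average-pooled token prepended at index 0. The
# factory functions below only differ in embed_dims, depths and num_heads.
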

def swin_base_patch4_window7_224(
    img_size=224, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, **kwargs
):
    model = SwinTransformer(
        pretrain_img_size=img_size,
        patch_size=4,
        window_size=7,
        embed_dims=128,
        depths=(2, 2, 18, 2),
        num_heads=(4, 8, 16, 32),
        drop_path_rate=drop_path_rate,
        drop_rate=drop_rate,
        attn_drop_rate=attn_drop_rate,
        **kwargs,
    )
    return model


def swin_small_patch4_window7_224(
    img_size=224, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, **kwargs
):
    model = SwinTransformer(
        pretrain_img_size=img_size,
        patch_size=4,
        window_size=7,
        embed_dims=96,
        depths=(2, 2, 18, 2),
        num_heads=(3, 6, 12, 24),
        drop_path_rate=drop_path_rate,
        drop_rate=drop_rate,
        attn_drop_rate=attn_drop_rate,
        **kwargs,
    )
    return model


def swin_tiny_patch4_window7_224(
    img_size=224, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, **kwargs
):
    model = SwinTransformer(
        pretrain_img_size=img_size,
        patch_size=4,
        window_size=7,
        embed_dims=96,
        depths=(2, 2, 6, 2),
        num_heads=(3, 6, 12, 24),
        drop_path_rate=drop_path_rate,
        drop_rate=drop_rate,
        attn_drop_rate=attn_drop_rate,
        **kwargs,
    )
    return model


def build_solider(cfg: dict) -> SwinTransformer:
    name = cfg["name"]
    img_size = cfg["img_size"]
    # drop_path_rate = cfg["drop_path_rate"]
    # TODO: Test with drop_path_rate = 0.0
    drop_path_rate = 0.1
    # drop_rate = cfg["drop_rate"]
    drop_rate = 0.0
    # attn_drop_rate = cfg["attn_drop_rate"]
    attn_drop_rate = 0.0
    pretrained = cfg["pretrained"]
    # convert_weights = cfg["convert_weights"]
    convert_weights = False
    semantic_weight = cfg["semantic_weight"]

    if name == "swin_tiny_patch4_window7_224":
        model = swin_tiny_patch4_window7_224(
            img_size=img_size,
            drop_path_rate=drop_path_rate,
            drop_rate=drop_rate,
            attn_drop_rate=attn_drop_rate,
            pretrained=pretrained,
            convert_weights=convert_weights,
            semantic_weight=semantic_weight,
        )

    elif name == "swin_small_patch4_window7_224":
        model = swin_small_patch4_window7_224(
            img_size=img_size,
            drop_path_rate=drop_path_rate,
            drop_rate=drop_rate,
            attn_drop_rate=attn_drop_rate,
            pretrained=pretrained,
            convert_weights=convert_weights,
            semantic_weight=semantic_weight,
        )

    elif name == "swin_base_patch4_window7_224":
        model = swin_base_patch4_window7_224(
            img_size=img_size,
            drop_path_rate=drop_path_rate,
            drop_rate=drop_rate,
            attn_drop_rate=attn_drop_rate,
            pretrained=pretrained,
            convert_weights=convert_weights,
            semantic_weight=semantic_weight,
        )

    else:
        raise RuntimeError(f"Unsupported model name: {name}")

    if pretrained != "":
        if os.path.exists(pretrained):
            model.init_weights(pretrained)
        else:
            warnings.warn(f"pretrained: {pretrained} does not exist")

    return model
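
# Example usage of build_solider (illustrative sketch; the empty "pretrained" string
# skips checkpoint loading, and the 384x128 input size matches the person-image setup
# used elsewhere in this file):
#
#     backbone = build_solider(
#         {
#             "name": "swin_small_patch4_window7_224",
#             "img_size": [384, 128],
#             "pretrained": "",
#             "semantic_weight": 0.2,
#         }
#     )
#     tokens = backbone(torch.randn(1, 3, 384, 128))  # -> (1, 1 + 12 * 4, 768)
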

# BACKBONE_NAME2WIDTH = {
#     "swin_tiny_patch4_window7_224": 768,
#     "swin_small_patch4_window7_224": 768,
#     "swin_base_patch4_window7_224": 1024,
#     "solider_tiny": 768,
#     "solider_small": 768,
#     "solider_base": 1024,
# }


SOLIDER_BASE_MODEL_CONFIG_PARAMETERS = {
    "pretrain_img_size": [224, 224],
    "in_channels": 3,
    "embed_dims": 128,
    "patch_size": 4,
    "window_size": 7,
    "mlp_ratio": 4,
    "depths": (2, 2, 18, 2),
    "num_heads": (4, 8, 16, 32),
    "strides": (4, 2, 2, 2),
    "out_indices": (0, 1, 2, 3),
    "qkv_bias": True,
    "qk_scale": None,
    "patch_norm": True,
    "drop_rate": 0.0,
    "attn_drop_rate": 0.0,
    "drop_path_rate": 0.0,
    "use_abs_pos_embed": False,
    "act_cfg": dict(type="GELU"),
    "norm_cfg": dict(type="LN"),
    "with_cp": False,
    "pretrained": None,
    "convert_weights": False,
    "frozen_stages": -1,
    "init_cfg": None,
    "semantic_weight": 0.2,
    "name": "solider_base",
}

SOLIDER_SMALL_MODEL_CONFIG_PARAMETERS = {
    "pretrain_img_size": [224, 224],
    "in_channels": 3,
    "embed_dims": 96,
    "patch_size": 4,
    "window_size": 7,
    "mlp_ratio": 4,
    "depths": (2, 2, 18, 2),
    "num_heads": (3, 6, 12, 24),
    "strides": (4, 2, 2, 2),
    "out_indices": (0, 1, 2, 3),
    "qkv_bias": True,
    "qk_scale": None,
    "patch_norm": True,
    "drop_rate": 0.0,
    "attn_drop_rate": 0.0,
    "drop_path_rate": 0.0,
    "use_abs_pos_embed": False,
    "act_cfg": dict(type="GELU"),
    "norm_cfg": dict(type="LN"),
    "with_cp": False,
    "pretrained": None,
    "convert_weights": False,
    "frozen_stages": -1,
    "init_cfg": None,
    "semantic_weight": 0.2,
    "name": "solider_small",
}

SOLIDER_TINY_MODEL_CONFIG_PARAMETERS = {
    "pretrain_img_size": [224, 224],
    "in_channels": 3,
    "embed_dims": 96,
    "patch_size": 4,
    "window_size": 7,
    "mlp_ratio": 4,
    "depths": (2, 2, 6, 2),
    "num_heads": (3, 6, 12, 24),
    "strides": (4, 2, 2, 2),
    "out_indices": (0, 1, 2, 3),
    "qkv_bias": True,
    "qk_scale": None,
    "patch_norm": True,
    "drop_rate": 0.0,
    "attn_drop_rate": 0.0,
    "drop_path_rate": 0.0,
    "use_abs_pos_embed": False,
    "act_cfg": dict(type="GELU"),
    "norm_cfg": dict(type="LN"),
    "with_cp": False,
    "pretrained": None,
    "convert_weights": False,
    "frozen_stages": -1,
    "init_cfg": None,
    "semantic_weight": 0.2,
    "name": "solider_tiny",
}

SOLIDER_BASE_CONFIG = SOLIDERConfig(**SOLIDER_BASE_MODEL_CONFIG_PARAMETERS)
SOLIDER_SMALL_CONFIG = SOLIDERConfig(**SOLIDER_SMALL_MODEL_CONFIG_PARAMETERS)
SOLIDER_TINY_CONFIG = SOLIDERConfig(**SOLIDER_TINY_MODEL_CONFIG_PARAMETERS)
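
# The three presets above mirror the tiny/small/base Swin variants and differ only in
# embed_dims, depths and num_heads; everything else (window size, strides, a semantic
# weight of 0.2) is shared.
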

def build_solider_vision_encoder(weight_path, name="swin_small_patch4_window7_224"):
    vision_width = BACKBONE_NAME2WIDTH[name]
    return (
        build_solider(
            {
                "name": name,
                "img_size": [384, 128],
                "pretrained": weight_path,
                "semantic_weight": 0.2,
            }
        ),
        vision_width,
    )


class SOLIDERModel(PreTrainedModel):
    config_class = SOLIDERConfig
    base_model_prefix = "solider"

    def __init__(self, config: SOLIDERConfig):
        super().__init__(config)
        self.solider = SwinTransformer(
            pretrain_img_size=config.pretrain_img_size,
            embed_dims=config.embed_dims,
            patch_size=config.patch_size,
            window_size=config.window_size,
            mlp_ratio=config.mlp_ratio,
            depths=config.depths,
            num_heads=config.num_heads,
            strides=config.strides,
            out_indices=config.out_indices,
            qkv_bias=config.qkv_bias,
            qk_scale=config.qk_scale,
            patch_norm=config.patch_norm,
            drop_rate=config.drop_rate,
            attn_drop_rate=config.attn_drop_rate,
            drop_path_rate=config.drop_path_rate,
            use_abs_pos_embed=config.use_abs_pos_embed,
            act_cfg=config.act_cfg,
            norm_cfg=config.norm_cfg,
            with_cp=config.with_cp,
            pretrained=config.pretrained,
            convert_weights=config.convert_weights,
            frozen_stages=config.frozen_stages,
            init_cfg=config.init_cfg,
            semantic_weight=config.semantic_weight,
        )
        self.solider_name = config.name
        self.vision_width = BACKBONE_NAME2WIDTH[self.solider_name]
        self.hidden_size = self.vision_width

        self.config = config
        # self.init_weights()

    def forward(self, x, semantic_weight=None):
        # if semantic_weight is None, use the default value from config
        return self.solider(x, semantic_weight)


class SoliderEncoder(SwinTransformer):
    options = [
        "swin_tiny_patch4_window7_224",
        "swin_small_patch4_window7_224",
        "swin_base_patch4_window7_224",
    ]

    @classmethod
    def from_config(cls, cfg, from_pretrained=None):
        name = cfg.get("name", "swin_small_patch4_window7_224")
        img_size = cfg.get("img_size", [384, 128])
        drop_path_rate = cfg.get("drop_path_rate", 0.1)
        drop_rate = cfg.get("drop_rate", 0.0)
        attn_drop_rate = cfg.get("attn_drop_rate", 0.0)
        pretrained = cfg.get("pretrained", None)
        convert_weights = cfg.get("convert_weights", False)
        semantic_weight = cfg.get("semantic_weight", 0.2)
        if name == "swin_tiny_patch4_window7_224" or name == "tiny":
            model = swin_tiny_patch4_window7_224(
                img_size=img_size,
                drop_path_rate=drop_path_rate,
                drop_rate=drop_rate,
                attn_drop_rate=attn_drop_rate,
                pretrained=pretrained,
                convert_weights=convert_weights,
                semantic_weight=semantic_weight,
            )
        elif name == "swin_small_patch4_window7_224" or name == "small":
            model = swin_small_patch4_window7_224(
                img_size=img_size,
                drop_path_rate=drop_path_rate,
                drop_rate=drop_rate,
                attn_drop_rate=attn_drop_rate,
                pretrained=pretrained,
                convert_weights=convert_weights,
                semantic_weight=semantic_weight,
            )
        elif name == "swin_base_patch4_window7_224" or name == "base":
            model = swin_base_patch4_window7_224(
                img_size=img_size,
                drop_path_rate=drop_path_rate,
                drop_rate=drop_rate,
                attn_drop_rate=attn_drop_rate,
                pretrained=pretrained,
                convert_weights=convert_weights,
                semantic_weight=semantic_weight,
            )
        else:
            raise RuntimeError(f"Unsupported model name: {name}")
        model.vision_width = BACKBONE_NAME2WIDTH[name]
        if from_pretrained is not None:
            print("begin load pretrained model solider")
            state_dict_vision_encoder = torch.load(from_pretrained, map_location="cpu")
            msg = model.load_state_dict(state_dict_vision_encoder)
            print(msg)
        return model

    def forward_features(self, x, semantic_weight=None):
        return SwinTransformer.forward(self, x, semantic_weight)