Transformers documentation

You are viewing v4.17.0 version. A newer version v4.27.2 is available.

This is a recently introduced model, so the API hasn’t been tested extensively. There may be some bugs or slight breaking changes to fix in the future. If you see something strange, file a GitHub issue.

## Overview

The MaskFormer model was proposed in "Per-Pixel Classification is Not All You Need for Semantic Segmentation" by Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. MaskFormer addresses semantic segmentation with a mask classification paradigm instead of classic per-pixel classification.

The abstract from the paper is the following:

Tips:

• MaskFormer’s Transformer decoder is identical to the decoder of DETR. During training, the authors of DETR found it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the use_auxiliary_loss parameter of MaskFormerConfig to True, prediction feedforward neural networks (FFNs) and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
• To train the model in a distributed environment across multiple nodes, update the get_num_masks function inside the MaskFormerLoss class of modeling_maskformer.py. When training on multiple nodes, it should return the average number of target masks across all nodes, as can be seen in the original implementation here.
• Use MaskFormerFeatureExtractor to prepare images, and optionally targets, for the model.
• To get the final segmentation, call post_process_semantic_segmentation() or post_process_panoptic_segmentation(), depending on the task. Both tasks can be solved using the output of MaskFormerForInstanceSegmentation; the latter needs an additional is_thing_map to know which instances must be merged together.
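The multi-node tip above can be sketched in plain Python. The helper name average_num_masks is hypothetical, and the clamp to a minimum of 1 is an assumption borrowed from DETR-style training code rather than something this page states. In a real run the per-node counts would be summed with a collective such as torch.distributed.all_reduce instead of being passed in as a list.

```python
def average_num_masks(masks_per_node):
    """Average number of target masks across nodes (hypothetical helper).

    masks_per_node: one entry per node, the number of target masks in
    that node's local batch.
    """
    world_size = len(masks_per_node)
    total = sum(masks_per_node)  # stands in for an all-reduce (sum) across nodes
    # Assumption: clamp to 1 so a loss normalised by this value never divides by zero.
    return max(total / world_size, 1.0)


print(average_num_masks([4, 6, 2]))  # three nodes -> 4.0
```

Every node then normalises its loss by the same value, which keeps gradients comparable across nodes even when their local batches contain different numbers of masks.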

The figure below illustrates the architecture of MaskFormer. Taken from the original paper.

This model was contributed by francesco. The original code can be found here.

### MaskFormerModelOutput

( encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None pixel_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None transformer_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None pixel_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None transformer_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

• encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).
• pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
• transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
• encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.
• pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.
• transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the transformer decoder at the output of each stage.
• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.
• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights from DETR’s decoder after the attention softmax, used to compute the weighted average in the self-attention heads.

Class for outputs of MaskFormerModel. This class returns all the needed hidden states to compute the logits.

### MaskFormerForInstanceSegmentationOutput

( loss: typing.Optional[torch.FloatTensor] = None class_queries_logits: FloatTensor = None masks_queries_logits: FloatTensor = None auxiliary_logits: FloatTensor = None encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None pixel_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None transformer_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None pixel_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None transformer_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

• loss (torch.Tensor, optional) — The computed loss, returned when labels are present.
• class_queries_logits (torch.FloatTensor) — A tensor of shape (batch_size, num_queries, num_classes + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.
• masks_queries_logits (torch.FloatTensor) — A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.
• encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).
• pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
• transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
• encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.
• pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.
• transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the transformer decoder at the output of each stage.
• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.
• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights from DETR’s decoder after the attention softmax, used to compute the weighted average in the self-attention heads.

This output can be passed directly to post_process_segmentation() or post_process_panoptic_segmentation(), depending on the task. See MaskFormerFeatureExtractor for details regarding usage.

### MaskFormerConfig

( fpn_feature_size: int = 256 mask_feature_size: int = 256 no_object_weight: float = 0.1 use_auxiliary_loss: bool = False backbone_config: typing.Optional[typing.Dict] = None decoder_config: typing.Optional[typing.Dict] = None init_std: float = 0.02 init_xavier_std: float = 1.0 dice_weight: float = 1.0 cross_entropy_weight: float = 1.0 mask_weight: float = 20.0 **kwargs )

Parameters

• mask_feature_size (int, optional, defaults to 256) — The masks’ features size, this value will also be used to specify the Feature Pyramid Network features’ size.
• no_object_weight (float, optional, defaults to 0.1) — Weight to apply to the null (no object) class.
• use_auxiliary_loss (bool, optional, defaults to False) — If True, MaskFormerForInstanceSegmentationOutput will contain the auxiliary losses computed using the logits from each decoder stage.
• backbone_config (Dict, optional) — The configuration passed to the backbone. If unset, the configuration corresponding to swin-base-patch4-window12-384 will be used.
• decoder_config (Dict, optional) — The configuration passed to the transformer decoder model. If unset, the base config for detr-resnet-50 will be used.
• init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
• init_xavier_std (float, optional, defaults to 1.0) — The scaling factor used for the Xavier initialization gain in the HM Attention map module.
• dice_weight (float, optional, defaults to 1.0) — The weight for the dice loss.
• cross_entropy_weight (float, optional, defaults to 1.0) — The weight for the cross entropy loss.
• mask_weight (float, optional, defaults to 20.0) — The weight for the mask loss.

This is the configuration class to store the configuration of a MaskFormerModel. It is used to instantiate a MaskFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the “facebook/maskformer-swin-base-ade” architecture trained on ADE20k-150.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Currently, MaskFormer only supports the Swin Transformer as backbone.
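As a rough illustration of what dice_weight, cross_entropy_weight and mask_weight control, the sketch below combines three already computed loss terms into one scalar. It assumes the weights act as plain multipliers in a weighted sum, which is how the paper describes its loss; the function name total_loss is hypothetical and the library's own loss class may organise this differently.

```python
def total_loss(cross_entropy, mask, dice,
               cross_entropy_weight=1.0, mask_weight=20.0, dice_weight=1.0):
    """Weighted sum of the three loss terms (hypothetical helper).

    The default weights mirror the MaskFormerConfig defaults listed above.
    """
    return (cross_entropy_weight * cross_entropy
            + mask_weight * mask
            + dice_weight * dice)


# With the defaults, the mask term dominates unless it is much smaller
# than the other two terms.
print(total_loss(cross_entropy=0.5, mask=0.1, dice=0.3))
```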

Examples:

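>>> from transformers import MaskFormerConfig, MaskFormerModel

>>> # Initializing a configuration with the default values
>>> configuration = MaskFormerConfig()
>>> # Initializing a model (with random weights) from that configuration
>>> model = MaskFormerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config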

#### from_backbone_and_decoder_configs

( backbone_config: PretrainedConfig decoder_config: PretrainedConfig **kwargs ) MaskFormerConfig

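Parameters

• backbone_config (PretrainedConfig) — The backbone configuration.
• decoder_config (PretrainedConfig) — The transformer decoder configuration.

Returns

MaskFormerConfig

An instance of a configuration object.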

Instantiate a MaskFormerConfig (or a derived class) from a pre-trained backbone model configuration and DETR model configuration.

#### to_dict

( ) Dict[str, any]

Returns

Dict[str, any]

Dictionary of all the attributes that make up this configuration instance.

Serializes this instance to a Python dictionary. Override the default to_dict().

### MaskFormerFeatureExtractor

( do_resize = True size = 800 max_size = 1333 size_divisibility = 32 do_normalize = True image_mean = None image_std = None ignore_index = 255 **kwargs )

Parameters

• do_resize (bool, optional, defaults to True) — Whether to resize the input to a certain size.
• size (int, optional, defaults to 800) — Resize the input to the given size. Only has an effect if do_resize is set to True. If size is a sequence like (width, height), the output size will be matched to this. If size is an int, the smaller edge of the image will be matched to this number, i.e., if height > width, then the image will be rescaled to (size * height / width, size).
• max_size (int, optional, defaults to 1333) — The largest size an image dimension can have (otherwise it’s capped). Only has an effect if do_resize is set to True.
• size_divisibility (int, optional, defaults to 32) — Some backbones need images divisible by a certain number. If not passed, it defaults to the value used in Swin Transformer.
• do_normalize (bool, optional, defaults to True) — Whether or not to normalize the input with mean and standard deviation.
• image_mean (List[float], optional, defaults to [0.485, 0.456, 0.406]) — The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.
• image_std (List[float], optional, defaults to [0.229, 0.224, 0.225]) — The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the ImageNet std.
• ignore_index (int, optional, defaults to 255) — Value of the index (label) to ignore.

Constructs a MaskFormer feature extractor. The feature extractor can be used to prepare image(s) and optional targets for the model.

This feature extractor inherits from FeatureExtractionMixin which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
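To make the interaction between size, max_size and size_divisibility concrete, here is a small sketch of the output shape they imply for an integer size. The function resized_shape is hypothetical, and the exact rounding (nearest integer after scaling, then rounding each side up to a multiple of size_divisibility) is an assumption; the library may round slightly differently.

```python
import math


def resized_shape(height, width, size=800, max_size=1333, size_divisibility=32):
    """Sketch of the output (height, width) implied by the resize parameters."""
    # Match the shorter edge to `size` ...
    scale = size / min(height, width)
    # ... unless that would push the longer edge past `max_size`.
    if max(height, width) * scale > max_size:
        scale = max_size / max(height, width)
    new_height, new_width = round(height * scale), round(width * scale)
    # Assumption: round each side up to the next multiple of `size_divisibility`.
    if size_divisibility > 0:
        new_height = math.ceil(new_height / size_divisibility) * size_divisibility
        new_width = math.ceil(new_width / size_divisibility) * size_divisibility
    return new_height, new_width


print(resized_shape(480, 640))   # shorter edge scaled to 800
print(resized_shape(100, 1000))  # longer edge capped by max_size first
```

Note that under this reading the final rounding step can push a side slightly above max_size, because divisibility is enforced after the cap.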

#### __call__

( images: typing.Union[PIL.Image.Image, numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] annotations: typing.Union[typing.List[typing.Dict], typing.List[typing.List[typing.Dict]]] = None pad_and_return_pixel_mask: typing.Optional[bool] = True return_tensors: typing.Union[str, transformers.file_utils.TensorType, NoneType] = None **kwargs ) BatchFeature

Parameters

• images (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a number of channels, H and W are image height and width.
• annotations (Dict, List[Dict], optional) — The corresponding annotations as a dictionary of NumPy arrays with the following keys:
    • masks (np.ndarray) The target mask of shape (num_classes, height, width).
    • labels (np.ndarray) The target labels of shape (num_classes).
• pad_and_return_pixel_mask (bool, optional, defaults to True) — Whether or not to pad images up to the largest image in a batch and create a pixel mask. If left to the default, will return a pixel mask that is:
    • 1 for pixels that are real (i.e. not masked),
    • 0 for pixels that are padding (i.e. masked).
• return_tensors (str or TensorType, optional) — If set, will return tensors instead of NumPy arrays. If set to 'pt', return PyTorch torch.Tensor objects.

Returns

BatchFeature

A BatchFeature with the following fields:

• pixel_values — Pixel values to be fed to a model.
• pixel_mask — Pixel mask to be fed to a model (when pad_and_return_pixel_mask=True or if “pixel_mask” is in self.model_input_names).
• mask_labels — Optional mask labels of shape (batch_size, num_classes, height, width) to be fed to a model (when annotations are provided).
• class_labels — Optional class labels of shape (batch_size, num_classes) to be fed to a model (when annotations are provided).

Main method to prepare for the model one or several image(s) and optional annotations. Images are by default padded up to the largest image in a batch, and a pixel mask is created that indicates which pixels are real/which are padding.

NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so passing PIL images is the most efficient option.
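The annotations argument expects one binary mask and one label per class. As a sketch of how such an annotation could be derived from an ordinary semantic label map, the hypothetical helper below builds both lists and skips pixels equal to ignore_index. It uses nested Python lists to stay dependency-free; per the parameter description the feature extractor expects NumPy arrays, so the results would still need converting (for example with np.asarray).

```python
def label_map_to_annotation(label_map, ignore_index=255):
    """Split a 2D label map into per-class binary masks (hypothetical helper).

    label_map: [height][width] list of integer class ids.
    Returns {"masks": [num_classes][height][width], "labels": [num_classes]},
    where num_classes counts only the classes present in the map.
    """
    labels = sorted({v for row in label_map for v in row if v != ignore_index})
    masks = [
        [[1 if v == label else 0 for v in row] for row in label_map]
        for label in labels
    ]
    return {"masks": masks, "labels": labels}


annotation = label_map_to_annotation([[0, 0, 255],
                                      [2, 2, 0]])
print(annotation["labels"])  # [0, 2]
```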

#### encode_inputs

( pixel_values_list: typing.List[ForwardRef('torch.Tensor')] annotations: typing.Optional[typing.List[typing.Dict]] = None pad_and_return_pixel_mask: typing.Optional[bool] = True return_tensors: typing.Union[str, transformers.file_utils.TensorType, NoneType] = None ) BatchFeature

Parameters

• pixel_values_list (List[torch.Tensor]) — List of images (pixel values) to be padded. Each image should be a tensor of shape (channels, height, width).
• annotations (Dict, List[Dict], optional) — The corresponding annotations as a dictionary of NumPy arrays with the following keys:
    • masks (np.ndarray) The target mask of shape (num_classes, height, width).
    • labels (np.ndarray) The target labels of shape (num_classes).
• pad_and_return_pixel_mask (bool, optional, defaults to True) — Whether or not to pad images up to the largest image in a batch and create a pixel mask. If left to the default, will return a pixel mask that is:
    • 1 for pixels that are real (i.e. not masked),
    • 0 for pixels that are padding (i.e. masked).
• return_tensors (str or TensorType, optional) — If set, will return tensors instead of NumPy arrays. If set to 'pt', return PyTorch torch.Tensor objects.

Returns

BatchFeature

A BatchFeature with the following fields:

• pixel_values — Pixel values to be fed to a model.
• pixel_mask — Pixel mask to be fed to a model (when pad_and_return_pixel_mask=True or if “pixel_mask” is in self.model_input_names).
• mask_labels — Optional mask labels of shape (batch_size, num_classes, height, width) to be fed to a model (when annotations are provided).
• class_labels — Optional class labels of shape (batch_size, num_classes) to be fed to a model (when annotations are provided).

Pad images up to the largest image in a batch and create a corresponding pixel_mask.
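The padding step can be illustrated with a minimal sketch on single-channel images stored as nested lists. The helper pad_batch is hypothetical, and padding on the bottom and right with zeros is an assumption of this sketch, not something this page specifies.

```python
def pad_batch(images, pad_value=0):
    """Pad 2D images to a common size and build pixel masks (hypothetical helper).

    images: list of [height][width] lists, one channel each for brevity.
    Returns (padded_images, pixel_masks); a mask holds 1 for real pixels
    and 0 for padding.
    """
    max_height = max(len(img) for img in images)
    max_width = max(len(img[0]) for img in images)
    padded, pixel_masks = [], []
    for img in images:
        height, width = len(img), len(img[0])
        # Assumption: pad on the right, then on the bottom.
        rows = [row + [pad_value] * (max_width - width) for row in img]
        rows += [[pad_value] * max_width for _ in range(max_height - height)]
        mask = [[1] * width + [0] * (max_width - width) for _ in range(height)]
        mask += [[0] * max_width for _ in range(max_height - height)]
        padded.append(rows)
        pixel_masks.append(mask)
    return padded, pixel_masks


padded, masks = pad_batch([[[1, 2], [3, 4]], [[5]]])
print(padded[1])  # [[5, 0], [0, 0]]
print(masks[1])   # [[1, 0], [0, 0]]
```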

#### post_process_segmentation

( outputs: MaskFormerForInstanceSegmentationOutput target_size: typing.Tuple[int, int] = None ) torch.Tensor

Parameters

• outputs (MaskFormerForInstanceSegmentationOutput) — The outputs from MaskFormerForInstanceSegmentation.
• target_size (Tuple[int, int], optional) — If set, the masks_queries_logits will be resized to target_size.

Returns

torch.Tensor

A tensor of shape (batch_size, num_labels, height, width).

Converts a MaskFormerForInstanceSegmentationOutput into image segmentation predictions. Only supports PyTorch.

#### post_process_semantic_segmentation

( outputs: MaskFormerForInstanceSegmentationOutput target_size: typing.Tuple[int, int] = None ) torch.Tensor

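Parameters

• outputs (MaskFormerForInstanceSegmentationOutput) — The outputs from MaskFormerForInstanceSegmentation.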

Returns

torch.Tensor

A tensor of shape (batch_size, height, width).

Converts a MaskFormerForInstanceSegmentationOutput into semantic segmentation predictions. Only supports PyTorch.
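The sketch below shows, for a single image and in plain Python, the semantic inference rule described in the MaskFormer paper: each class is scored at each pixel by summing, over all queries, the query's class probability times its mask probability, and the pixel takes the best-scoring class. It is an illustration of that rule, not the library's implementation. The function names are hypothetical, and treating the null class as the last entry of each class vector is an assumption. On real tensors the same computation would be a softmax, a sigmoid and an einsum.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def semantic_segmentation(class_queries_logits, masks_queries_logits):
    """Single-image sketch (hypothetical helper).

    class_queries_logits: [num_queries][num_classes + 1]
    masks_queries_logits: [num_queries][height][width]
    Returns a [height][width] map of class indices.
    """
    num_queries = len(class_queries_logits)
    num_classes = len(class_queries_logits[0]) - 1
    # Class probabilities per query, dropping the trailing null class (assumption).
    class_probs = [softmax(logits)[:num_classes] for logits in class_queries_logits]
    height, width = len(masks_queries_logits[0]), len(masks_queries_logits[0][0])
    segmentation = []
    for y in range(height):
        row = []
        for x in range(width):
            mask_probs = [sigmoid(masks_queries_logits[q][y][x]) for q in range(num_queries)]
            # Score of class c at this pixel: sum over queries of p_q(c) * m_q[y, x].
            scores = [
                sum(class_probs[q][c] * mask_probs[q] for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(scores.index(max(scores)))
        segmentation.append(row)
    return segmentation


# Two queries, two real classes plus the null class, a 1 x 2 image.
print(semantic_segmentation(
    [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]],
    [[[4.0, -4.0]], [[-4.0, 4.0]]],
))  # [[0, 1]]
```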

#### post_process_panoptic_segmentation

( outputs: MaskFormerForInstanceSegmentationOutput object_mask_threshold: float = 0.8 overlap_mask_area_threshold: float = 0.8 is_thing_map: typing.Union[typing.Dict[int, bool], NoneType] = None ) List[Dict]

Parameters

• outputs (MaskFormerForInstanceSegmentationOutput) — The outputs from MaskFormerForInstanceSegmentation.
• object_mask_threshold (float, optional, defaults to 0.8) — The object mask threshold.
• overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to use.
• is_thing_map (Dict[int, bool], optional) — Dictionary mapping class indices to either True or False, depending on whether or not they are a thing. If not set, defaults to the is_thing_map of COCO panoptic.

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

• segmentation — a tensor of shape (height, width) where each pixel represents a segment_id.
• segments — a dictionary with the following keys:
    • id — an integer representing the segment_id.
    • category_id — an integer representing the segment’s label.
    • is_thing — a boolean, True if category_id was in is_thing_map, False otherwise.

Converts a MaskFormerForInstanceSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.
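To show how object_mask_threshold, overlap_mask_area_threshold and is_thing_map might fit together, here is a simplified single-image sketch that loosely follows the panoptic inference described in the paper. It is not the library's implementation: the function name, the 0.5 mask binarisation, the use of 0 for unassigned pixels, and the choice to treat classes missing from is_thing_map as "stuff" are all assumptions. Inputs are already probabilities, not logits.

```python
def panoptic_sketch(query_scores, query_labels, mask_probs, is_thing_map,
                    object_mask_threshold=0.8, overlap_mask_area_threshold=0.8):
    """query_scores[q]: confidence of query q's best class,
    query_labels[q]: that class id,
    mask_probs[q][y][x]: mask probability of query q at pixel (y, x)."""
    height, width = len(mask_probs[0]), len(mask_probs[0][0])
    segmentation = [[0] * width for _ in range(height)]  # 0 = unassigned
    segments = []
    # 1. Drop low-confidence queries.
    kept = [q for q, score in enumerate(query_scores) if score > object_mask_threshold]
    if not kept:
        return segmentation, segments
    # 2. Each pixel goes to the kept query with the highest score * mask probability.
    owner = [[max(kept, key=lambda q: query_scores[q] * mask_probs[q][y][x])
              for x in range(width)] for y in range(height)]
    stuff_ids = {}  # category_id -> segment id, so "stuff" regions are merged
    for q in kept:
        pixels = [(y, x) for y in range(height) for x in range(width)
                  if owner[y][x] == q and mask_probs[q][y][x] >= 0.5]
        original_area = sum(p >= 0.5 for row in mask_probs[q] for p in row)
        # 3. Skip segments that are empty or mostly occluded by other queries.
        if not pixels or len(pixels) / original_area <= overlap_mask_area_threshold:
            continue
        label = query_labels[q]
        is_thing = is_thing_map.get(label, False)
        if not is_thing and label in stuff_ids:
            segment_id = stuff_ids[label]
        else:
            segment_id = len(segments) + 1
            segments.append({"id": segment_id, "category_id": label, "is_thing": is_thing})
            if not is_thing:
                stuff_ids[label] = segment_id
        for y, x in pixels:
            segmentation[y][x] = segment_id
    return segmentation, segments


seg, info = panoptic_sketch(
    query_scores=[0.95, 0.9, 0.3],
    query_labels=[1, 0, 1],
    mask_probs=[[[0.9, 0.9, 0.1, 0.1]], [[0.1, 0.1, 0.9, 0.9]], [[0.9, 0.9, 0.9, 0.9]]],
    is_thing_map={0: False, 1: True},
)
print(seg)  # [[1, 1, 2, 2]]
```

The third query is discarded by the confidence threshold, so it never competes for pixels even though its mask covers the whole image.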

### MaskFormerModel

Parameters

• config (MaskFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare MaskFormer Model outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

#### forward

( pixel_values: Tensor pixel_mask: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or tuple(torch.FloatTensor)

Parameters

• pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoFeatureExtractor. See AutoFeatureExtractor.__call__() for details.
• pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:
    • 1 for pixels that are real (i.e. not masked),
    • 0 for pixels that are padding (i.e. masked).
• output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
• output_attentions (bool, optional) — Whether or not to return the attention tensors of DETR’s decoder attention layers.
• return_dict (bool, optional) — Whether or not to return a MaskFormerModelOutput instead of a plain tuple.

Returns

transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or tuple(torch.FloatTensor)

A transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (MaskFormerConfig) and inputs.

• encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).
• pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
• transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
• encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.
• pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.
• transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the transformer decoder at the output of each stage.
• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.
• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights from DETR’s decoder after the attention softmax, used to compute the weighted average in the self-attention heads.

The MaskFormerModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

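>>> from transformers import MaskFormerFeatureExtractor, MaskFormerModel
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> # Load the feature extractor and model from the same checkpoint
>>> feature_extractor = MaskFormerFeatureExtractor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> model = MaskFormerModel.from_pretrained("facebook/maskformer-swin-base-ade")

>>> inputs = feature_extractor(image, return_tensors="pt")

>>> # Inference only, so gradients are not needed
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # MaskFormerModelOutput has no last_hidden_state field; use one of the documented ones
>>> last_hidden_states = outputs.transformer_decoder_last_hidden_state
>>> list(last_hidden_states.shape)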

### MaskFormerForInstanceSegmentation

#### forward

( pixel_values: Tensor mask_labels: typing.Optional[torch.Tensor] = None class_labels: typing.Optional[torch.Tensor] = None pixel_mask: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or tuple(torch.FloatTensor)

Parameters

• pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoFeatureExtractor. See AutoFeatureExtractor.__call__() for details.
• pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:
    • 1 for pixels that are real (i.e. not masked),
    • 0 for pixels that are padding (i.e. masked).
• output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
• output_attentions (bool, optional) — Whether or not to return the attention tensors of DETR’s decoder attention layers.
• return_dict (bool, optional) — Whether or not to return a MaskFormerForInstanceSegmentationOutput instead of a plain tuple.
• mask_labels (torch.FloatTensor, optional) — The target mask of shape (num_classes, height, width).
• class_labels (torch.LongTensor, optional) — The target labels of shape (num_classes).

Returns

transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or tuple(torch.FloatTensor)

A transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (MaskFormerConfig) and inputs.

• loss (torch.Tensor, optional) — The computed loss, returned when labels are present.
• class_queries_logits (torch.FloatTensor) — A tensor of shape (batch_size, num_queries, num_classes + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.
• masks_queries_logits (torch.FloatTensor) — A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.
• encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).
• pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
• transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
• encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.
• pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.
• transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the transformer decoder at the output of each stage.
• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.
• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights from DETR’s decoder after the attention softmax, used to compute the weighted average in the self-attention heads.

The MaskFormerForInstanceSegmentation forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

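>>> from transformers import MaskFormerFeatureExtractor, MaskFormerForInstanceSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> feature_extractor = MaskFormerFeatureExtractor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> inputs = feature_extractor(images=image, return_tensors="pt")

>>> model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")
>>> outputs = model(**inputs)
>>> # model predicts class_queries_logits of shape (batch_size, num_queries, num_classes + 1)
>>> # and masks_queries_logits of shape (batch_size, num_queries, height, width)
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits

>>> # pass the outputs to the feature extractor for post-processing
>>> output = feature_extractor.post_process_panoptic_segmentation(outputs)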