# Axial-DeepLab
Axial-DeepLab, improving over Panoptic-DeepLab, incorporates the powerful
axial self-attention modules [1], also known as the encoder of Axial
Transformers [2], for general dense prediction tasks. In this document,
we demonstrate the effectiveness of Axial-DeepLab on the task of panoptic
segmentation [6], unifying semantic segmentation and instance segmentation.
To reduce the computational complexity of 2D self-attention (especially
prominent for dense pixel prediction tasks), and to allow us to
perform attention within a larger or even global region, we factorize the 2D
self-attention [1, 3, 4] into **two** 1D self-attentions [2, 5]. We then
effectively integrate the **axial-attention** into a residual block [7], as
illustrated in Fig. 1.
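To make the savings concrete, here is a back-of-the-envelope count for an $$h \times w$$ feature map (ignoring the channel dimension and the number of heads):

```latex
% Global 2D self-attention: each of the hw positions attends to all hw positions.
\mathcal{O}\big((hw)^2\big) = \mathcal{O}(h^2 w^2)
% Axial factorization: each position attends to the h positions in its column,
% then to the w positions in its row.
\mathcal{O}(hw \cdot h) + \mathcal{O}(hw \cdot w) = \mathcal{O}\big(hw\,(h + w)\big)
```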
<p align="center">
<img src="../img/axial_deeplab/axial_block.png" width=800>
<br>
<em>Figure 1. An axial-attention (residual) block, which consists of two
axial-attention layers operating along height- and width-axis
sequentially.</em>
</p>
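The factorization in Fig. 1 can be sketched in a few lines of NumPy. This is a toy single-head version with identity query/key/value projections, purely for illustration; the actual implementation in this repo is TensorFlow code with learned projections and positional encodings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """1D self-attention along one spatial axis of an (H, W, C) feature map."""
    x = np.moveaxis(x, axis, 0)         # bring the target axis to the front: (L, M, C)
    L, M, C = x.shape
    seq = x.transpose(1, 0, 2)          # (M, L, C): M independent 1D sequences
    # Identity q/k/v projections keep the sketch short; real blocks learn them.
    logits = seq @ seq.transpose(0, 2, 1) / np.sqrt(C)  # (M, L, L)
    out = softmax(logits) @ seq                         # (M, L, C)
    out = out.transpose(1, 0, 2)        # back to (L, M, C)
    return np.moveaxis(out, 0, axis)

def axial_block(x):
    """Residual block: height-axis attention, then width-axis attention."""
    x = x + axis_attention(x, axis=0)   # attend along height
    x = x + axis_attention(x, axis=1)   # attend along width
    return x

feat = np.random.randn(8, 16, 4)        # (H, W, C) toy feature map
out = axial_block(feat)
assert out.shape == feat.shape
```

Each 1D attention scales with the length of one axis rather than the full $$h \times w$$ area, which is what makes a global receptive field affordable at stride-16 feature resolutions.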
The backbone of Axial-DeepLab, called Axial-ResNet, is obtained by replacing
the residual blocks in any type of ResNets (e.g., Wide ResNets [8, 9]) with
our proposed axial-attention blocks. Optionally, one could stack only the
axial-attention blocks to form an **axial** stand-alone self-attention
backbone. However, considering a better speed-accuracy trade-off
(convolutions are typically well-optimized on modern accelerators), we
adopt the hybrid CNN-Transformer architecture, where we stack the effective
**axial-attention blocks** on top of the first few stages of ResNets (e.g.,
Wide ResNets). In particular, in this document, we explore the case where
we stack the axial-attention blocks after the *conv3_x*, i.e., we apply
axial-attentions after (and *including*) stride 16 feature maps. This
hybrid CNN-Transformer architecture is very effective on panoptic
segmentation tasks as shown in the Model Zoo below.
Additionally, we propose a position-sensitive self-attention design,
which captures long range interactions with precise positional information.
We illustrate the difference between our design and the popular non-local
block in Fig. 2.
<p align="center">
<img src="../img/axial_deeplab/nonlocal_block.png" height=250>
<img src="../img/axial_deeplab/position_sensitive_axial_block.png" height=250>
</p>
<center><em>Figure 2. A non-local block (left) vs. our position-sensitive
axial-attention applied along the width-axis (right). $$\otimes$$ denotes
matrix multiplication, and $$\oplus$$ denotes elementwise sum. The softmax
is performed on the last axis. Blue boxes denote 1 × 1 convolutions, and
red boxes denote relative positional encoding.</em></center>
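In short, the position-sensitive design adds three learned relative positional encodings: one biasing the queries, one biasing the keys, and one added to the values. A NumPy sketch of this formulation for a single row (width-axis attention), with illustrative shapes and no learned q/k/v projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_sensitive_attention_1d(q, k, v, r_q, r_k, r_v):
    """Position-sensitive attention along one axis (a single row of width W).

    q, k, v:       (W, C) query/key/value for each position in the row.
    r_q, r_k, r_v: (2W-1, C) learned relative positional encodings,
                   indexed by the offset p - o in [-(W-1), W-1].
    """
    W, C = q.shape
    # idx[o, p] selects the encoding for offset p - o.
    idx = np.arange(W)[None, :] - np.arange(W)[:, None] + (W - 1)   # (W, W)
    logits = (q @ k.T                                # content-content term
              + np.einsum('oc,opc->op', q, r_q[idx]) # query-position term
              + np.einsum('pc,opc->op', k, r_k[idx]))# key-position term
    attn = softmax(logits, axis=-1)                  # (W, W)
    # Values are augmented with their own positional encodings before pooling.
    return np.einsum('op,opc->oc', attn, v[None, :, :] + r_v[idx])

W, C = 16, 8
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(W, C)) for _ in range(3))
r_q, r_k, r_v = (rng.normal(size=(2 * W - 1, C)) for _ in range(3))
out = position_sensitive_attention_1d(q, k, v, r_q, r_k, r_v)
assert out.shape == (W, C)
```

The content term $$q^\top k$$ alone is permutation-invariant along the axis; the two positional bias terms and the value offsets are what let the attention distinguish, for example, a left neighbor from a right neighbor at the same distance.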
## Prerequisite
1. Make sure the software is properly [installed](../setup/installation.md).
2. Make sure the target dataset is correctly prepared (e.g.,
[Cityscapes](../setup/cityscapes.md)).
3. Download the ImageNet pretrained
[checkpoints](./imagenet_pretrained_checkpoints.md), and update the
`initial_checkpoint` path in the config files.
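The `initial_checkpoint` update can be done by editing the config file by hand; for scripted setups, here is a minimal helper sketch, assuming the field appears somewhere in the textproto as a line of the form `initial_checkpoint: "..."` (the exact nesting in the released configs may differ):

```python
import re

def set_initial_checkpoint(config_text: str, ckpt_path: str) -> str:
    """Rewrite the initial_checkpoint field of a textproto config string."""
    return re.sub(r'initial_checkpoint:\s*"[^"]*"',
                  f'initial_checkpoint: "{ckpt_path}"',
                  config_text)

# Toy config fragment; field nesting in the real configs may differ.
config = 'model_options {\n  initial_checkpoint: "/old/path/ckpt-0"\n}\n'
print(set_initial_checkpoint(config, "/data/imagenet_pretrained/ckpt-100"))
```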
## Model Zoo
In the Model Zoo, we explore building axial-attention blocks on top of
SWideRNet (Scaling Wide ResNets) and MaX-DeepLab backbones (i.e., only
the ImageNet pretrained backbone without any *Mask Transformers*).
Herein, we highlight some of the employed backbones:
1. **Axial-SWideRNet-(1, 1, x)**, where x $$\in \{1, 3, 4.5\}$$, scaling the
backbone layers (excluding the stem) of Wide-ResNet-41 by a factor of x. This
backbone augments the naive SWideRNet (i.e., no Squeeze-and-Excitation
or Switchable Atrous Convolution) with axial-attention blocks in the last
two stages.
2. **MaX-DeepLab-S-Backbone**: The ImageNet pretrained backbone of
MaX-DeepLab-S (i.e., without any *Mask Transformers*). This backbone augments
the ResNet-50-Beta (i.e., replacing the original stem with Inception stem)
with axial-attention blocks in the last two stages.
3. **MaX-DeepLab-L-Backbone**: The ImageNet pretrained backbone of
MaX-DeepLab-L (i.e., without any *Mask Transformers*). This backbone adds a
stacked decoder on top of the Wide ResNet-41, and incorporates
axial-attention blocks to all feature maps with output stride 16 and larger.
#### Cityscapes Panoptic Segmentation
We provide checkpoints trained on the Cityscapes train-fine set below. If you
would like to train those models by yourself, please find the corresponding
config files under this [directory](../../configs/cityscapes/axial_deeplab).
All the reported results are obtained by *single-scale* inference and
*ImageNet-1K* pretrained checkpoints.
Backbone | Output stride | Input resolution | PQ [*] | mIoU [*] | PQ [**] | mIoU [**] | AP<sup>Mask</sup> [**]
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-----------: | :---------------: | :----: | :------: | :-----: | :-------: | :--------------------:
Axial-SWideRNet-(1, 1, 1) ([config](../../configs/cityscapes/axial_deeplab/axial_swidernet_1_1_1_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/axial_swidernet_1_1_1_os16_axial_deeplab_cityscapes_trainfine.tar.gz)) | 16 | 1025 x 2049 | 66.1 | 82.8 | 66.63 | 83.43 | 37.18
Axial-SWideRNet-(1, 1, 3) ([config](../../configs/cityscapes/axial_deeplab/axial_swidernet_1_1_3_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/axial_swidernet_1_1_3_os16_axial_deeplab_cityscapes_trainfine.tar.gz)) | 16 | 1025 x 2049 | 67.1 | 83.5 | 67.63 | 83.97 | 40.00
Axial-SWideRNet-(1, 1, 4.5) ([config](../../configs/cityscapes/axial_deeplab/axial_swidernet_1_1_4.5_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/axial_swidernet_1_1_4.5_os16_axial_deeplab_cityscapes_trainfine.tar.gz)) | 16 | 1025 x 2049 | 68.0 | 83.0 | 68.53 | 83.49 | 39.51
MaX-DeepLab-S-Backbone ([config](../../configs/cityscapes/axial_deeplab/max_deeplab_s_backbone_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine.tar.gz)) | 16 | 1025 x 2049 | 64.5 | 82.2 | 64.97 | 82.63 | 35.55
MaX-DeepLab-L-Backbone ([config](../../configs/cityscapes/axial_deeplab/max_deeplab_l_backbone_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/max_deeplab_l_backbone_os16_axial_deeplab_cityscapes_trainfine.tar.gz)) | 16 | 1025 x 2049 | 66.3 | 83.1 | 66.77 | 83.67 | 38.09
[*]: Results evaluated by the official script. Instance segmentation evaluation
is not yet supported (our prediction format needs conversion first).
[**]: Results evaluated by our pipeline. See Q4 in [FAQ](../faq.md).
## Citing Axial-DeepLab
If you find this code helpful in your research or wish to refer to the baseline
results, please use the following BibTeX entry.
* Axial-DeepLab:
```
@inproceedings{axial_deeplab_2020,
author={Huiyu Wang and Yukun Zhu and Bradley Green and Hartwig Adam and Alan Yuille and Liang-Chieh Chen},
title={{Axial-DeepLab}: Stand-Alone Axial-Attention for Panoptic Segmentation},
booktitle={ECCV},
year={2020}
}
```
* Panoptic-DeepLab:
```
@inproceedings{panoptic_deeplab_2020,
author={Bowen Cheng and Maxwell D Collins and Yukun Zhu and Ting Liu and Thomas S Huang and Hartwig Adam and Liang-Chieh Chen},
title={{Panoptic-DeepLab}: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation},
booktitle={CVPR},
year={2020}
}
```
If you use the SWideRNet backbone with axial attention, please consider
citing
* SWideRNet:
```
@article{swidernet_2020,
title={Scaling Wide Residual Networks for Panoptic Segmentation},
author={Chen, Liang-Chieh and Wang, Huiyu and Qiao, Siyuan},
journal={arXiv:2011.11675},
year={2020}
}
```
If you use the MaX-DeepLab-{S,L} backbone, please consider
citing
* MaX-DeepLab:
```
@inproceedings{max_deeplab_2021,
author={Huiyu Wang and Yukun Zhu and Hartwig Adam and Alan Yuille and Liang-Chieh Chen},
title={{MaX-DeepLab}: End-to-End Panoptic Segmentation with Mask Transformers},
booktitle={CVPR},
year={2021}
}
```