Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Published on Apr 14


The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks: +2.3 and +2.2 mIoU on S3DIS Area 5 and 6-fold semantic segmentation, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validates the scalability, generality, and superior performance enabled by our approach. The code and models are available.
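The backbone operates on sparse voxels derived from the point cloud, using a multi-scale grid whose voxel size doubles at each coarser level. The following is a minimal NumPy sketch of that multi-scale voxelization under stated assumptions (a 2 cm finest voxel, five levels, and a simplified "first point per voxel" representative rule); the function name and details are ours, not the paper's implementation.

```python
import numpy as np

def voxelize_multiscale(points, base_size=0.02, num_levels=5):
    """Assign each point to an integer voxel at each pyramid level.

    points: (N, 3) xyz coordinates in metres. The finest level uses a
    2 cm voxel; each coarser level doubles the size. Representative
    selection is simplified to "first point index per voxel" here.
    Returns a list of (voxel_keys, representative_indices) per level.
    """
    levels = []
    for level in range(num_levels):
        size = base_size * (2 ** level)
        keys = np.floor(points / size).astype(np.int64)  # integer voxel coords
        # one representative point index per occupied voxel
        _, rep_idx = np.unique(keys, axis=0, return_index=True)
        levels.append((keys, rep_idx))
    return levels
```

Because the grids are nested (each coarse voxel is a union of eight finer ones), the number of occupied voxels can only shrink as the level gets coarser.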


Introduces Swin3D: a 3D Swin Transformer backbone (local window self-attention over non-overlapping windows) for efficient self-attention on sparse voxels, with contextual relative positional embeddings for point signals; pretrained on Structured3D (an order of magnitude larger than ScanNet) for downstream segmentation and detection on 3D point datasets. Existing 3D pretraining approaches include contrastive learning like PointContrast and DepthContrast, masked signal modeling in the spirit of BEiT and MAE, and recent work transferring image and CLIP backbones to ScanNet and ShapeNet data. Direct application of the Swin Transformer has two issues: quadratic memory complexity and signal irregularity due to the varying number and placement of voxels within a window.

The architecture comprises voxelization (discretizing the point cloud into multi-scale sparse voxels), initial feature embedding (at the finest voxels), Swin3D blocks (for self-attention), and downsampling, operating over multiple scales/stages. The point cloud is a set of 6D points (3D position plus RGB color); voxelization creates five levels (the finest uses a 2 cm voxel size, and each coarser level doubles it). A representative point is randomly selected for each finest-level voxel; coarser voxels take the representative closest to the voxel center. Initial feature embedding applies sparse convolution, BN, and ReLU at the finest level, concatenated with the positional offset.

The Swin3D block splits voxels into non-overlapping windows (alternating with half-window shifts). Memory-efficient self-attention moves the value projection into the numerator and unwraps the softmax of vanilla self-attention so that the full attention matrix need not be stored, lowering memory complexity. Contextual relative signal encoding (cRSE, an extension of cRPE to signals such as RGB) extends attention to include differences between voxel signals: trainable functions map the signal difference into key and query space for each head, with another projected term added on the value side when aggregating; lookup tables and index functions provide a speed-up. Downsampling applies LayerNorm and an FC layer to max-pooled kNN voxel features.
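The memory-efficient attention idea above can be illustrated with a row-streaming rewrite of vanilla softmax attention: by folding the value aggregation into the softmax numerator and keeping a running denominator per query, the full N x N attention matrix is never materialized. This is a minimal NumPy sketch of that general technique (without the cRSE terms or windowing), not the paper's CUDA implementation.

```python
import numpy as np

def vanilla_attention(Q, K, V):
    """Standard softmax attention; materializes the (N, N) matrix."""
    d = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d)
    A = np.exp(A - A.max(axis=1, keepdims=True))   # row-wise stability
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def streaming_attention(Q, K, V):
    """Same result, but processes one query row at a time: only O(N*d)
    memory is live, since values enter the softmax numerator directly
    and the denominator is accumulated alongside."""
    d = Q.shape[-1]
    out = np.empty_like(V)
    for i in range(Q.shape[0]):
        logits = K @ Q[i] / np.sqrt(d)   # (N,) scores for query i
        logits -= logits.max()           # numerical stability
        w = np.exp(logits)
        out[i] = (w @ V) / w.sum()       # numerator / denominator
    return out
```

The two functions agree numerically; the streaming form trades the quadratic buffer for a loop, which is the trade-off that makes large sparse-voxel windows feasible.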
Comparisons with the Stratified Transformer (with revisions) and ablations on width, depth, head count, and window size lead to the proposed Swin3D-S and Swin3D-L networks, which achieve lower forward-backward iteration (inference and training) time with a smaller memory footprint. 3D semantic segmentation (a per-voxel classification/cross-entropy loss) serves as the pre-training task; task-specific decoders handle downstream semantic segmentation (with Swin3D features in the decoder) and 3D object detection (replacing the encoders in FCAF3D and CAGroup3D with Swin3D). Evaluated on ScanNet and S3DIS segmentation, it is best in both the supervised and pre-training settings, with category-wise IoUs and 6-fold results reported; it also achieves better mAP in 3D detection. Ablations cover model scalability, training-data scale, cRSE (with signal context including color and normals), and the backbone choice (also tried with PointTransformerV2). The appendix describes implementation improvements: CUDA kernel scheduling, half-precision support, computation speedups, and atomic operations. From Tsinghua University, Microsoft Research, and Peking University.
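The pre-training objective is ordinary per-voxel classification. As a hedged NumPy sketch of that loss (our own helper, with an assumed `ignore_index` convention for unlabeled voxels, not the paper's training code):

```python
import numpy as np

def voxel_ce_loss(logits, labels, ignore_index=-1):
    """Per-voxel cross-entropy over semantic classes.

    logits: (N, C) raw class scores per voxel; labels: (N,) class ids.
    Voxels labeled with ignore_index (e.g. unannotated) are skipped,
    a common convention in segmentation pipelines (assumed here).
    """
    mask = labels != ignore_index
    z = logits[mask]
    y = labels[mask]
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```

With uniform logits over C classes the loss is log C, a useful sanity check that a segmentation head is wired up correctly before fine-tuning.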

Links: website, prior art (Stratified Transformer), PapersWithCode, GitHub
