
4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks

Abstract

In many robotics and VR/AR applications, 3D-videos are readily-available sources of input (a continuous sequence of depth images, or LIDAR scans). However, those 3D-videos are processed frame-by-frame either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors and propose the generalized sparse convolution that encompasses all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and proposed 4D datasets for 3D-video perception. To overcome challenges in the 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and the trilateral-stationary conditional random field that enforces spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that convolutional neural networks with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. Also, we show that on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise, outperform 3D convolutional neural networks and are faster than the 3D counterpart in some cases.
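The generalized sparse convolution mentioned above operates only on a set of active coordinates and allows the kernel to be an arbitrary set of integer offsets (which is what makes special cases like the hybrid kernel possible). The following is a minimal NumPy sketch of that idea, not the library's actual implementation; the function name and signature are illustrative.

```python
import numpy as np

def generalized_sparse_conv(coords, feats, offsets, weights):
    """Toy dense-output sketch of a generalized sparse convolution.

    coords:  (N, D) int array of active coordinates
    feats:   (N, C_in) features stored only at active coordinates
    offsets: list of K D-tuples defining the kernel shape (any subset of Z^D)
    weights: (K, C_in, C_out) one weight matrix per kernel offset
    """
    # Hash map from coordinate to row index, as in sparse-tensor libraries.
    index = {tuple(c): i for i, c in enumerate(coords)}
    out = np.zeros((len(coords), weights.shape[2]))
    for k, off in enumerate(offsets):
        for i, c in enumerate(coords):
            j = index.get(tuple(np.add(c, off)))
            if j is not None:  # accumulate only where an input coordinate exists
                out[i] += feats[j] @ weights[k]
    return out

# A 2D "cross" kernel: center plus axis-aligned neighbors. A hybrid kernel
# in 4D would combine, e.g., a cubic kernel in space with a cross in time.
cross_offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
```

Because the kernel is just a list of offsets, changing its shape (cube, cross, or hybrid) requires no change to the convolution routine itself.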

Introduction

We implement MinkUNet with the TorchSparse, Minkowski Engine, and Spconv backends, and provide results and checkpoints on the SemanticKITTI dataset.
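Training and evaluation follow MMDetection3D's standard entry points. The config filename below is illustrative; check `configs/minkunet/` in your MMDetection3D checkout for the exact names and backend variants.

```shell
# Train MinkUNet on SemanticKITTI (config name is an assumption; pick the
# torchsparse / minkowski / spconv variant you have installed).
python tools/train.py configs/minkunet/minkunet34_w32_torchsparse_8xb2-amp-3x_semantickitti.py

# Evaluate a downloaded checkpoint with the same config.
python tools/test.py configs/minkunet/minkunet34_w32_torchsparse_8xb2-amp-3x_semantickitti.py \
    path/to/checkpoint.pth
```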

Results and models

SemanticKITTI

| Method | Backend | Lr schd | Amp | Laser-Polar Mix | Mem (GB) | Training Time (hours) | FPS | mIoU | Download |
| :----: | :-----: | :-----: | :-: | :-------------: | :------: | :-------------------: | :-: | :--: | :------: |
| MinkUNet18-W16 | torchsparse | 15e | ✔ | ✗ | 3.4 | - | - | 60.3 | model \| log |
| MinkUNet18-W20 | torchsparse | 15e | ✔ | ✗ | 3.7 | - | - | 61.6 | model \| log |
| MinkUNet18-W32 | torchsparse | 15e | ✔ | ✗ | 4.9 | - | - | 63.1 | model \| log |
| MinkUNet34-W32 | minkowski engine | 3x | ✗ | ✔ | 11.5 | 6.5 | 12.2 | 69.2 | model \| log |
| MinkUNet34-W32 | spconv | 3x | ✔ | ✔ | 6.7 | 2 | 14.6* | 68.3 | model \| log |
| MinkUNet34-W32 | spconv | 3x | ✗ | ✔ | 10.5 | 6 | 14.5 | 69.3 | model \| log |
| MinkUNet34-W32 | torchsparse | 3x | ✔ | ✔ | 6.6 | 3 | 12.8 | 69.3 | model \| log |
| MinkUNet34-W32 | torchsparse | 3x | ✗ | ✔ | 11.8 | 5.5 | 15.9 | 68.7 | model \| log |
| MinkUNet34v2-W32 | torchsparse | 3x | ✔ | ✔ | 8.9 | - | - | 70.3 | model \| log |

Note: We follow the implementation in the original SPVNAS repo; W16/W20/W32 indicate different numbers of channels.

Note: With the TorchSparse backend, model performance is unstable and may fluctuate by about 1.5 mIoU across different random seeds.

Note: Following PCSeg, MinkUNet34v2 is a modified version of MinkUNet34.

Note*: Training time and FPS are measured on an NVIDIA A100. The versions of TorchSparse, Minkowski Engine, and Spconv are 0.5.4, 1.4.0, and 2.3.6, respectively. Since spconv 2.3.6 has a bug with fp16 at inference time, the FPS marked with * was measured using fp32.

Citation

@inproceedings{choy20194d,
  title={4d spatio-temporal convnets: Minkowski convolutional neural networks},
  author={Choy, Christopher and Gwak, JunYoung and Savarese, Silvio},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={3075--3084},
  year={2019}
}