arxiv:1812.05784

PointPillars: Fast Encoders for Object Detection from Point Clouds

Published on Dec 14, 2018
Authors: Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, Oscar Beijbom

Abstract

Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders; fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2 - 4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.

Community

  • Introduces PointPillars: a point cloud encoder for downstream object detection; uses PointNets to learn representations of point clouds organized in vertical columns (pillars); the full detection pipeline (LiDAR only) outperforms the state of the art, including LiDAR+image fusion methods, on the KITTI 3D and BEV (bird's eye view) benchmarks (higher AP/mAP at higher FPS); no hand-tuning is needed for different point cloud configurations (e.g. LiDAR or RADAR scans). As in image object detection, a single-stage detector with focal loss works well here. By comparison, VoxelNet applies PointNet-style features on voxels and then uses 3D convolutions followed by a 2D backbone and detection head. PointPillars takes a point cloud and returns oriented 3D bounding boxes (for cars, pedestrians, and cyclists) through three stages: a Pillar Feature Net (encodes the point cloud into a sparse pseudo-image), a 2D CNN backbone (learns a high-level representation), and a detection head (classifies and regresses 3D bounding boxes); a sketch composing these three stages appears at the end of the code sketches after this list.
  • Pillar Feature Net: take the points (x, y, z, and reflectance r) of the point cloud and discretize the XY plane into a grid (each cell is a pillar); decorate each point with its x, y, z offset from the arithmetic mean of the points in its pillar and its x, y offset from the pillar center, giving D = 9 dimensions per point; exploit sparsity to build a dense tensor of shape (D, P, N), where P is the (capped) number of non-empty pillars and N is the number of points per pillar (zero-pad pillars with fewer points, randomly subsample pillars with too many); apply a linear layer (a 1x1 conv with BN and ReLU) to get shape (C, P, N) and max-pool over points to (C, P); scatter the pillar features back to a (C, H, W) pseudo-image using each pillar's known location on the XY grid/canvas (a PyTorch-style sketch follows this list).
  • Uses the same (2D) CNN backbone as VoxelNet: a top-down network that progressively reduces resolution through conv blocks (parameterized by stride, number of convs, and output channels), plus an upsampling branch per stage that restores resolution to half the input (H/2, W/2); the upsampled maps are concatenated (loosely UNet-like), giving an output of shape (6C, H/2, W/2) (see the backbone sketch after this list).
  • Uses an SSD (Single Shot Detector) detection head (small conv predictors per anchor); anchors are matched to ground truth using 2D IoU in the XY (bird's eye view) plane only, while box height and Z elevation are not used for matching and are instead pure regression targets (head sketch after this list).
  • Same loss as SECOND: localization residuals for x, y, z are deviations scaled by the anchor diagonal (x, y) or anchor height (z), residuals for w, l, h are log size ratios, and the angle residual is the sine of the difference; all are trained with a smooth L1 loss; a softmax direction classifier resolves the heading ambiguity (flipped boxes) that the sine residual cannot; focal loss is used for object classification (loss sketch after this list). Trained on KITTI LiDAR with one network for cars and another for pedestrians and cyclists (different hyperparameters for matching/mining positives and negatives); the anchor and matching strategy follows VoxelNet. Data augmentation includes flipping along the X axis plus global rotation and scaling.
  • Best results on the KITTI BEV benchmark for cyclists and across all three classes on the KITTI test 3D benchmark. SubCNN is the best image-only method (for pedestrian detection); F-PointNet, AVOD-FPN, and ConFuse are strong LiDAR+image methods; SECOND and VoxelNet follow (PointPillars has the highest speed with near-SOTA results overall). Uses TensorRT for compiled kernels (a large speedup). Ablations cover spatial resolution, point decorations (offset augmentations), and encoder choice. From nuTonomy (an APTIV company).
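
A minimal PyTorch-style sketch of the Pillar Feature Net summarized above, assuming D = 9 decorated input dimensions and C = 64 output channels; the layer hyperparameters, tensor layout, and the scatter_to_pseudo_image helper are illustrative assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn


class PillarFeatureNet(nn.Module):
    """Per-point linear layer (as a 1x1 conv) + BatchNorm + ReLU, then max-pool over points."""

    def __init__(self, in_dim: int = 9, out_dim: int = 64):
        super().__init__()
        self.linear = nn.Conv1d(in_dim, out_dim, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, pillars: torch.Tensor) -> torch.Tensor:
        # pillars: (P, D, N) -- P non-empty pillars, D = 9 decorated dims, N points per pillar.
        # The paper writes the stacked tensor as (D, P, N); here P is treated as the batch dim.
        x = torch.relu(self.bn(self.linear(pillars)))  # (P, C, N)
        return x.max(dim=2).values                     # (P, C): max over the points in each pillar


def scatter_to_pseudo_image(features: torch.Tensor, coords: torch.Tensor, H: int, W: int) -> torch.Tensor:
    # features: (P, C) pooled pillar features; coords: (P, 2) long tensor of (row, col) grid indices.
    canvas = features.new_zeros(features.shape[1], H, W)
    canvas[:, coords[:, 0], coords[:, 1]] = features.t()
    return canvas  # (C, H, W) sparse pseudo-image
```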
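A sketch of the top-down backbone with per-stage upsampling branches, assuming the car-network configuration (blocks of 4/6/6 convs, C = 64, stride-2 stages); the exact strides, channel counts, and transposed-conv choices are assumptions for illustration.

```python
import torch
import torch.nn as nn


def conv_block(cin: int, cout: int, n_convs: int, stride: int) -> nn.Sequential:
    """Top-down block: the first conv downsamples by `stride`, the rest keep resolution."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3,
                             stride=stride if i == 0 else 1, padding=1, bias=False),
                   nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)


class Backbone(nn.Module):
    """Three top-down stages, each upsampled back to half the input resolution and concatenated."""

    def __init__(self, C: int = 64):
        super().__init__()
        self.down1 = conv_block(C,     C,     4, stride=2)   # (C,  H/2, W/2)
        self.down2 = conv_block(C,     2 * C, 6, stride=2)   # (2C, H/4, W/4)
        self.down3 = conv_block(2 * C, 4 * C, 6, stride=2)   # (4C, H/8, W/8)
        self.up1 = nn.Sequential(nn.ConvTranspose2d(C, 2 * C, 1, stride=1),
                                 nn.BatchNorm2d(2 * C), nn.ReLU(inplace=True))
        self.up2 = nn.Sequential(nn.ConvTranspose2d(2 * C, 2 * C, 2, stride=2),
                                 nn.BatchNorm2d(2 * C), nn.ReLU(inplace=True))
        self.up3 = nn.Sequential(nn.ConvTranspose2d(4 * C, 2 * C, 4, stride=4),
                                 nn.BatchNorm2d(2 * C), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) pseudo-image
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        return torch.cat([self.up1(d1), self.up2(d2), self.up3(d3)], dim=1)  # (B, 6C, H/2, W/2)
```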
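A sketch of the SSD-style head as 1x1 conv predictors per anchor; the number of anchors per location, the 7-dim box residual layout, and the 2-way direction output follow the summary above and are assumptions, not the exact reference layers.

```python
import torch
import torch.nn as nn


class SSDHead(nn.Module):
    """Per-anchor class scores, 7 box residuals (x, y, z, w, l, h, theta), and a direction logit pair."""

    def __init__(self, in_ch: int, num_anchors: int, num_classes: int):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_anchors * num_classes, 1)
        self.box = nn.Conv2d(in_ch, num_anchors * 7, 1)
        self.dir = nn.Conv2d(in_ch, num_anchors * 2, 1)

    def forward(self, feats: torch.Tensor):
        # feats: (B, 6C, H/2, W/2) from the backbone
        return self.cls(feats), self.box(feats), self.dir(feats)
```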
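A sketch of the SECOND-style regression targets and the losses named above (smooth L1 for boxes, focal loss for classification); the smooth-L1 beta and the focal-loss settings (alpha = 0.25, gamma = 2) are assumed defaults, and positive/negative anchor selection is omitted.

```python
import torch


def encode_box_residuals(gt: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Boxes are (..., 7): x, y, z, w, l, h, theta. Centre offsets are scaled by the anchor
    diagonal (x, y) or anchor height (z); sizes use log ratios; angle uses sine of the difference."""
    xa, ya, za, wa, la, ha, ta = anchors.unbind(-1)
    xg, yg, zg, wg, lg, hg, tg = gt.unbind(-1)
    diag = torch.sqrt(wa ** 2 + la ** 2)
    return torch.stack([
        (xg - xa) / diag,
        (yg - ya) / diag,
        (zg - za) / ha,
        torch.log(wg / wa),
        torch.log(lg / la),
        torch.log(hg / ha),
        torch.sin(tg - ta),  # heading ambiguity is resolved by the separate direction classifier
    ], dim=-1)


def smooth_l1(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0 / 9.0) -> torch.Tensor:
    diff = (pred - target).abs()
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)


def focal_loss(logits: torch.Tensor, labels: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss on per-anchor classification logits (labels are 0/1)."""
    p = torch.sigmoid(logits)
    pt = torch.where(labels > 0, p, 1 - p)
    a = torch.where(labels > 0, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return -a * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))
```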
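Finally, a sketch tying the three stages together, reusing the hypothetical modules defined in the sketches above (PillarFeatureNet, scatter_to_pseudo_image, Backbone, SSDHead); batching of pillars and target assignment are omitted.

```python
import torch.nn as nn


class PointPillars(nn.Module):
    """Pillar Feature Net -> pseudo-image -> 2D CNN backbone -> SSD head."""

    def __init__(self, C: int = 64, num_anchors: int = 2, num_classes: int = 1):
        super().__init__()
        self.pfn = PillarFeatureNet(in_dim=9, out_dim=C)
        self.backbone = Backbone(C)
        self.head = SSDHead(6 * C, num_anchors, num_classes)

    def forward(self, pillars, coords, H, W):
        feats = self.pfn(pillars)                              # (P, C)
        canvas = scatter_to_pseudo_image(feats, coords, H, W)  # (C, H, W)
        return self.head(self.backbone(canvas.unsqueeze(0)))   # per-anchor cls, box, dir maps
```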

Links: PapersWithCode, GitHub
