arxiv:2212.06858

LidarCLIP or: How I Learned to Talk to Point Clouds

Published on Dec 13, 2022
Abstract

Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. Code and pre-trained models are available at https://github.com/atonderski/lidarclip.
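The supervision scheme described in the abstract can be sketched in a few lines of PyTorch: a frozen CLIP image encoder provides the target embedding for each image-lidar pair, and a lidar encoder is trained to match it. This is a minimal illustration under that assumption, not the repository's actual code; `ToyLidarEncoder` is a hypothetical stand-in for the real point cloud encoder.

```python
# Minimal sketch of the LidarCLIP supervision idea: distill frozen CLIP image
# embeddings into a lidar encoder using paired image-lidar data. ToyLidarEncoder
# is a hypothetical stand-in for the actual point cloud encoder.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()  # the CLIP image encoder stays frozen throughout training


class ToyLidarEncoder(torch.nn.Module):
    """Stand-in for the paper's point cloud encoder; maps (B, N, 4) points to CLIP's 512-d space."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(4, 256), torch.nn.ReLU(), torch.nn.Linear(256, embed_dim)
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 4) with x, y, z, intensity
        return self.mlp(points).mean(dim=1)  # crude mean pooling over points


lidar_encoder = ToyLidarEncoder().to(device)
optimizer = torch.optim.AdamW(lidar_encoder.parameters(), lr=1e-4)


def train_step(images: torch.Tensor, point_clouds: torch.Tensor) -> float:
    """One distillation step: pull the lidar embedding toward the frozen CLIP image embedding."""
    with torch.no_grad():
        target = clip_model.encode_image(images).float()  # teacher embedding, (B, 512)
    pred = lidar_encoder(point_clouds)                    # student embedding, (B, 512)
    loss = F.mse_loss(pred, target)                       # MSE preferred over cosine in the paper's ablations
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```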

Community

Proposes LidarCLIP: a mapping from automotive LiDAR point clouds into a pre-existing CLIP embedding space (text-image alignment), indirectly aligning text and LiDAR with images as the intermediary. The LiDAR encoder is supervised by the frozen CLIP image encoder on the ONCE automotive dataset: camera and LiDAR are calibrated, LiDAR points not visible in the camera are dropped, the image is passed through the frozen image encoder, and the LiDAR encoder (SST, a single-stride sparse transformer) embeds the point cloud. An MSE or cosine-similarity loss aligns the two embeddings (MSE is preferred in ablations); retrieval is based on cosine similarity.

Similarity scores from different modalities can be combined by summing, and latent embeddings of different modalities can likewise be fused by addition (or a normalized mean). Zero-shot classification pools voxel features and compares them, via cosine similarity, to custom prompts from the CLIP text encoder. Joint image + LiDAR retrieval outperforms either single modality alone, and retrieval can be chained across modalities (e.g., retrieve challenging scenes such as fog from images, then do object-based retrieval in LiDAR), which is valuable for mining edge/challenging cases.

Also tests generation through ClipCap (point cloud captioning) and CLIP-guided Stable Diffusion (lidar-to-image). The appendix has model details and more retrieval and generation results. From Zenseact, Chalmers University of Technology, and Lund University.
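As a rough illustration of the retrieval and score-fusion steps described above, the sketch below ranks scenes against a text query by cosine similarity and sums the image and lidar similarities. It assumes precomputed CLIP-space embeddings and uses illustrative names, not the repository's API.

```python
# Retrieval sketch: rank scenes by cosine similarity between a CLIP text embedding
# and precomputed image/lidar embeddings, fusing the two scores by summing.
# `image_emb` and `lidar_emb` are assumed to be (num_scenes, 512) tensors.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)


def retrieve(query: str, image_emb: torch.Tensor, lidar_emb: torch.Tensor, top_k: int = 5):
    """Return indices of the top-k scenes for a text query, using summed image + lidar similarity."""
    with torch.no_grad():
        text_emb = clip_model.encode_text(clip.tokenize([query]).to(device)).float()
    text_emb = F.normalize(text_emb, dim=-1)                  # (1, 512)
    sim_img = F.normalize(image_emb, dim=-1) @ text_emb.T     # (num_scenes, 1)
    sim_lidar = F.normalize(lidar_emb, dim=-1) @ text_emb.T   # (num_scenes, 1)
    joint = (sim_img + sim_lidar).squeeze(-1)                 # score-level fusion by summing
    return joint.topk(top_k).indices


# Example: search for a challenging scene using both modalities at once.
# indices = retrieve("a foggy road with a truck", image_emb, lidar_emb, top_k=10)
```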
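The zero-shot classification step (comparing lidar features against prompted text embeddings) can be sketched in the same spirit; the class list and prompt template below are examples chosen for illustration, not the ones used in the paper.

```python
# Zero-shot classification sketch: score lidar embeddings against text prompts built
# from a template. Classes and the prompt template are illustrative, not the paper's.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

classes = ["car", "pedestrian", "cyclist", "truck"]
with torch.no_grad():
    prompts = clip.tokenize([f"a point cloud of a {c}" for c in classes]).to(device)
    class_emb = F.normalize(clip_model.encode_text(prompts).float(), dim=-1)  # (num_classes, 512)


def classify(lidar_emb: torch.Tensor) -> torch.Tensor:
    """Assign each lidar embedding (num_samples, 512) to its most similar class prompt."""
    sims = F.normalize(lidar_emb, dim=-1) @ class_emb.T  # (num_samples, num_classes)
    return sims.argmax(dim=-1)
```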

Links: GitHub
