arxiv:2309.00615

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

Published on Sep 1, 2023 · Submitted by akhaliq on Sep 4, 2023

#2 Paper of the day


Authors: Ziyu Guo, Renrui Zhang, Xianzheng Ma, Hongsheng Li

Abstract

We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and multi-modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions. By parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, which requires no 3D instruction data, but exhibits superior 3D and multi-modal question-answering capacity. We hope our work may cast a light on the community for extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.


Community

TheProjectsGuy

Sep 6, 2023

Introduces Point-Bind, which builds a joint embedding space between 3D point clouds and other modalities (image, language, audio), guided by ImageBind: a 3D encoder is trained to match the frozen ImageBind multi-modal encoders using paired 3D-image-text-audio data, similar to how ULIP aligns 3D with CLIP. Also proposes Point-LLM, which injects Point-Bind semantics into LLMs by bridging Point-Bind with LLaMA and applying parameter-efficient fine-tuning. ImageBind itself aligns multiple modalities using image-centric data.

Data creation: 3D-image-text pairs come from ULIP (text) on ShapeNet shapes (images are multi-view renderings), and 3D-audio pairs from ESC-50 and ShapeNet, giving a dataset of 3D point clouds, images, text, and audio. I2P-MAE serves as the learnable point cloud encoder, followed by a projection network into the shared ImageBind space; the text, image, and audio embeddings (average pooled) come from the frozen ImageBind backbone, and training uses a pair-wise contrastive loss.

Applications: any-to-3D generation by connecting a text-to-3D decoder (like CLIP-Forge); 3D embedding-space arithmetic borrowed from ImageBind; zero-shot recognition and cross-modal retrieval heads for downstream tasks.

Point-LLM: like ImageBind-LLM, only image and text are aligned during tuning (visual-language understanding from the dataset), since the other modalities are already aligned. A bind network is trained to bridge the ImageBind image encoder to LLaMA's language space, with zero-initialized gating, and its output is added to the word tokens; following LLaMA-Adapter, only the gating factors and bias-norm weights are trained. Multi-modal encodings (including 3D) are then added, retrieved against a visual cache model, transformed by the bind network, and injected into LLaMA.

Results: better performance on 3D cross-modal retrieval (compared to PointCLIP-v2 and ULIP) and 3D zero-shot classification (even the Point-BERT encoder does decently); ablations cover the projection network (2-layer) and the 3D encoder (PointNeXt < Point-BERT < I2P-MAE). From CUHK, Shanghai AI Lab, and Huazhong UST.
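To make the pair-wise contrastive loss in the summary above more concrete, here is a minimal sketch in plain Python of an InfoNCE-style objective between 3D embeddings and frozen target embeddings (for example, ImageBind image or text features). This is not the authors' code: the function names, the temperature of 0.07, and the symmetric averaging of both directions are assumptions made purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(query_embs, key_embs, temperature=0.07):
    """One-directional InfoNCE loss.

    Row i of query_embs is treated as the positive match for row i of
    key_embs; every other row of key_embs acts as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    queries = [l2_normalize(v) for v in query_embs]
    keys = [l2_normalize(v) for v in key_embs]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity of this query against every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching index i as the target.
        total += log_denominator - logits[i]
    return total / len(queries)


def symmetric_contrastive_loss(point_embs, target_embs, temperature=0.07):
    """Average of the 3D-to-target and target-to-3D losses (an assumed, CLIP-style choice)."""
    forward = contrastive_loss(point_embs, target_embs, temperature)
    backward = contrastive_loss(target_embs, point_embs, temperature)
    return 0.5 * (forward + backward)
```

In this sketch, `point_embs` would stand for the outputs of the trainable 3D encoder plus projection network, and `target_embs` for the frozen embeddings of the paired modality. Correctly paired batches yield a loss near zero, while mismatched pairs yield a large loss, which is what pushes the 3D encoder toward the shared space.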