# Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Official model weights for the Locate-3D models and the 3D-JEPA encoders.
## Locate 3D
Locate 3D is a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." Locate 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, Locate 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.
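As a rough illustration of what querying the model over a posed RGB-D stream might look like (a minimal sketch only: the `locate3d` package, the `Locate3D` class, and the `localize` method are hypothetical names; the actual loading and inference API is defined in the GitHub repository):

```python
import torch

# Hypothetical interface -- illustrative only; see the GitHub repository
# for the real loading and inference code.
from locate3d import Locate3D

model = Locate3D.from_pretrained("locate-3d")  # or "locate-3d-plus"
model.eval()

# Posed RGB-D stream: per-frame color, metric depth, intrinsics, and poses.
rgb = torch.rand(10, 3, 480, 640)          # 10 RGB frames
depth = torch.rand(10, 480, 640)           # depth maps in meters
intrinsics = torch.eye(3).expand(10, 3, 3) # camera intrinsics
poses = torch.eye(4).expand(10, 4, 4)      # camera-to-world transforms

with torch.no_grad():
    result = model.localize(
        rgb, depth, intrinsics, poses,
        query="the small coffee table between the sofa and the lamp",
    )

print(result.bbox)  # predicted 3D bounding box
print(result.mask)  # predicted 3D instance mask over the scene point cloud
```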
## 3D-JEPA
3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds, is key to Locate 3D. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes.
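The core pretext task can be sketched as follows (a minimal illustration, assuming a generic point-cloud `encoder`, an EMA-updated `target_encoder`, and a `predictor` network; these names are ours, not the official code's):

```python
import torch
import torch.nn.functional as F

def jepa_step(encoder, target_encoder, predictor, points, feats, mask):
    """One masked-prediction step in latent space (illustrative sketch).

    points: (N, 3) point coordinates
    feats:  (N, D) lifted 2D foundation-model features (e.g. CLIP/DINO)
    mask:   (N,) boolean, True for masked points
    """
    # Target latents come from an EMA copy of the encoder over all points;
    # no gradients flow through the target branch.
    with torch.no_grad():
        target = target_encoder(points, feats)               # (N, C)

    # The context encoder only sees the unmasked points.
    context = encoder(points[~mask], feats[~mask])           # (N_vis, C)

    # The predictor infers latents at the masked locations from the context.
    pred = predictor(context, points[~mask], points[mask])   # (N_mask, C)

    # Regress predicted latents toward the target latents (loss in latent
    # space, not pixel/point space).
    return F.smooth_l1_loss(pred, target[mask])
```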
## Models
- Locate-3D: Locate-3D model trained on public referential grounding datasets
- Locate-3D+: Locate-3D model trained on public referential grounding datasets and the newly released Locate 3D Dataset
- 3D-JEPA: Pre-trained SSL encoder for 3D understanding
## How to Use
For detailed instructions on how to load the encoder and integrate it into your downstream task, please refer to our GitHub repository.
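For example, the checkpoint files can be fetched with the `huggingface_hub` library (the repository id and filename below are assumptions; substitute the values shown on this model page and in the GitHub README):

```python
from huggingface_hub import hf_hub_download

# repo_id and filename are assumptions -- check the model page for the
# exact repository path and checkpoint names.
ckpt_path = hf_hub_download(
    repo_id="facebook/locate-3d",
    filename="checkpoint.pth",
)
print(ckpt_path)  # local path to the downloaded weights
```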
## License
The majority of Locate 3D is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: Pointcept is licensed under the MIT license.