# Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Official model weights for the Locate-3D models and the 3D-JEPA encoders.
## Locate 3D
Locate 3D is a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." Locate 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, Locate 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.
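As a rough illustration of what querying the model over a posed RGB-D stream might look like (a minimal sketch only: the `locate3d` package, the `Locate3D` class, and the `localize` method are hypothetical names; the actual loading and inference API is defined in the GitHub repository):

```python
import torch

# Hypothetical interface -- illustrative only; see the GitHub repository
# for the real loading and inference code.
from locate3d import Locate3D

model = Locate3D.from_pretrained("locate-3d")  # or "locate-3d-plus"
model.eval()

# Posed RGB-D stream: per-frame color, metric depth, intrinsics, and poses.
rgb = torch.rand(10, 3, 480, 640)          # 10 RGB frames
depth = torch.rand(10, 480, 640)           # depth maps in meters
intrinsics = torch.eye(3).expand(10, 3, 3) # camera intrinsics
poses = torch.eye(4).expand(10, 4, 4)      # camera-to-world transforms

with torch.no_grad():
    result = model.localize(
        rgb, depth, intrinsics, poses,
        query="the small coffee table between the sofa and the lamp",
    )

print(result.bbox)  # predicted 3D bounding box
print(result.mask)  # predicted 3D instance mask over the scene point cloud
```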
## 3D-JEPA
3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds, is key to Locate 3D. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes.
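The core pretext task can be sketched as follows (a minimal illustration, assuming a generic point-cloud `encoder`, an EMA-updated `target_encoder`, and a `predictor` network; these names are ours, not the official code's):

```python
import torch
import torch.nn.functional as F

def jepa_step(encoder, target_encoder, predictor, points, feats, mask):
    """One masked-prediction step in latent space (illustrative sketch).

    points: (N, 3) point coordinates
    feats:  (N, D) lifted 2D foundation-model features (e.g. CLIP/DINO)
    mask:   (N,) boolean, True for masked points
    """
    # Target latents come from an EMA copy of the encoder over all points;
    # no gradients flow through the target branch.
    with torch.no_grad():
        target = target_encoder(points, feats)               # (N, C)

    # The context encoder only sees the unmasked points.
    context = encoder(points[~mask], feats[~mask])           # (N_vis, C)

    # The predictor infers latents at the masked locations from the context.
    pred = predictor(context, points[~mask], points[mask])   # (N_mask, C)

    # Regress predicted latents toward the target latents (loss in latent
    # space, not pixel/point space).
    return F.smooth_l1_loss(pred, target[mask])
```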
## Models
- Locate-3D: Locate-3D model trained on public referential grounding datasets
- Locate-3D+: Locate-3D model trained on public referential grounding datasets and the newly released Locate 3D Dataset
- 3D-JEPA: Pre-trained SSL encoder for 3D understanding
## How to Use
For detailed instructions on how to load the encoder and integrate it into your downstream task, please refer to our GitHub repository.
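For example, the checkpoint files can be fetched with the `huggingface_hub` library (the repository id and filename below are assumptions; substitute the values shown on this model page and in the GitHub README):

```python
from huggingface_hub import hf_hub_download

# repo_id and filename are assumptions -- check the model page for the
# exact repository path and checkpoint names.
ckpt_path = hf_hub_download(
    repo_id="facebook/locate-3d",
    filename="checkpoint.pth",
)
print(ckpt_path)  # local path to the downloaded weights
```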
## License
The majority of Locate 3D is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: Pointcept is licensed under the MIT license.