File size: 5,766 Bytes
6901fb9 0fa8c03 6901fb9 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 |
---
license: mit
---
# BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence
<div align="center" class="authors">
<a href="https://scholar.google.com/citations?user=pfXQwcQAAAAJ&hl=en" target="_blank">Xuewu Lin</a>,
<a href="https://wzmsltw.github.io/" target="_blank">Tianwei Lin</a>,
<a href="https://scholar.google.com/citations?user=F2e_jZMAAAAJ&hl=en" target="_blank">Lichao Huang</a>,
<a href="https://openreview.net/profile?id=~HONGYU_XIE2" target="_blank">Hongyu Xie</a>,
<a href="https://scholar.google.com/citations?user=HQfc8TEAAAAJ&hl=en" target="_blank">Zhizhong Su</a>
</div>
<div align="center" style="line-height: 3;">
<a href="https://github.com/HorizonRobotics/BIP3D" target="_blank" style="margin: 2px;">
<img alt="Code" src="https://img.shields.io/badge/Code-Github-bule" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://linxuewu.github.io/BIP3D-page/" target="_blank" style="margin: 2px;">
<img alt="Homepage" src="https://img.shields.io/badge/Homepage-BIP3D-green" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/xuewulin/BIP3D" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/Models-Hugging%20Face-yellow" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://arxiv.org/abs/2411.14869" target="_blank" style="margin: 2px;">
<img alt="Paper" src="https://img.shields.io/badge/Paper-Arxiv-red" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center">
<img src="https://github.com/HorizonRobotics/BIP3D/raw/main/resources/bip3d_structure.png" width="90%" alt="BIP3D" />
<p style="font-size:0.8em; color:#555;">The Architecture Diagram of BIP3D, where the red stars indicate the parts that have been modified or added compared to the base model, GroundingDINO, and dashed lines indicate optional elements.</p>
</div>
## Results on EmbodiedScan Benchmark
We made several improvements based on the original paper, achieving better 3D perception results. The main improvements include the following two points:
1. **New Fusion Operation**: We enhanced the decoder by replacing the deformable aggregation (DAG) with a 3D deformable attention mechanism (DAT). Specifically, we improved the feature sampling process by transitioning from bilinear interpolation to trilinear interpolation, which leverages depth distribution for more accurate feature extraction.
2. **Mixed Data Training**: To optimize the grounding model's performance, we adopted a mixed-data training strategy by integrating detection data with grounding data during the grounding finetuning process.
### 1. Results on Multi-view 3D Detection Validation Dataset
|Model | Inputs | Op | Overall | Head | Common | Tail | Small | Medium | Large | ScanNet | 3RScan | MP3D | ckpt | log |
| :----: | :---: | :---: | :---: |:---: | :---: | :---: | :---:| :---:|:---:|:---: | :---: | :----: | :----: | :---: |
|BIP3D | RGB | DAG | 16.57|23.29|13.84|12.29|2.67|17.85|12.89|19.71|26.76|8.50 | - | - |
|BIP3D | RGB | DAT | 16.67|22.41|14.19|13.18|3.32|17.25|14.89|20.80|24.18|9.91 | - | - |
|BIP3D |RGB-D | DAG | 22.53|28.89|20.51|17.83|6.95|24.21|15.46|24.77|35.29|10.34 | - | - |
|BIP3D |RGB-D | DAT | 23.24|31.51|20.20|17.62|7.31|24.09|15.82|26.35|36.29|11.44 | - | - |
### 2. Results on Multi-view 3D Grounding Mini Dataset
|Model | Inputs | Op | Overall | Easy | Hard | View-dep | View-indep | ScanNet | 3RScan | MP3D | ckpt | log |
| :----: | :---: | :---: | :---: | :---: | :---:| :---:|:---:|:---: | :---: | :----: |:---: | :----: |
|BIP3D | RGB | DAG | 44.00|44.39|39.56|46.05|42.92|48.62|42.47|36.40 | - | - |
|BIP3D | RGB | DAT | 44.43|44.74|41.02|45.17|44.04|49.70|41.81|37.28 | - | - |
|BIP3D | RGB-D | DAG | 45.79|46.22|40.91|45.93|45.71|48.94|46.61|37.36 | - | - |
|BIP3D | RGB-D | DAT | 58.47|59.02|52.23|60.20|57.56|66.63|54.79|46.72 | - | - |
### 3. Results on Multi-view 3D Grounding Validation Dataset
|Model | Inputs | Op | Mixed Data | Overall | Easy | Hard | View-dep | View-indep | ScanNet | 3RScan | MP3D | ckpt | log |
| :----: | :---: | :---: | :---: |:---: | :---: | :---:| :---:|:---:|:---: | :---: | :----: |:---: | :----: |
|BIP3D | RGB | DAG |No| 45.81|46.21|41.34|47.07|45.09|50.40|47.53|32.97 | - | - |
|BIP3D | RGB | DAT |No| 47.29|47.82|41.42|48.58|46.56|52.74|47.85|34.60 | - | - |
|BIP3D | RGB-D | DAG |No| 53.75|53.87|52.43|55.21|52.93|60.05|54.92|38.20 | - | - |
|BIP3D | RGB-D | DAT |No|61.36|61.88|55.58|62.43|60.76|66.96|62.75|46.92 | - | - |
|BIP3D | RGB-D | DAT |Yes|66.58|66.99|62.07|67.95|65.81|72.43|68.26|51.14 | - | - |
### 4. [Results on Multi-view 3D Grounding Test Dataset](https://huggingface.co/spaces/AGC2024/visual-grounding-2024)
|Model | Overall | Easy | Hard | View-dep | View-indep | ckpt | log |
| :----: | :---: | :---: | :---: | :---: | :---:| :---:|:---:|
|[EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan) | 39.67 | 40.52 | 30.24 | 39.05 | 39.94 | - | - |
|[SAG3D*](https://opendrivelab.github.io/Challenge%202024/multiview_Mi-Robot.pdf) | 46.92 | 47.72 | 38.03 | 46.31 | 47.18 | - | - |
|[DenseG*](https://opendrivelab.github.io/Challenge%202024/multiview_THU-LenovoAI.pdf) | 59.59 | 60.39 | 50.81 | 60.50 | 59.20 | - | - |
|BIP3D | 67.38 | 68.12 | 59.08 | 67.88 | 67.16 | - | - |
|BIP3D-Base | 70.53 | 71.22 | 62.91 | 70.69 | 70.47 | - | - |
## Citation
```
@article{lin2024bip3d,
title={BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence},
author={Lin, Xuewu and Lin, Tianwei and Huang, Lichao and Xie, Hongyu and Su, Zhizhong},
journal={arXiv preprint arXiv:2411.14869},
year={2024}
}
```
|