---
license: mit
---

# BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

<div align="center" style="line-height: 1;">
  <a href="https://github.com/HorizonRobotics/BIP3D" target="_blank" style="margin: 2px;">
    <img alt="Code" src="https://img.shields.io/badge/Code-Github-blue" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://linxuewu.github.io/BIP3D-page/" target="_blank" style="margin: 2px;">
    <img alt="Homepage" src="https://img.shields.io/badge/Homepage-BIP3D-green" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/xuewulin/BIP3D" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/Models-Hugging%20Face-yellow" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://arxiv.org/abs/2411.14869" target="_blank" style="margin: 2px;">
    <img alt="Paper" src="https://img.shields.io/badge/Paper-Arxiv-red" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<div align="center">
  <img src="https://github.com/HorizonRobotics/BIP3D/raw/main/resources/bip3d_structure.png" width="90%" alt="BIP3D" />
  <p style="font-size:0.8em; color:#555;">The architecture of BIP3D. Red stars mark components modified or added relative to the base model, GroundingDINO; dashed lines indicate optional elements.</p>
</div>

## Results on EmbodiedScan Benchmark
We have made several improvements over the original paper, achieving better 3D perception results. The two main improvements are:
1. **New Fusion Operation**: We enhanced the decoder by replacing deformable aggregation (DAG) with a 3D deformable attention mechanism (DAT). Specifically, we changed the feature sampling from bilinear to trilinear interpolation, which leverages the depth distribution for more accurate feature extraction.
2. **Mixed Data Training**: To improve the grounding model's performance, we adopted a mixed-data training strategy, integrating detection data with grounding data during grounding fine-tuning.

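The trilinear sampling behind DAT can be illustrated with a minimal NumPy sketch. This is not the actual BIP3D implementation; the volume shape, function name, and normalized-coordinate convention below are illustrative assumptions. The idea is that 2D image features, lifted along the camera ray by a predicted depth distribution, form a (C, D, H, W) volume, and each 3D sampling point is interpolated over (depth, height, width) instead of only (height, width):

```python
import numpy as np

def trilinear_sample(volume, points):
    """Sample a (C, D, H, W) feature volume at normalized (d, y, x) points.

    `volume` stands in for 2D image features lifted along the camera ray
    by a depth distribution; trilinear interpolation extends the usual
    bilinear (y, x) lookup with a third, depth axis.
    """
    C, D, H, W = volume.shape
    # Map normalized coordinates in [0, 1] to voxel-grid coordinates.
    d = points[:, 0] * (D - 1)
    y = points[:, 1] * (H - 1)
    x = points[:, 2] * (W - 1)
    d0 = np.floor(d).astype(int)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    d1 = np.clip(d0 + 1, 0, D - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    fd, fy, fx = d - d0, y - y0, x - x0
    # Accumulate the 8 corner contributions, weighted per axis.
    out = 0
    for dd, wd in ((d0, 1 - fd), (d1, fd)):
        for yy, wy in ((y0, 1 - fy), (y1, fy)):
            for xx, wx in ((x0, 1 - fx), (x1, fx)):
                out = out + volume[:, dd, yy, xx] * (wd * wy * wx)
    return out  # shape (C, N)

# Sanity check: trilinear interpolation reproduces any function that is
# linear in (d, y, x) exactly.
D, H, W = 4, 5, 6
grid = np.stack(np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij"))
volume = (grid[0] + 2 * grid[1] + 3 * grid[2])[None].astype(float)  # (1, D, H, W)
pts = np.array([[0.5, 0.5, 0.5]])
val = trilinear_sample(volume, pts)  # -> [[13.0]]: 1.5 + 2*2.0 + 3*2.5
```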
### 1. Results on Multi-view 3D Detection Validation Dataset

| Model | Inputs | Op | Overall | Head | Common | Tail | Small | Medium | Large | ScanNet | 3RScan | MP3D | ckpt | log |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BIP3D | RGB | DAG | 16.57 | 23.29 | 13.84 | 12.29 | 2.67 | 17.85 | 12.89 | 19.71 | 26.76 | 8.50 | - | - |
| BIP3D | RGB | DAT | 16.67 | 22.41 | 14.19 | 13.18 | 3.32 | 17.25 | 14.89 | 20.80 | 24.18 | 9.91 | - | - |
| BIP3D | RGB-D | DAG | 22.53 | 28.89 | 20.51 | 17.83 | 6.95 | 24.21 | 15.46 | 24.77 | 35.29 | 10.34 | - | - |
| BIP3D | RGB-D | DAT | 23.24 | 31.51 | 20.20 | 17.62 | 7.31 | 24.09 | 15.82 | 26.35 | 36.29 | 11.44 | - | - |

### 2. Results on Multi-view 3D Grounding Mini Dataset

| Model | Inputs | Op | Overall | Easy | Hard | View-dep | View-indep | ScanNet | 3RScan | MP3D | ckpt | log |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BIP3D | RGB | DAG | 44.00 | 44.39 | 39.56 | 46.05 | 42.92 | 48.62 | 42.47 | 36.40 | - | - |
| BIP3D | RGB | DAT | 44.43 | 44.74 | 41.02 | 45.17 | 44.04 | 49.70 | 41.81 | 37.28 | - | - |
| BIP3D | RGB-D | DAG | 45.79 | 46.22 | 40.91 | 45.93 | 45.71 | 48.94 | 46.61 | 37.36 | - | - |
| BIP3D | RGB-D | DAT | 58.47 | 59.02 | 52.23 | 60.20 | 57.56 | 66.63 | 54.79 | 46.72 | - | - |

### 3. Results on Multi-view 3D Grounding Validation Dataset

| Model | Inputs | Op | Mixed Data | Overall | Easy | Hard | View-dep | View-indep | ScanNet | 3RScan | MP3D | ckpt | log |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BIP3D | RGB | DAG | No | 45.81 | 46.21 | 41.34 | 47.07 | 45.09 | 50.40 | 47.53 | 32.97 | - | - |
| BIP3D | RGB | DAT | No | 47.29 | 47.82 | 41.42 | 48.58 | 46.56 | 52.74 | 47.85 | 34.60 | - | - |
| BIP3D | RGB-D | DAG | No | 53.75 | 53.87 | 52.43 | 55.21 | 52.93 | 60.05 | 54.92 | 38.20 | - | - |
| BIP3D | RGB-D | DAT | No | 61.36 | 61.88 | 55.58 | 62.43 | 60.76 | 66.96 | 62.75 | 46.92 | - | - |
| BIP3D | RGB-D | DAT | Yes | 66.58 | 66.99 | 62.07 | 67.95 | 65.81 | 72.43 | 68.26 | 51.14 | - | - |

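The "Mixed Data" setting in the table above refers to the mixed-data training strategy described earlier: during grounding fine-tuning, batches draw from both detection and grounding data. A minimal sketch of one way to do such mixing (the sample format, function name, and 50/50 ratio here are illustrative assumptions, not the paper's actual configuration):

```python
import random

def mixed_batches(grounding, detection, det_ratio=0.5, batch_size=4, seed=0):
    """Yield training batches that mix grounding and detection samples.

    Each slot in a batch comes from the detection set with probability
    `det_ratio`, otherwise from the grounding set, so the model sees
    both supervision signals during grounding fine-tuning.
    """
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(detection) if rng.random() < det_ratio else rng.choice(grounding)
            for _ in range(batch_size)
        ]

# Toy example: (source, id) tuples stand in for real annotated samples.
grounding = [("grounding", i) for i in range(10)]
detection = [("detection", i) for i in range(10)]
batch = next(mixed_batches(grounding, detection))  # a 4-sample mixed batch
```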
### 4. [Results on Multi-view 3D Grounding Test Dataset](https://huggingface.co/spaces/AGC2024/visual-grounding-2024)

| Model | Overall | Easy | Hard | View-dep | View-indep | ckpt | log |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan) | 39.67 | 40.52 | 30.24 | 39.05 | 39.94 | - | - |
| [SAG3D*](https://opendrivelab.github.io/Challenge%202024/multiview_Mi-Robot.pdf) | 46.92 | 47.72 | 38.03 | 46.31 | 47.18 | - | - |
| [DenseG*](https://opendrivelab.github.io/Challenge%202024/multiview_THU-LenovoAI.pdf) | 59.59 | 60.39 | 50.81 | 60.50 | 59.20 | - | - |
| BIP3D | 67.38 | 68.12 | 59.08 | 67.88 | 67.16 | - | - |
| BIP3D-Base | 70.53 | 71.22 | 62.91 | 70.69 | 70.47 | - | - |

## Citation
```bibtex
@article{lin2024bip3d,
  title={BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence},
  author={Lin, Xuewu and Lin, Tianwei and Huang, Lichao and Xie, Hongyu and Su, Zhizhong},
  journal={arXiv preprint arXiv:2411.14869},
  year={2024}
}
```