iFlyBot committed
Commit cf676a5 · 1 Parent(s): 2ea0cce

update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -5,25 +5,25 @@ license: mit
 
# IflyBotVLM
 
- ## Introduction
+ ## 🔥Introduction
 
We introduce IflyBotVLM, a general-purpose Vision-Language-Model (VLM) specifically engineered for the domain of Embodied Intelligence. The primary objective of this model is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robot motion control. It achieves this by abstracting complex scene information into an "Operational Language" that is body-agnostic and transferable, thus enabling seamless perception-to-action closed-loop coordination.
 
The architecture of IflyBotVLM is designed to realize four critical functional capabilities in the embodied domain:
 
- **Spatial Understanding and Metric**: Provides the model with the capacity to understand spatial relationships and perform relative position estimation among objects in the environment.
+ **🧭Spatial Understanding and Metric**: Provides the model with the capacity to understand spatial relationships and perform relative position estimation among objects in the environment.
 
- **Interactive Target Grounding**: Supports diverse grounding mechanisms, including 2D/3D object detection in the visual modality, language-based object and spatial referring, and the prediction of critical object affordance regions.
+ **🎯Interactive Target Grounding**: Supports diverse grounding mechanisms, including 2D/3D object detection in the visual modality, language-based object and spatial referring, and the prediction of critical object affordance regions.
 
- **Action Abstraction and Control Parameter Generation**: Generates outputs directly relevant to the manipulation domain, providing grasp poses and manipulation trajectories.
+ **🤖Action Abstraction and Control Parameter Generation**: Generates outputs directly relevant to the manipulation domain, providing grasp poses and manipulation trajectories.
 
- **Task Planning**: Leveraging the current scene comprehension, this module performs multi-step prediction to decompose complex tasks into a sequence of atomic skills, fundamentally supporting the robust execution of long-horizon tasks.
+ **📋Task Planning**: Leveraging the current scene comprehension, this module performs multi-step prediction to decompose complex tasks into a sequence of atomic skills, fundamentally supporting the robust execution of long-horizon tasks.
 
We anticipate that IflyBotVLM will serve as an efficient and scalable foundation model, driving the advancement of embodied AI from single-task capabilities toward generalist intelligent agents.
 
![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/radar_performance.png)
 
- ## Model Architecture
+ ## 🏗️Model Architecture
 
IflyBotVLM inherits the robust, three-stage "ViT-Projector-LLM" paradigm from established Vision-Language Models. It integrates a dedicated, incrementally pre-trained Visual Encoder with an advanced Language Model via a simple, randomly initialized MLP projector for efficient feature alignment.
 
@@ -31,7 +31,7 @@ The core enhancement lies in the ViT's Positional Encoding (PE) layer. Instead o
 
![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/architecture.png)
 
- ## Model Performance
+ ## 📊Model Performance
 
IflyBotVLM demonstrates superior performance across various challenging benchmarks.
 
@@ -41,7 +41,7 @@ IflyBotVLM demonstrates superior performance across various challenging benchmar
 
IflyBotVLM-8B achieves state-of-the-art (SOTA) or near-SOTA performance on ten spatial comprehension, spatial perception, and temporal task planning benchmarks: Where2Place, Refspatial-bench, ShareRobot-affordance, ShareRobot-trajectory, BLINK(spatial), EmbSpatial, ERQA, CVBench, SAT, EgoPlan2.
 
- ## Quick Start
+ ## 🚀Quick Start
 
### Using 🤗 Transformers to Chat
 
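The diff stops at the "### Using 🤗 Transformers to Chat" heading, so the quick-start snippet itself is not part of this commit view. Below is a minimal sketch of how such a chat call typically looks with 🤗 Transformers; the repository id `iFlyBot/IflyBotVLM-8B`, the AutoProcessor/AutoModelForCausalLM classes, and the chat-template message format are assumptions rather than the documented interface, so refer to the README for the actual usage.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repo id -- check the model card for the real one.
MODEL_ID = "iFlyBot/IflyBotVLM-8B"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

image = Image.open("table_scene.jpg")  # any local RGB image
question = "Which object on the table is closest to the red mug?"

# Assumes the processor ships a chat template that accepts interleaved
# image/text messages, as many recent VLM processors do.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```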