iFlyBot committed
Commit fa6c0d9 · verified · 1 Parent(s): acf53ee

Update README.md

add arxiv and github io

Files changed (1)
  1. README.md +216 -214
README.md CHANGED
---
license: mit
---


# iFlyBot-VLM

[arXiv](https://arxiv.org/abs/2511.04976) · [Project Page](https://xuwenjie401.github.io/iFlyBot-VLA.github.io/)

## 🔥Introduction

We introduce iFlyBot-VLM, a general-purpose Vision-Language Model (VLM) engineered specifically for the domain of Embodied Intelligence. The primary objective of this model is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robot motion control. It achieves this by abstracting complex scene information into an "Operational Language" that is body-agnostic and transferable, thus enabling seamless perception-to-action closed-loop coordination.

The architecture of iFlyBot-VLM is designed to realize four critical functional capabilities in the embodied domain (see the illustrative prompt sketch after this list):

- **🧭Spatial Understanding and Metric Estimation**: Gives the model the capacity to understand spatial relationships and perform relative position estimation among objects in the environment.
- **🎯Interactive Target Grounding**: Supports diverse grounding mechanisms, including 2D/3D object detection in the visual modality, language-based object and spatial referring, and prediction of critical object affordance regions.
- **🤖Action Abstraction and Control Parameter Generation**: Generates outputs directly relevant to the manipulation domain, such as grasp poses and manipulation trajectories.
- **📋Task Planning**: Leveraging the current scene understanding, performs multi-step prediction to decompose complex tasks into a sequence of atomic skills, supporting the robust execution of long-horizon tasks.

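As a rough illustration of how these capability groups map onto plain-text queries, the sketch below lists hypothetical prompts that could be sent through the `IflyRoboInference` helper and `model.chat` interface shown in the Quick Start at the end of this README. The prompt wording, the image path, and any implied output format are assumptions made purely for illustration; they are not a documented prompt specification for iFlyBot-VLM.

```python
# Hypothetical prompts sketching the four capability groups described above.
# Wording and the image path are illustrative assumptions only.
EXAMPLE_QUERIES = {
    "spatial understanding": "<image> Is the mug to the left of the laptop? Answer yes or no.",
    "target grounding":      "<image> Point out where the red screwdriver is in the image.",
    "action abstraction":    "<image> Which region of the drawer handle should the gripper grasp to open it?",
    "task planning":         "<image> List, step by step, the atomic skills needed to put the apple into the bowl.",
}


def run_capability_demo(infer, image_path="./examples-images/scene.jpg"):
    """`infer` is an IflyRoboInference instance as defined in the Quick Start section."""
    for capability, prompt in EXAMPLE_QUERIES.items():
        print(f"--- {capability} ---")
        # forward_multi_image prints the model's free-form text response.
        infer.forward_multi_image([image_path], {"prompt": prompt})
```
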
We anticipate that iFlyBot-VLM will serve as an efficient and scalable foundation model, driving the advancement of embodied AI from single-task capabilities toward generalist intelligent agents.


<div style="display: flex; gap: 1em; max-width: 100%;">
  <img
    src="https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/smart_donut_chart.png"
    style="flex: 1; max-width: 60%; height: auto; object-fit: contain;"
    alt="iFlyBotVLM Training Data"
  >
  <img
    src="https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/radar_performance.png"
    style="flex: 1; max-width: 40%; height: auto; object-fit: contain;"
    alt="iFlyBotVLM Performance"
  >
</div>


## 🏗️Model Architecture

iFlyBot-VLM inherits the robust, three-component "ViT-Projector-LLM" paradigm of established Vision-Language Models. It integrates a dedicated, incrementally pre-trained visual encoder with an advanced language model via a simple, randomly initialized MLP projector for efficient feature alignment.
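
The README does not specify the projector's exact configuration, so the following is only a generic sketch of how such an MLP projector maps ViT patch features into the LLM embedding space; the layer sizes, LayerNorm, and GELU activation are illustrative assumptions rather than iFlyBot-VLM's actual implementation.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Generic MLP projector: maps ViT patch embeddings to the LLM hidden size.

    Dimensions and activation are illustrative; iFlyBot-VLM's actual projector
    configuration is not specified in this README.
    """

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vit_dim),
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vit_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_tokens)


# Example: project 256 visual tokens into the LLM's embedding space.
tokens = torch.randn(1, 256, 1024)
print(VisionProjector()(tokens).shape)  # torch.Size([1, 256, 4096])
```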

The core enhancement lies in the ViT's positional encoding (PE) layer. Instead of relying solely on the original 448-dimension PE, we employ bicubic interpolation to upsample the learned positional embeddings from 448 to an enriched dimension of 896. This approach, termed Dimension-Expanded Position Embedding (DEPE), provides a significantly more nuanced spatial context vector for each visual token. The dimensional enrichment allows the model to capture more complex positional and relative spatial information without increasing the sequence length, thereby enhancing the model's ability to perform fine-grained visual reasoning and detailed localization.
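
The sketch below shows one literal reading of that interpolation step: the learned PE table is treated as a 2D map and bicubically resampled along its channel axis from 448 to 896, leaving the number of positions (and hence the visual sequence length) unchanged. The table layout, the inclusion of a class-token row, and the choice of resampling axis are assumptions for illustration, not iFlyBot-VLM's released implementation.

```python
import torch
import torch.nn.functional as F


def expand_pe_dimension(pos_embed: torch.Tensor, new_dim: int = 896) -> torch.Tensor:
    """Bicubic upsampling of each learned positional vector from 448 to 896 channels.

    pos_embed: (num_positions, 448) learned ViT positional-embedding table.
    Returns:   (num_positions, new_dim) table; the number of positions (and thus
               the visual sequence length) stays the same.
    """
    num_pos, old_dim = pos_embed.shape
    # Treat the table as a 1-channel "image" and resample only the channel axis.
    table = pos_embed.unsqueeze(0).unsqueeze(0)              # (1, 1, num_pos, old_dim)
    table = F.interpolate(table, size=(num_pos, new_dim),
                          mode="bicubic", align_corners=False)
    return table.squeeze(0).squeeze(0)                       # (num_pos, new_dim)


# Example: a 1025-entry table (class token + 32x32 patches) widened from 448 to 896.
pe = torch.randn(1025, 448)
print(expand_pe_dimension(pe).shape)  # torch.Size([1025, 896])
```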

![image/png](https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/architecture.png)

## 📊Model Performance

iFlyBot-VLM demonstrates strong performance across a range of challenging benchmarks.

![image/png](https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/benchmark_performance.png)

![image/png](https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/table-performances.png)

iFlyBot-VLM-8B achieves state-of-the-art (SOTA) or near-SOTA performance on ten spatial understanding, spatial perception, and temporal task-planning benchmarks: Where2Place, RefSpatial-Bench, ShareRobot-affordance, ShareRobot-trajectory, BLINK (spatial), EmbSpatial, ERQA, CVBench, SAT, and EgoPlan2.

## 🚀Quick Start

### Using 🤗 Transformers to Chat

We provide example code for running `iFlyBot-VLM-8B` with `transformers`.

> Please use `transformers>=4.37.2` to ensure the model works correctly.

<details>
<summary>Python code</summary>

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


class IflyRoboInference:
    def __init__(self, model_path=''):
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            load_in_8bit=False,
            low_cpu_mem_usage=True,
            use_flash_attn=True,
            trust_remote_code=True,
            device_map="balanced").eval()  # or "auto"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
        # top_k=1 makes sampling effectively greedy despite do_sample=True
        self.generation_config = dict(
            do_sample=True,
            temperature=0.5,
            top_p=0.0,
            top_k=1,
            max_new_tokens=16384
        )

    def build_transform(self, input_size):
        MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
        transform = T.Compose([
            T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
            T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=MEAN, std=STD)
        ])
        return transform

    def find_closest_aspect_ratio(self, aspect_ratio, target_ratios, width, height, image_size):
        # Pick the tiling grid whose aspect ratio is closest to the input image's
        best_ratio_diff = float('inf')
        best_ratio = (1, 1)
        area = width * height
        for ratio in target_ratios:
            target_aspect_ratio = ratio[0] / ratio[1]
            ratio_diff = abs(aspect_ratio - target_aspect_ratio)
            if ratio_diff < best_ratio_diff:
                best_ratio_diff = ratio_diff
                best_ratio = ratio
            elif ratio_diff == best_ratio_diff:
                if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                    best_ratio = ratio
        return best_ratio

    def dynamic_preprocess(self, image, min_num=1, max_num=12, image_size=896, use_thumbnail=False):
        # Split the image into up to `max_num` square tiles of side `image_size`
        orig_width, orig_height = image.size
        aspect_ratio = orig_width / orig_height

        target_ratios = set(
            (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
            i * j <= max_num and i * j >= min_num)
        target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

        target_aspect_ratio = self.find_closest_aspect_ratio(
            aspect_ratio, target_ratios, orig_width, orig_height, image_size)

        target_width = image_size * target_aspect_ratio[0]
        target_height = image_size * target_aspect_ratio[1]
        blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

        resized_img = image.resize((target_width, target_height))
        processed_images = []
        for i in range(blocks):
            box = (
                (i % (target_width // image_size)) * image_size,
                (i // (target_width // image_size)) * image_size,
                ((i % (target_width // image_size)) + 1) * image_size,
                ((i // (target_width // image_size)) + 1) * image_size
            )
            split_img = resized_img.crop(box)
            processed_images.append(split_img)
        assert len(processed_images) == blocks
        if use_thumbnail and len(processed_images) != 1:
            thumbnail_img = image.resize((image_size, image_size))
            processed_images.append(thumbnail_img)
        return processed_images

    def load_image(self, image_file, input_size=896, max_num=12):
        image = Image.open(image_file).convert('RGB')
        transform = self.build_transform(input_size=input_size)
        images = self.dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(image) for image in images]
        pixel_values = torch.stack(pixel_values)
        return pixel_values

    def forward_multi_image(self, image_paths: list, question: dict):
        pixel_values = []
        num_patches_list = []
        resize_size = 448  # tile size used for inference
        for image_path in image_paths:
            pixel_value = self.load_image(image_path, input_size=resize_size).to(torch.bfloat16).cuda()
            pixel_values.append(pixel_value)
            num_patches_list.append(pixel_value.size(0))
        pixel_values = torch.cat(tuple(pixel_values), dim=0)
        print(question)
        response, history = self.model.chat(self.tokenizer, pixel_values, question["prompt"],
                                            self.generation_config, history=None, return_history=True)
        print(response)


def test_spatial_from_blink():
    hf_path = "iFlyBot/iFlyBotVLM"
    ifly_robo_infer = IflyRoboInference(hf_path)
    question = {
        "idx": "val_Spatial_Relation_143",
        "sub_task": "Spatial Relation",
        "prompt": "<image> Is the person behind the cup?\nSelect from the following choices.\n(A) yes\n(B) no.\nPlease answer directly with only the letter of the correct option and nothing else."
    }
    image_path = [
        "./examples-images/val_Spatial_Relation_143_1.jpg"
    ]
    ifly_robo_infer.forward_multi_image(image_path, question)


def test_visual_correspondence_from_blink():
    hf_path = "iFlyBot/iFlyBotVLM"
    ifly_robo_infer = IflyRoboInference(hf_path)
    question = {
        "idx": "val_Visual_Correspondence_1",
        "sub_task": "Visual Correspondence",
        "prompt": "<image> <image> A point is circled on the first image, labeled with REF. We change the camera position or lighting and shoot the second image. You are given multiple red-circled points on the second image, choices of \"A, B, C, D\" are drawn beside each circle. Which point on the second image corresponds to the point in the first image? Select from the following options.\n(A) Point A\n(B) Point B\n(C) Point C\n(D) Point D.\nPlease answer directly with only the letter of the correct option and nothing else."
    }
    image_path = [
        "./examples-images/val_Visual_Correspondence_1_1.jpg",
        "./examples-images/val_Visual_Correspondence_1_2.jpg"
    ]
    ifly_robo_infer.forward_multi_image(image_path, question)


if __name__ == '__main__':
    test_spatial_from_blink()
    test_visual_correspondence_from_blink()
    # test_task_plan_from_egoplan2() is called in the original snippet but not
    # defined there, so it is omitted here to keep the example runnable.
```

</details>