Update pipeline tag to robotics and improve model card

#1
by nielsr HF Staff - opened
Files changed (1) hide show
  1. README.md +21 -24
README.md CHANGED
@@ -1,21 +1,20 @@
1
  ---
2
- license: apache-2.0
 
3
  language:
4
  - en
5
  - zh
6
- pipeline_tag: image-text-to-text
7
  library_name: transformers
 
 
8
  tags:
9
  - embodied-ai
10
- - robotics
11
  - vision-language-model
12
  - embodied-reasoning
13
  - spatial-reasoning
14
  - pointing
15
  - vla
16
  - qwen3-vl
17
- base_model:
18
- - Qwen/Qwen3-VL-8B-Instruct
19
  ---
20
 
21
  # Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
@@ -24,20 +23,21 @@ base_model:
24
  🌐 <a href="https://embodied-r.github.io/">Project Page</a> &nbsp;|&nbsp;
25
  πŸ’» <a href="https://github.com/pickxiguapi/Embodied-R1.5">Code</a> &nbsp;|&nbsp;
26
  🧰 <a href="https://github.com/pickxiguapi/EmbodiedEvalKit">EmbodiedEvalKit</a> &nbsp;|&nbsp;
27
- πŸ€— <a href="https://huggingface.co/collections/IffYuan/embodied-r15">Models & Datasets</a>
 
28
  </p>
29
 
30
  > *"Reasoning initiates the action; Action fulfills the reasoning."* β€” Wang Yangming (1509)
31
 
32
  ## Overview
33
 
34
- **Embodied-R1.5** is a unified **Embodied Foundation Model (EFM)**, built on **Qwen3-VL-8B-Instruct**, that integrates comprehensive embodied reasoning within a single architecture. Building on our prior work [Embodied-R1](https://github.com/pickxiguapi/Embodied-R1), it leaps from a pointing specialist to a comprehensive EFM unifying **three core capabilities**:
35
 
36
  - **Spatial cognition & reasoning** β€” comprehend the semantic and spatial structure of the physical world, including static geometric relations and dynamic interaction possibilities.
37
  - **Task planning & correction** β€” cover the full task life cycle: long-horizon decomposition, next-step planning, process detection, error localization, and correction.
38
  - **Embodied pointing & location** β€” ground high-level reasoning in coordinates and trajectories, covering referring expression grounding, region-level localization, functional (affordance) grounding, and visual trace generation.
39
 
40
- Trained on a 15B-token corpus with a multi-task balanced RL recipe, it further drives a **Planner-Grounder-Corrector (PGC)** closed-loop framework where one model acts as planner, grounder, and corrector to autonomously complete long-horizon real-world tasks. With only 8B parameters, Embodied-R1.5 is best on **16 of 24** embodied VLM benchmarks (avg. **70.4%**), surpassing Gemini-Robotics-ER-1.5 and GPT-5.4; with light action-data fine-tuning it adapts into **Embodied-R1.5-VLA**, outperforming strong baselines like $\pi_{0.5}$ across 4 manipulation benchmark suites; and it generalizes zero-shot to real robots on instruction following, affordance grounding, articulated manipulation, and long-horizon tasks.
41
 
42
  ## Output Conventions
43
 
@@ -50,9 +50,15 @@ Embodied-R1.5 follows the Qwen3-VL chat format and outputs structured answers in
50
  | `open-ended` | free text |
51
  | `math` | `$$-\dfrac{3}{2}$$` |
52
  | `spatial grounding` | `{"boxes": [35, 227, 437, 932]}` |
53
- | `point` | ` ```json\n[{"point_2d": [230, 138]}]\n``` ` |
54
- | `trace` | ` ```json\n[{"point_2d": [624, 469]}, ...]\n``` ` |
55
- | `trace_3d` | ` ```json\n[{"point_2d": [463, 599], "depth": 1.08}, ...]\n``` ` |
 
 
 
 
 
 
56
 
57
  > **Coordinate & unit conventions.** All points (`point_2d`) and boxes are normalized to the `[0, 1000]` range, regardless of the original image resolution. For `trace_3d`, the `depth` value is in meters.
58
 
@@ -106,11 +112,9 @@ vllm serve IffYuan/Embodied-R1.5 \
106
  --host 0.0.0.0 --port 22002
107
  ```
108
 
109
- More runnable examples (vLLM online / offline, HuggingFace, point decoding & visualization) are provided in the [GitHub repository](https://github.com/pickxiguapi/Embodied-R1.5) under `inference/`.
110
-
111
  ## Evaluation
112
 
113
- For benchmark evaluation, see [EmbodiedEvalKit](https://github.com/pickxiguapi/EmbodiedEvalKit), our evaluation framework covering 25+ embodied benchmarks.
114
 
115
  ## Training & Data
116
 
@@ -118,12 +122,12 @@ Embodied-R1.5 is trained in two stages: SFT (LLaMA-Factory) followed by RFT (Eas
118
 
119
  ## Citation
120
 
121
- If you find Embodied-R1.5 useful in your research, please cite our work:
122
 
123
  ```bibtex
124
  @article{yuan2026embodiedr15,
125
  title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
126
- author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and Zhang, Shuoheng and Han, Linqi and Li, Yutong and Li, Pengyi and Sun, Jiangeng and Jia, Wenting and Hu, Yucheng and Liu, Yuhao and Liao, Ruihao and Wu, Qiyu and Li, Yuxiao and Zhang, Zhao and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
127
  journal={arXiv preprint},
128
  year={2026}
129
  }
@@ -134,15 +138,8 @@ If you find Embodied-R1.5 useful in your research, please cite our work:
134
  journal={ICLR 2026},
135
  year={2025}
136
  }
137
-
138
- @article{yuan2025seeing,
139
- title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
140
- author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
141
- journal={ICLR 2026},
142
- year={2025}
143
- }
144
  ```
145
 
146
  ## License
147
 
148
- Released under the Apache 2.0 license.
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen3-VL-8B-Instruct
4
  language:
5
  - en
6
  - zh
 
7
  library_name: transformers
8
+ license: apache-2.0
9
+ pipeline_tag: robotics
10
  tags:
11
  - embodied-ai
 
12
  - vision-language-model
13
  - embodied-reasoning
14
  - spatial-reasoning
15
  - pointing
16
  - vla
17
  - qwen3-vl
 
 
18
  ---
19
 
20
  # Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
 
23
  🌐 <a href="https://embodied-r.github.io/">Project Page</a> &nbsp;|&nbsp;
24
  πŸ’» <a href="https://github.com/pickxiguapi/Embodied-R1.5">Code</a> &nbsp;|&nbsp;
25
  🧰 <a href="https://github.com/pickxiguapi/EmbodiedEvalKit">EmbodiedEvalKit</a> &nbsp;|&nbsp;
26
+ πŸ€— <a href="https://huggingface.co/collections/IffYuan/embodied-r15">Models & Datasets</a> &nbsp;|&nbsp;
27
+ πŸ“„ <a href="https://huggingface.co/papers/2606.11324">Paper</a>
28
  </p>
29
 
30
  > *"Reasoning initiates the action; Action fulfills the reasoning."* β€” Wang Yangming (1509)
31
 
32
  ## Overview
33
 
34
+ **Embodied-R1.5** is a unified **Embodied Foundation Model (EFM)**, built on **Qwen3-VL-8B-Instruct**, that integrates comprehensive embodied reasoning within a single architecture. Building on [Embodied-R1](https://github.com/pickxiguapi/Embodied-R1), it leaps from a pointing specialist to a comprehensive EFM unifying **three core capabilities**:
35
 
36
  - **Spatial cognition & reasoning** β€” comprehend the semantic and spatial structure of the physical world, including static geometric relations and dynamic interaction possibilities.
37
  - **Task planning & correction** β€” cover the full task life cycle: long-horizon decomposition, next-step planning, process detection, error localization, and correction.
38
  - **Embodied pointing & location** β€” ground high-level reasoning in coordinates and trajectories, covering referring expression grounding, region-level localization, functional (affordance) grounding, and visual trace generation.
39
 
40
+ Trained on a 15B-token corpus with a multi-task balanced RL recipe, it further drives a **Planner-Grounder-Corrector (PGC)** closed-loop framework where one model acts as planner, grounder, and corrector to autonomously complete long-horizon real-world tasks.
41
 
42
  ## Output Conventions
43
 
 
50
  | `open-ended` | free text |
51
  | `math` | `$$-\dfrac{3}{2}$$` |
52
  | `spatial grounding` | `{"boxes": [35, 227, 437, 932]}` |
53
+ | `point` | ` ```json
54
+ [{"point_2d": [230, 138]}]
55
+ ``` ` |
56
+ | `trace` | ` ```json
57
+ [{"point_2d": [624, 469]}, ...]
58
+ ``` ` |
59
+ | `trace_3d` | ` ```json
60
+ [{"point_2d": [463, 599], "depth": 1.08}, ...]
61
+ ``` ` |
62
 
63
  > **Coordinate & unit conventions.** All points (`point_2d`) and boxes are normalized to the `[0, 1000]` range, regardless of the original image resolution. For `trace_3d`, the `depth` value is in meters.
64
 
 
112
  --host 0.0.0.0 --port 22002
113
  ```
114
 
 
 
115
  ## Evaluation
116
 
117
+ For benchmark evaluation, see [EmbodiedEvalKit](https://github.com/pickxiguapi/EmbodiedEvalKit), an evaluation framework covering 25+ embodied benchmarks.
118
 
119
  ## Training & Data
120
 
 
122
 
123
  ## Citation
124
 
125
+ If you find Embodied-R1.5 useful in your research, please cite:
126
 
127
  ```bibtex
128
  @article{yuan2026embodiedr15,
129
  title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
130
+ author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and Li, Yutong and Zhang, Shuoheng and Han, Linqi and Li, Pengyi and Sun, Jiangeng and Jia, Wenting and Zhao Zhang and Liu, Yuhao and Liao, Ruihao and Hu, Yucheng and Wu, Qiyu and Li, Yuxiao and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
131
  journal={arXiv preprint},
132
  year={2026}
133
  }
 
138
  journal={ICLR 2026},
139
  year={2025}
140
  }
 
 
 
 
 
 
 
141
  ```
142
 
143
  ## License
144
 
145
+ Released under the Apache 2.0 license.