Robotics
Transformers
Safetensors
English
Chinese
qwen3_vl
image-text-to-text
embodied-ai
vision-language-model
embodied-reasoning
spatial-reasoning
pointing
vla
qwen3-vl
Instructions to use IffYuan/Embodied-R1.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use IffYuan/Embodied-R1.5 with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForMultimodalLM processor = AutoProcessor.from_pretrained("IffYuan/Embodied-R1.5") model = AutoModelForMultimodalLM.from_pretrained("IffYuan/Embodied-R1.5") - Notebooks
- Google Colab
- Kaggle
Update pipeline tag to robotics and improve model card
#1
by nielsr HF Staff - opened
README.md
CHANGED
|
@@ -1,21 +1,20 @@
|
|
| 1 |
---
|
| 2 |
-
|
|
|
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
- zh
|
| 6 |
-
pipeline_tag: image-text-to-text
|
| 7 |
library_name: transformers
|
|
|
|
|
|
|
| 8 |
tags:
|
| 9 |
- embodied-ai
|
| 10 |
-
- robotics
|
| 11 |
- vision-language-model
|
| 12 |
- embodied-reasoning
|
| 13 |
- spatial-reasoning
|
| 14 |
- pointing
|
| 15 |
- vla
|
| 16 |
- qwen3-vl
|
| 17 |
-
base_model:
|
| 18 |
-
- Qwen/Qwen3-VL-8B-Instruct
|
| 19 |
---
|
| 20 |
|
| 21 |
# Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
|
|
@@ -24,20 +23,21 @@ base_model:
|
|
| 24 |
π <a href="https://embodied-r.github.io/">Project Page</a> |
|
| 25 |
π» <a href="https://github.com/pickxiguapi/Embodied-R1.5">Code</a> |
|
| 26 |
π§° <a href="https://github.com/pickxiguapi/EmbodiedEvalKit">EmbodiedEvalKit</a> |
|
| 27 |
-
π€ <a href="https://huggingface.co/collections/IffYuan/embodied-r15">Models & Datasets</a>
|
|
|
|
| 28 |
</p>
|
| 29 |
|
| 30 |
> *"Reasoning initiates the action; Action fulfills the reasoning."* β Wang Yangming (1509)
|
| 31 |
|
| 32 |
## Overview
|
| 33 |
|
| 34 |
-
**Embodied-R1.5** is a unified **Embodied Foundation Model (EFM)**, built on **Qwen3-VL-8B-Instruct**, that integrates comprehensive embodied reasoning within a single architecture. Building on
|
| 35 |
|
| 36 |
- **Spatial cognition & reasoning** β comprehend the semantic and spatial structure of the physical world, including static geometric relations and dynamic interaction possibilities.
|
| 37 |
- **Task planning & correction** β cover the full task life cycle: long-horizon decomposition, next-step planning, process detection, error localization, and correction.
|
| 38 |
- **Embodied pointing & location** β ground high-level reasoning in coordinates and trajectories, covering referring expression grounding, region-level localization, functional (affordance) grounding, and visual trace generation.
|
| 39 |
|
| 40 |
-
Trained on a 15B-token corpus with a multi-task balanced RL recipe, it further drives a **Planner-Grounder-Corrector (PGC)** closed-loop framework where one model acts as planner, grounder, and corrector to autonomously complete long-horizon real-world tasks.
|
| 41 |
|
| 42 |
## Output Conventions
|
| 43 |
|
|
@@ -50,9 +50,15 @@ Embodied-R1.5 follows the Qwen3-VL chat format and outputs structured answers in
|
|
| 50 |
| `open-ended` | free text |
|
| 51 |
| `math` | `$$-\dfrac{3}{2}$$` |
|
| 52 |
| `spatial grounding` | `{"boxes": [35, 227, 437, 932]}` |
|
| 53 |
-
| `point` | ` ```json
|
| 54 |
-
|
| 55 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
> **Coordinate & unit conventions.** All points (`point_2d`) and boxes are normalized to the `[0, 1000]` range, regardless of the original image resolution. For `trace_3d`, the `depth` value is in meters.
|
| 58 |
|
|
@@ -106,11 +112,9 @@ vllm serve IffYuan/Embodied-R1.5 \
|
|
| 106 |
--host 0.0.0.0 --port 22002
|
| 107 |
```
|
| 108 |
|
| 109 |
-
More runnable examples (vLLM online / offline, HuggingFace, point decoding & visualization) are provided in the [GitHub repository](https://github.com/pickxiguapi/Embodied-R1.5) under `inference/`.
|
| 110 |
-
|
| 111 |
## Evaluation
|
| 112 |
|
| 113 |
-
For benchmark evaluation, see [EmbodiedEvalKit](https://github.com/pickxiguapi/EmbodiedEvalKit),
|
| 114 |
|
| 115 |
## Training & Data
|
| 116 |
|
|
@@ -118,12 +122,12 @@ Embodied-R1.5 is trained in two stages: SFT (LLaMA-Factory) followed by RFT (Eas
|
|
| 118 |
|
| 119 |
## Citation
|
| 120 |
|
| 121 |
-
If you find Embodied-R1.5 useful in your research, please cite
|
| 122 |
|
| 123 |
```bibtex
|
| 124 |
@article{yuan2026embodiedr15,
|
| 125 |
title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
|
| 126 |
-
author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and
|
| 127 |
journal={arXiv preprint},
|
| 128 |
year={2026}
|
| 129 |
}
|
|
@@ -134,15 +138,8 @@ If you find Embodied-R1.5 useful in your research, please cite our work:
|
|
| 134 |
journal={ICLR 2026},
|
| 135 |
year={2025}
|
| 136 |
}
|
| 137 |
-
|
| 138 |
-
@article{yuan2025seeing,
|
| 139 |
-
title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
|
| 140 |
-
author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
|
| 141 |
-
journal={ICLR 2026},
|
| 142 |
-
year={2025}
|
| 143 |
-
}
|
| 144 |
```
|
| 145 |
|
| 146 |
## License
|
| 147 |
|
| 148 |
-
Released under the Apache 2.0 license.
|
|
|
|
| 1 |
---
|
| 2 |
+
base_model:
|
| 3 |
+
- Qwen/Qwen3-VL-8B-Instruct
|
| 4 |
language:
|
| 5 |
- en
|
| 6 |
- zh
|
|
|
|
| 7 |
library_name: transformers
|
| 8 |
+
license: apache-2.0
|
| 9 |
+
pipeline_tag: robotics
|
| 10 |
tags:
|
| 11 |
- embodied-ai
|
|
|
|
| 12 |
- vision-language-model
|
| 13 |
- embodied-reasoning
|
| 14 |
- spatial-reasoning
|
| 15 |
- pointing
|
| 16 |
- vla
|
| 17 |
- qwen3-vl
|
|
|
|
|
|
|
| 18 |
---
|
| 19 |
|
| 20 |
# Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
|
|
|
|
| 23 |
π <a href="https://embodied-r.github.io/">Project Page</a> |
|
| 24 |
π» <a href="https://github.com/pickxiguapi/Embodied-R1.5">Code</a> |
|
| 25 |
π§° <a href="https://github.com/pickxiguapi/EmbodiedEvalKit">EmbodiedEvalKit</a> |
|
| 26 |
+
π€ <a href="https://huggingface.co/collections/IffYuan/embodied-r15">Models & Datasets</a> |
|
| 27 |
+
π <a href="https://huggingface.co/papers/2606.11324">Paper</a>
|
| 28 |
</p>
|
| 29 |
|
| 30 |
> *"Reasoning initiates the action; Action fulfills the reasoning."* β Wang Yangming (1509)
|
| 31 |
|
| 32 |
## Overview
|
| 33 |
|
| 34 |
+
**Embodied-R1.5** is a unified **Embodied Foundation Model (EFM)**, built on **Qwen3-VL-8B-Instruct**, that integrates comprehensive embodied reasoning within a single architecture. Building on [Embodied-R1](https://github.com/pickxiguapi/Embodied-R1), it leaps from a pointing specialist to a comprehensive EFM unifying **three core capabilities**:
|
| 35 |
|
| 36 |
- **Spatial cognition & reasoning** β comprehend the semantic and spatial structure of the physical world, including static geometric relations and dynamic interaction possibilities.
|
| 37 |
- **Task planning & correction** β cover the full task life cycle: long-horizon decomposition, next-step planning, process detection, error localization, and correction.
|
| 38 |
- **Embodied pointing & location** β ground high-level reasoning in coordinates and trajectories, covering referring expression grounding, region-level localization, functional (affordance) grounding, and visual trace generation.
|
| 39 |
|
| 40 |
+
Trained on a 15B-token corpus with a multi-task balanced RL recipe, it further drives a **Planner-Grounder-Corrector (PGC)** closed-loop framework where one model acts as planner, grounder, and corrector to autonomously complete long-horizon real-world tasks.
|
| 41 |
|
| 42 |
## Output Conventions
|
| 43 |
|
|
|
|
| 50 |
| `open-ended` | free text |
|
| 51 |
| `math` | `$$-\dfrac{3}{2}$$` |
|
| 52 |
| `spatial grounding` | `{"boxes": [35, 227, 437, 932]}` |
|
| 53 |
+
| `point` | ` ```json
|
| 54 |
+
[{"point_2d": [230, 138]}]
|
| 55 |
+
``` ` |
|
| 56 |
+
| `trace` | ` ```json
|
| 57 |
+
[{"point_2d": [624, 469]}, ...]
|
| 58 |
+
``` ` |
|
| 59 |
+
| `trace_3d` | ` ```json
|
| 60 |
+
[{"point_2d": [463, 599], "depth": 1.08}, ...]
|
| 61 |
+
``` ` |
|
| 62 |
|
| 63 |
> **Coordinate & unit conventions.** All points (`point_2d`) and boxes are normalized to the `[0, 1000]` range, regardless of the original image resolution. For `trace_3d`, the `depth` value is in meters.
|
| 64 |
|
|
|
|
| 112 |
--host 0.0.0.0 --port 22002
|
| 113 |
```
|
| 114 |
|
|
|
|
|
|
|
| 115 |
## Evaluation
|
| 116 |
|
| 117 |
+
For benchmark evaluation, see [EmbodiedEvalKit](https://github.com/pickxiguapi/EmbodiedEvalKit), an evaluation framework covering 25+ embodied benchmarks.
|
| 118 |
|
| 119 |
## Training & Data
|
| 120 |
|
|
|
|
| 122 |
|
| 123 |
## Citation
|
| 124 |
|
| 125 |
+
If you find Embodied-R1.5 useful in your research, please cite:
|
| 126 |
|
| 127 |
```bibtex
|
| 128 |
@article{yuan2026embodiedr15,
|
| 129 |
title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
|
| 130 |
+
author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and Li, Yutong and Zhang, Shuoheng and Han, Linqi and Li, Pengyi and Sun, Jiangeng and Jia, Wenting and Zhao Zhang and Liu, Yuhao and Liao, Ruihao and Hu, Yucheng and Wu, Qiyu and Li, Yuxiao and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
|
| 131 |
journal={arXiv preprint},
|
| 132 |
year={2026}
|
| 133 |
}
|
|
|
|
| 138 |
journal={ICLR 2026},
|
| 139 |
year={2025}
|
| 140 |
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 141 |
```
|
| 142 |
|
| 143 |
## License
|
| 144 |
|
| 145 |
+
Released under the Apache 2.0 license.
|