RoboPoint-v1-Llama2-13B

RoboPoint is an open-source vision-language model instruction-tuned on a mix of robotics and VQA data. Given an image and a language instruction, it outputs action guidance as points in the image.

Primary Use Cases

RoboPoint predicts spatial affordances, that is, where actions should be taken in relation to other entities, based on language instructions. For example, it can identify free space on a shelf in front of the rightmost object.
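Since the model answers with points, downstream code typically has to turn the text answer into pixel coordinates. The sketch below shows one way to do that. It assumes the answer is a string containing a list of (x, y) tuples normalized to the range 0 to 1, such as "[(0.45, 0.32), (0.51, 0.40)]"; this format is not stated in this card, so check it against the paper and code linked below. The helper name parse_points is hypothetical and not part of the RoboPoint codebase.

```python
import re
from typing import List, Tuple

# Assumption: the answer text contains tuples like "(0.45, 0.32)" whose
# coordinates are normalized to [0, 1] relative to image width and height.
_POINT_RE = re.compile(r"\(\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*\)")


def parse_points(answer: str, width: int, height: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: convert normalized points in `answer` to pixels."""
    pixels = []
    for x_str, y_str in _POINT_RE.findall(answer):
        x, y = float(x_str), float(y_str)
        # Skip anything outside the assumed normalized range.
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            continue
        # Clamp so that a coordinate of exactly 1.0 stays inside the image.
        px = min(int(x * width), width - 1)
        py = min(int(y * height), height - 1)
        pixels.append((px, py))
    return pixels


if __name__ == "__main__":
    # Example answer string, made up for illustration.
    print(parse_points("[(0.5, 0.25), (1.0, 1.0)]", width=640, height=480))
```

The regular expression ignores any surrounding text, so the helper returns an empty list when the answer contains no well-formed points rather than raising an error.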

Model Details

This model was fine-tuned from meta-llama/Llama-2-13b-chat-hf and has 13 billion parameters.

Date

This model was trained in June 2024.

Resources for More Information

Paper: https://arxiv.org/pdf/2406.10721
Code: https://github.com/wentaoyuan/RoboPoint
Website: https://robo-point.github.io

Training Dataset

See wentao-yuan/robopoint-data.

Citation

If you find our work helpful, please consider citing our paper.

@article{yuan2024robopoint,
  title={RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics},
  author={Yuan, Wentao and Duan, Jiafei and Blukis, Valts and Pumacay, Wilbert and Krishna, Ranjay and Murali, Adithyavairavan and Mousavian, Arsalan and Fox, Dieter},
  journal={arXiv preprint arXiv:2406.10721},
  year={2024}
}
