RoboPoint-v1-Vicuna-13B

RoboPoint is an open-source vision-language model instruction-tuned on a mix of robotics and VQA data. Given an image and a language instruction, it outputs points in the image that indicate where an action should be taken.
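To make the point output concrete, here is a minimal sketch of post-processing a reply into pixel coordinates. It assumes the model answers with a Python-style list of (x, y) tuples normalized to [0, 1]. This card does not specify that format, so check it against the actual model output. The helper names `parse_points` and `to_pixels` and the sample reply are hypothetical.

```python
import ast

def parse_points(reply: str) -> list[tuple[float, float]]:
    """Parse a reply such as "[(0.25, 0.5), (0.75, 0.5)]" into (x, y) tuples.

    Assumption: the reply is a list of 2-tuples of numbers in [0, 1].
    """
    points = ast.literal_eval(reply.strip())
    return [(float(x), float(y)) for x, y in points]

def to_pixels(points, width: int, height: int) -> list[tuple[int, int]]:
    """Scale normalized coordinates to integer pixel coordinates."""
    return [(round(x * width), round(y * height)) for x, y in points]

reply = "[(0.25, 0.5), (0.75, 0.5)]"  # hypothetical model reply
pixels = to_pixels(parse_points(reply), width=640, height=480)
print(pixels)  # [(160, 240), (480, 240)]
```

`ast.literal_eval` only evaluates literals, so it is safer than `eval` on model-generated text. It raises an exception on malformed replies, which a caller would need to handle.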

Primary Use Cases

RoboPoint predicts spatial affordances, that is, where actions should be taken in relation to other entities, based on language instructions. For example, it can identify free space on a shelf in front of the rightmost object.

Model Details

This model, published as wentao-yuan/robopoint-v1-vicuna-v1.5-13b, was fine-tuned from lmsys/vicuna-13b-v1.5. It has 13.4 billion parameters, stored as BF16 Safetensors weights.
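As a rough sizing aid, the sketch below estimates the memory needed for the weights alone. It uses the 13.4B parameter count and BF16 tensor type reported on the model page. Activations, the KV cache, and framework overhead are not included, so actual usage will be higher.

```python
PARAMS = 13.4e9       # parameter count reported on the model page
BYTES_PER_PARAM = 2   # BF16 stores each value in 16 bits

weight_bytes = PARAMS * BYTES_PER_PARAM
weight_gb = weight_bytes / 1e9       # decimal gigabytes
weight_gib = weight_bytes / 2**30    # binary gibibytes
print(f"{weight_gb:.1f} GB ({weight_gib:.1f} GiB)")  # 26.8 GB (25.0 GiB)
```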

Date

This model was trained in June 2024.

Resources for More Information

Training dataset

See wentao-yuan/robopoint-data.

Citation

If you find our work helpful, please consider citing our paper.

@article{yuan2024robopoint,
  title={RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics},
  author={Yuan, Wentao and Duan, Jiafei and Blukis, Valts and Pumacay, Wilbert and Krishna, Ranjay and Murali, Adithyavairavan and Mousavian, Arsalan and Fox, Dieter},
  journal={arXiv preprint arXiv:2406.10721},
  year={2024}
}