LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Abstract
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse, high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned on the resulting collection of datasets, using a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
Introduction
In this paper, we propose LLaRA, a framework that turns robot expert trajectories into conversation-style instruction-tuning data along with auxiliary data. We then finetune a pretrained vision-language model on this data and turn it into a strong robot manipulation policy. So how do we do that?
Visuomotor Instruction Tuning
First, we transform a typical behavior cloning dataset into an instruction-tuning dataset and finetune a VLM (LLaVA) on it. The resulting LLaRA framework benefits from the broad knowledge already embedded in the VLM, enabling better visuomotor task learning.
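To make the conversion concrete, here is a minimal sketch assuming a single-step pick-and-place action and LLaVA's conversation JSON format. The field names, prompt wording, and action-to-text encoding are illustrative assumptions, not the exact LLaRA templates.

```python
def bc_step_to_conversation(image_path: str, instruction: str, action: dict) -> dict:
    """Turn one (observation, instruction, action) triplet from a behavior
    cloning trajectory into a LLaVA-style instruction-tuning sample."""
    # Express the continuous action as text, here as normalized 2D pick/place points.
    pick, place = action["pick"], action["place"]
    action_text = (
        f"Pick at ({pick[0]:.3f}, {pick[1]:.3f}) and "
        f"place at ({place[0]:.3f}, {place[1]:.3f})."
    )
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": f"<image>\nTask: {instruction}\nWhat should the robot do next?"},
            {"from": "gpt", "value": action_text},
        ],
    }


sample = bc_step_to_conversation(
    image_path="episode_0003/step_05.png",
    instruction="put the red block into the green bowl",
    action={"pick": (0.42, 0.31), "place": (0.68, 0.55)},
)
```

Each expert step becomes one image-grounded question with the expert action written out as the textual answer, so standard VLM instruction tuning directly supervises the policy.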
Supercharging Visuomotor Instruction Dataset
Then we create auxiliary robotics instruction tuning datasets from the same source to enhance the VLM policy. The idea is that the auxiliary datasets will drive VLMs to learn a better spatio-temporal understanding of the scene and eventually benefit robot learning.
Note that the auxiliary datasets are constructed from the same robot expert trajectories in a self-supervised fashion: the process depends only on an object detector and uses no external data.
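As one illustration of how an auxiliary task can be derived purely from detections on the same frames, the sketch below generates spatial-grounding question-answer pairs. The question/answer templates and coordinate convention are assumptions for illustration, not the exact auxiliary formats used in LLaRA.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]


def localization_samples(image_path: str, detections: Dict[str, Box]) -> List[dict]:
    """Build spatial-grounding QA pairs from object detections on one frame,
    with no supervision beyond the detector output."""
    samples = []
    for name, (x0, y0, x1, y1) in detections.items():
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # object center in image coordinates
        samples.append({
            "image": image_path,
            "conversations": [
                {"from": "human", "value": f"<image>\nWhere is the {name} in the image?"},
                {"from": "gpt", "value": f"The {name} is at ({cx:.3f}, {cy:.3f})."},
            ],
        })
    return samples


aux = localization_samples(
    "episode_0003/step_05.png",
    {"red block": (0.35, 0.25, 0.49, 0.38), "green bowl": (0.60, 0.48, 0.78, 0.63)},
)
```

Samples like these can be mixed with the policy data during finetuning, so the VLM is also supervised on the spatial cues that action prediction relies on.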
Experiments (Real-world and simulated)
We extensively studied best practices for constructing auxiliary datasets and found that they significantly enhance VLM policy performance, especially when the original data is limited. On VIMA-Bench, our method consistently outperforms an RT-2-style baseline.
We ran multiple types of real-world robot experiments and found that our method, trained on only 8k simulated examples, performs strongly in unseen real-world settings. In addition, with minimal in-domain finetuning, the model achieves a 91.6% average success rate.
Conclusion
In conclusion, LLaRA turns an instruction-tuned vision-language model (VLM) into a robot policy using curated instruction-tuning datasets, and it performs strongly in both simulated and real-world settings.
For more details:
Paper: http://arxiv.org/abs/2406.20095
Code: https://github.com/LostXine/LLaRA