---
license: mit
---

# Model Card

Inspired by [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993) and [Lessons from Archives (Jo & Gebru)](https://arxiv.org/abs/1912.10389), we provide some accompanying information about the VIMA model.

## Model Details

VIMA (**Vi**suo**M**otor **A**ttention) is a novel Transformer agent that ingests multimodal prompts and outputs robot arm control autoregressively. VIMA is developed primarily by researchers at Stanford/NVIDIA.

### Model Date

October 2022

### Model Type

The VIMA model consists of a pretrained T5 model as the prompt encoder, several tokenizers that process multimodal inputs, and a causal decoder that autoregressively predicts actions given the prompt and the interaction history.

### Model Versions

We released 7 checkpoints covering a spectrum of model capacities from 2M to 200M parameters.

## Model Use

### Intended Use

The model is intended to be used alongside [VIMA-Bench](https://github.com/vimalabs/VimaBench) to study general robot manipulation with multimodal prompts.

### Primary Intended Users

The primary intended users of these models are AI researchers in robotics, multimodal learning, embodied agents, foundation models, etc.

## Data

The models were trained on [data](https://doi.org/10.5281/zenodo.7127587) generated by oracles implemented in [VIMA-Bench](https://github.com/vimalabs/VimaBench). The dataset contains 650K successful trajectories for behavior cloning. We use 600K trajectories for training and hold out the remaining 50K trajectories for validation.

## Performance and Limitations

### Metrics and Performance

We quantify the performance of trained models by the task success percentage aggregated over multiple tasks. We evaluate models on the task suite from [VIMA-Bench](https://github.com/vimalabs/VimaBench) and follow its proposed evaluation protocol; see our paper for more details. An illustrative sketch of this aggregation appears at the end of this card.

### Limitations

Our provided model checkpoints are pretrained on VIMA-Bench and may not directly generalize to other simulators or to the real world. Limitations are further discussed in the paper.

## Paper and Citation

Our paper is posted on [arXiv](https://arxiv.org/abs/2210.03094). If you find our work useful, please consider citing us!

```bibtex
@inproceedings{jiang2023vima,
  title     = {VIMA: General Robot Manipulation with Multimodal Prompts},
  author    = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
  booktitle = {Fortieth International Conference on Machine Learning},
  year      = {2023}
}
```
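
## Example: Aggregating Task Success (Illustrative)

The sketch below shows how per-task success percentages can be averaged into a single aggregated score, as described in the Metrics and Performance section. It is not the official VIMA-Bench evaluation script; the task names and episode counts are hypothetical placeholders.

```python
# Minimal sketch (not the official VIMA-Bench evaluation code): aggregate
# per-task success percentages into a single score. The task names and
# per-task counts below are hypothetical placeholders.
from typing import Dict, Tuple


def aggregate_success(results: Dict[str, Tuple[int, int]]) -> float:
    """Average per-task success percentages.

    `results` maps a task name to (num_successes, num_episodes).
    """
    per_task = [100.0 * successes / episodes
                for successes, episodes in results.values() if episodes > 0]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical per-task outcomes: (successful episodes, evaluated episodes).
    results = {
        "task_a": (92, 100),
        "task_b": (88, 100),
        "task_c": (45, 100),
    }
    print(f"Aggregated task success: {aggregate_success(results):.1f}%")
```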