Model Card

Inspired by Model Cards for Model Reporting (Mitchell et al.) and Lessons from Archives (Jo & Gebru), we’re providing some accompanying information about the VIMA model.

Model Details

VIMA (VisuoMotor Attention) is a novel Transformer agent that ingests multimodal prompts and autoregressively outputs robot arm control actions. VIMA was developed primarily by researchers at Stanford and NVIDIA.

Model Date

October 2022

Model Type

The VIMA model consists of a pretrained T5 model as the prompt encoder, several tokenizers to process multimodal inputs, and a causal decoder that autoregressively predicts actions given the prompt and interaction history.
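To make the control flow concrete, here is a minimal sketch of an autoregressive rollout loop: the prompt is encoded once, and each action is predicted from the prompt plus the interaction history so far. The names `encode_prompt`, `decode_action`, `env_step`, and `rollout` are hypothetical placeholders for illustration and are not the VIMA or VIMA-Bench API.

```python
from typing import Callable, List, Sequence, Tuple

# All names below are hypothetical stand-ins, not the VIMA API.
Observation = Tuple[float, ...]
Action = Tuple[float, ...]
History = List[Tuple[Observation, Action]]


def rollout(
    encode_prompt: Callable[[Sequence[str]], List[float]],
    decode_action: Callable[[List[float], History, Observation], Action],
    env_step: Callable[[Action], Tuple[Observation, bool]],
    prompt: Sequence[str],
    first_obs: Observation,
    max_steps: int = 10,
) -> History:
    """Run an autoregressive control loop and return the (obs, action) history."""
    # The prompt is encoded once and reused at every step.
    prompt_tokens = encode_prompt(prompt)
    history: History = []
    obs = first_obs
    for _ in range(max_steps):
        # The next action is conditioned on the prompt and the history so far.
        action = decode_action(prompt_tokens, history, obs)
        history.append((obs, action))
        obs, done = env_step(action)
        if done:
            break
    return history
```

In this sketch, any callables with matching signatures can be plugged in, which keeps the loop independent of a particular model or simulator.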

Model Versions

We released 7 checkpoints spanning model capacities from 2M to 200M parameters.

Model Use

Intended Use

The model is intended to be used alongside VIMA-Bench to study general robot manipulation with multimodal prompts.

Primary Intended Users

The primary intended users of these models are AI researchers in robotics, multimodal learning, embodied agents, foundation models, etc.


The models were trained on data generated by oracles implemented in VIMA-Bench. The dataset includes 650K successful trajectories for behavior cloning. We use 600K trajectories for training and hold out the remaining 50K for validation.
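As an illustration of the 600K/50K partition, the sketch below splits trajectory indices into training and validation sets. The card does not state how the held-out trajectories were selected, so the seeded random shuffle and the function name `split_indices` are assumptions made for this example.

```python
import random
from typing import List, Tuple

TOTAL_TRAJECTORIES = 650_000  # from the model card
NUM_TRAIN = 600_000           # from the model card


def split_indices(
    total: int = TOTAL_TRAJECTORIES,
    num_train: int = NUM_TRAIN,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train, validation) index lists.

    Assumption: a seeded random shuffle; the actual selection rule is not
    documented in the model card.
    """
    indices = list(range(total))
    random.Random(seed).shuffle(indices)
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_indices()
print(len(train_idx), len(val_idx))  # 600000 50000
```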

Performance and Limitations

Metrics and Performance

We quantify the performance of trained models using the task success percentage aggregated over multiple tasks. We evaluate the models on the task suite from VIMA-Bench and follow its proposed evaluation protocol. See our paper for more details.
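The metric can be sketched as follows: compute a success percentage per task, then aggregate across tasks. The unweighted mean over tasks, the function names, and the task labels are assumptions for illustration; the paper defines the exact protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_percentages(episodes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each task name to its success percentage over the given episodes."""
    counts = defaultdict(lambda: [0, 0])  # task -> [successes, episodes]
    for task, success in episodes:
        counts[task][0] += int(success)
        counts[task][1] += 1
    return {task: 100.0 * s / n for task, (s, n) in counts.items()}


def aggregate(per_task: Dict[str, float]) -> float:
    """Unweighted mean across tasks (an assumed aggregation rule)."""
    return sum(per_task.values()) / len(per_task)


# Hypothetical task labels and outcomes, for illustration only.
episodes = [
    ("task_a", True), ("task_a", False),
    ("task_b", True), ("task_b", True),
]
per_task = success_percentages(episodes)
print(per_task)             # {'task_a': 50.0, 'task_b': 100.0}
print(aggregate(per_task))  # 75.0
```

Averaging per-task percentages gives every task equal weight regardless of how many episodes it contributes; pooling all episodes instead would weight tasks by episode count.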


The provided model checkpoints are pre-trained on VIMA-Bench and may not directly generalize to other simulators or to the real world. Limitations are further discussed in the paper.

Paper and Citation

Our paper is posted on arXiv. If you find our work useful, please consider citing us!

  title     = {VIMA: General Robot Manipulation with Multimodal Prompts},
  author    = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
  booktitle = {Fortieth International Conference on Machine Learning},
  year      = {2023}