---
license: mit
---

# Model Card

Inspired by [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993) and [Lessons from Archives (Jo & Gebru)](https://arxiv.org/abs/1912.10389), we provide some accompanying information about the VIMA model.

## Model Details

VIMA (**Vi**suo**M**otor **A**ttention) is a novel Transformer agent that ingests multimodal prompts and outputs robot arm control autoregressively. VIMA is developed primarily by researchers at Stanford/NVIDIA.

### Model Date

October 2022

### Model Type

The VIMA model consists of a pretrained T5 model as the prompt encoder, several tokenizers that process multimodal inputs, and a causal decoder that autoregressively predicts actions given the prompt and the interaction history.

### Model Versions

We released 7 checkpoints covering a spectrum of model capacities from 2M to 200M parameters.

## Model Use

### Intended Use

The model is intended to be used alongside [VIMA-Bench](https://github.com/vimalabs/VimaBench) to study general robot manipulation with multimodal prompts.

### Primary Intended Users

The primary intended users of these models are AI researchers in robotics, multimodal learning, embodied agents, foundation models, etc.

## Data

The models were trained on [data](https://doi.org/10.5281/zenodo.7127587) generated by oracles implemented in [VIMA-Bench](https://github.com/vimalabs/VimaBench). The dataset contains 650K successful trajectories for behavior cloning. We use 600K trajectories for training and hold out the remaining 50K trajectories for validation.

## Performance and Limitations

### Metrics and Performance

We quantify the performance of trained models by the task success percentage aggregated over multiple tasks. We evaluate models on the task suite from [VIMA-Bench](https://github.com/vimalabs/VimaBench) and follow its proposed evaluation protocol; see our paper for more details. An illustrative sketch of this aggregation appears at the end of this card.

### Limitations

Our provided model checkpoints are pretrained on VIMA-Bench and may not directly generalize to other simulators or to the real world. Limitations are further discussed in the paper.

## Paper and Citation

Our paper is posted on [arXiv](https://arxiv.org/abs/2210.03094). If you find our work useful, please consider citing us!

```bibtex
@inproceedings{jiang2023vima,
  title     = {VIMA: General Robot Manipulation with Multimodal Prompts},
  author    = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
  booktitle = {Fortieth International Conference on Machine Learning},
  year      = {2023}
}
```
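
## Example: Aggregating Task Success (Illustrative)

The sketch below shows how per-task success percentages can be averaged into a single aggregated score, as described in the Metrics and Performance section. It is not the official VIMA-Bench evaluation script; the task names and episode counts are hypothetical placeholders.

```python
# Minimal sketch (not the official VIMA-Bench evaluation code): aggregate
# per-task success percentages into a single score. The task names and
# per-task counts below are hypothetical placeholders.
from typing import Dict, Tuple


def aggregate_success(results: Dict[str, Tuple[int, int]]) -> float:
    """Average per-task success percentages.

    `results` maps a task name to (num_successes, num_episodes).
    """
    per_task = [100.0 * successes / episodes
                for successes, episodes in results.values() if episodes > 0]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical per-task outcomes: (successful episodes, evaluated episodes).
    results = {
        "task_a": (92, 100),
        "task_b": (88, 100),
        "task_c": (45, 100),
    }
    print(f"Aggregated task success: {aggregate_success(results):.1f}%")
```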