
v-MLLM Model Card

Model details

Model type: v-MLLM is an open-source MLLM trained on the Visual-Modality Instruction (VIM) corpus; it can robustly follow both text-modality and visual-modality instructions.

Model date: v-MLLM-13B was trained in January 2024.

Github for more information: https://github.com/VIM-Bench/VIM_TOOL
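
Usage sketch (non-authoritative): the snippet below illustrates how a LLaVA-style checkpoint such as v-MLLM-13B might be queried with a visual-modality instruction, i.e. an instruction rendered into the image itself. The model id "v-MLLM/v-MLLM-13B", the generic AutoProcessor / LlavaForConditionalGeneration classes, and the prompt template are assumptions for illustration only; the officially supported loading code lives in the GitHub repository above.

# Hypothetical loading sketch -- assumes the checkpoint is compatible with the
# generic LLaVA classes in Hugging Face transformers. See VIM_TOOL on GitHub
# for the authoritative inference code.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "v-MLLM/v-MLLM-13B"  # assumed repo id; replace with the actual checkpoint path

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# In the VIM setting the instruction is embedded in the image itself,
# so the text prompt can stay minimal.
image = Image.open("image_with_embedded_instruction.png")
prompt = "USER: <image>\nFollow the instruction shown in the image. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])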

License

v-MLLM is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Intended use

Primary intended uses: The primary use of v-MLLM is for research on multimodal large language models.

Primary intended users: The primary intended users of the model are researchers in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

  • 846K-sample VIM corpus built on top of LVIS-Instruct4V.

Citation

Please kindly cite our papers if you find our resources useful:

@misc{li2024text,
      title={Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?}, 
      author={Xiujun Li and Yujie Lu and Zhe Gan and Jianfeng Gao and William Yang Wang and Yejin Choi},
      year={2024},
      eprint={2311.17647},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{lu2023vim,
      title={VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following}, 
      author={Yujie Lu and Xiujun Li and William Yang Wang and Yejin Choi},
      year={2023},
      eprint={2311.17647},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}