PVIT model

This is the model weights of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models.

Model description

Position-enhanced Visual Instruction Tuning (PVIT) extends the MLLM by incorporating an additional region-level vision encoder to facilitate support for region-based inputs. Specifically, we adopt the vision encoder from RegionCLIP and utilize it to extract region-level features by taking images and regions as inputs. As an additional source of information, the incorporation of region-level features in this way has a minimal impact on the original MLLM. Furthermore, since the features provided by RegionCLIP are themselves already aligned to the language at a fine-grained level, the overhead of aligning it to the MLLM will be relatively small. Following LLaVA, we design a two-stage training strategy for PVIT that first pre-training a linear projection to align the region features to the LLM word embedding, followed by end-to-end fine-tuning to follow complex fine-grained instructions.

For more details, please refer to our paper and github repo.

How to use

Users have to apply it on top of the original LLaMA weights to get actual PVIT weights. See here for instructions.

Intended use

Primary intended uses: The primary use of PVIT is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

BibTeX entry and citation info

@misc{chen2023positionenhanced,
      title={Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models}, 
      author={Chi Chen and Ruoyu Qin and Fuwen Luo and Xiaoyue Mi and Peng Li and Maosong Sun and Yang Liu},
      year={2023},
      eprint={2308.13437},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}