+---
+license: mit
+datasets:
+- OpenGVLab/V2PE-Data
+language:
+- en
+base_model:
+- OpenGVLab/InternVL2-2B
+new_version: OpenGVLab/V2PE
+library_name: transformers
+tags:
+- V2PE
+---
+# V2PE
+[\[⭐️Project Page\]](https://zzdhybthu.github.io/V2PE.github.io) [\[📜 ArXiv Paper\]](https://arxiv.org/abs/2412.09616) [\[📂 GitHub\]](https://github.com/OpenGVLab/V2PE) [\[📖 HF Datasets\]](https://huggingface.co/datasets/OpenGVLab/V2PE-Data)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/hLydYFXbs8--Th-tOcQIe.png)
+## Introduction
+Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents.
+To address this issue, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens.
+Our experiments demonstrate that V2PE enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2-2B. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks.
+Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications.
+This repository contains the instruction-tuned V2PE-32K-InternVL-2B and V2PE-256K-InternVL-2B models, which have 1.8B activated parameters (3B in total) and are trained on [V2PE-Data](https://huggingface.co/datasets/OpenGVLab/V2PE-Data).
+Both models are built upon [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B). For more details, please refer to our [paper](https://arxiv.org/abs/2412.09616).
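The idea of using smaller position increments for visual tokens can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `v2pe_position_ids`, the boolean mask input, and the increment `delta = 1/4` are assumptions chosen for illustration, and the actual increments and how they are selected are described in the paper. Text tokens advance the position index by 1, while visual tokens advance it by `delta`.

```python
from fractions import Fraction


def v2pe_position_ids(is_visual, delta=Fraction(1, 4)):
    """Return one position index per token (illustrative sketch only).

    is_visual: sequence of booleans, True where the token is a visual token.
    delta: increment applied for visual tokens (assumed value); text tokens use 1.
    """
    positions = []
    current = Fraction(0)
    for i, visual in enumerate(is_visual):
        # The first token sits at position 0; every later token advances
        # the index by an amount that depends on its own modality.
        if i > 0:
            current += delta if visual else 1
        positions.append(current)
    return positions


# Two text tokens, four visual tokens, two text tokens.
mask = [False, False, True, True, True, True, False, False]
print([float(p) for p in v2pe_position_ids(mask)])
# [0.0, 1.0, 1.25, 1.5, 1.75, 2.0, 3.0, 4.0]
```

In this toy sequence the last token lands at position 4 instead of 7, which shows how a long run of visual tokens consumes far less of the position range than it would with a uniform increment of 1.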
+## Performance
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/z9KD8fzQ-pkblVkpOKW7J.png)
+**General MLLM Benchmarks**
+| Model                    | #Param | ChartQA | DocVQA | AI2D  | InfoVQA | SQA   | POPE  | MMMU<sub>val</sub> | MMBench<sub>EN</sub> | SEED<sub>I</sub> | Avg   |
+|--------------------------|--------|---------|--------|-------|---------|-------|-------|--------------------|---------------------|------------------|-------|
+| InternVL2-2B | 2.0B   | 71.7    | 86.9   | 74.1  | 58.9    | 94.1  | 85.2  | 36.3               | 73.4                | 70.9             | 72.4  |
+| DeepSeek-VL-1.3B | 2.0B   | 47.4    | -      | 51.5  | -       | 68.4  | 85.9  | 33.8               | 66.4                | 66.0             | -     |
+| Qwen2-VL-2B  | 2.0B   | 73.5    | 90.1   | 74.7  | 65.5    | -     | -     | 41.1               | 74.9                | -                | -     |
+| Aquila-VL-2B | 2.2B   | 32.0    | 85.0   | 75.1  | 58.3    | 95.1  | 83.1  | 46.9               | 79.0                | 73.9             | 69.8  |
+| MiniCPM-V-2 | 2.8B   | 55.6    | 71.9   | 62.9  | -       | 80.7  | 86.3  | 38.2               | 64.1                | 67.1             | -     |
+| Vintern-3B-beta | 3.7B   | 68.3    | -      | 69.1  | -       | 75.0  | 87.4  | 46.7               | 70.6                | 70.0             | -     |
+| Llama 3.2 11B  | 11B    | 83.4    | 88.4   | 91.1  | -       | -     | -     | 50.7               | 68.0                | -                | -     |
+| Qwen2-VL-72B | 73B    | 88.3    | 96.5   | 88.1  | 84.5    | 91.2  | 87.2  | 64.5               | 86.9                | 77.9             | 85.0  |
+| GPT-4o | -      | 85.7    | 92.8   | 84.7  | -       | 90.1  | 97.2  | 69.1               | 82.1                | 76.7             | -     |
+| **InternVL2-V2PE-32K**   | 2.0B   | **76.4** | **83.9** | **73.2** | **55.9**  | **94.9** | **88.8**  | **36.6**             | **73.5**            | **71.2**          | **72.5** |
+
+**Long-Context MLLM Benchmarks**
+| Model                     | #Param | MM-NIAH/Image | MM-NIAH/Text | MM-NIAH/Avg | Milebench/T  | Milebench/S  | Milebench/NI | Milebench/Avg | VideoMME   | MVBench   |
+|--------------------------|--------|---------------|--------------|-------------|--------------|--------------|---------------|--------------|------------|------------|
+| InternVL2-2B            | 2.0B   | 23.0          | 18.9         | 21.0        | 58.2         | 54.5         | 37.0          | 49.9         | -      | -      |
+| Phi-3-Vision            | 2.7B   | -         | -        | -       | 46.9         | 50.0         | -         | -         | -      | -      |
+| OmChat                  | 3.9B   | -         | -        | -       | 51.4         | 52.0         | -         | -         | 45.9       | 50.2       |
+| LongLLaVA               | 9B     | -         | -        | -       | 47.3         | 46.8         | -         | -         | 43.7       | 49.1       |
+| LongLLaVA               | 13B    | -         | -        | -       | 52.7         | 52.1         | -         | -         | 51.6       | 54.6       |
+| VILA                    | 13B    | 14.5          | 40.5         | 27.5        | -        | -        | -         | -         | -      | -      |
+| Gemini-1.5              | -  | 28.5          | 82.1         | 55.2        | 50.2         | 58.3         | 97.9          | **68.8**     | **69.6**   | -      |
+| GPT-4V                  | -  | -         | 84.1     | -       | 45.6         | 58.9         | **99.4**      | 68.0         | 59.9       | 43.5       |
+| GPT-4o                  | -  | -         | -        | -       | 56.2         | **63.5**     | -         | -         | 64.7       | -      |
+| Claude3-Opus            | -  | -         | -        | -       | 37.4         | 48.1         | 85.3          | 56.9         | 59.7       | -      |
+| **InternVL2-V2PE-32K**  | 2.0B   | **78.1**      | **85.7**      | **81.8**    | **65.5**     | 56.4        | 97.2    | 72.5      | 50.7      | **65.6** |
+## Usage
+Please refer to our [GitHub Repo](https://github.com/OpenGVLab/V2PE).
+## License
+This project is released under the MIT License.
+## Citation
+If you find this work helpful in your research, please consider citing:
+```bibtex
+@misc{ge2024v2peimprovingmultimodallongcontext,
+      title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
+      author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
+      year={2024},
+      eprint={2412.09616},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2412.09616},
+}
+```