---
license: mit
datasets:
- OpenGVLab/V2PE-Data
language:
- en
base_model:
- OpenGVLab/InternVL2-2B
new_version: OpenGVLab/V2PE
library_name: transformers
tags:
- V2PE
---
# V2PE
[\[⭐️Project Page\]](https://zzdhybthu.github.io/V2PE.github.io) [\[📜 ArXiv Paper\]](https://arxiv.org/abs/2412.09616) [\[📂 GitHub\]](https://github.com/OpenGVLab/V2PE) [\[📖 HF Datasets\]](https://huggingface.co/datasets/OpenGVLab/V2PE-Data)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/hLydYFXbs8--Th-tOcQIe.png)
## Introduction
Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents.
To address this issue, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens.
Our experiments demonstrate that V2PE enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2-2B. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks.
Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications.
This repository contains the instruction-tuned V2PE-32K-InternVL-2B and V2PE-256K-InternVL-2B models, which have 1.8B activated parameters (3B in total) and are trained on [V2PE-Data](https://huggingface.co/datasets/OpenGVLab/V2PE-Data).
Both models are built upon [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B). For more details, please refer to our [paper](https://arxiv.org/abs/2412.09616).
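
The idea of "variable and smaller increments for visual tokens" can be illustrated with a minimal sketch. This is not the released implementation: the function name `variable_position_ids`, the boolean mask input, and the example value `delta=0.25` are assumptions made for illustration. The sketch assumes text tokens advance the position index by 1 while visual tokens advance it by a smaller step `delta`.

```python
from typing import List


def variable_position_ids(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign a position index to every token in a mixed text/visual sequence.

    Hypothetical helper for illustration only:
    - a text token advances the position by 1
    - a visual token advances the position by `delta`, with 0 < delta <= 1
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must be in (0, 1]")

    positions: List[float] = []
    current = 0.0
    for i, visual in enumerate(is_visual):
        # The first token sits at position 0; later tokens add their own step.
        if i > 0:
            current += delta if visual else 1.0
        positions.append(current)
    return positions


# Two text tokens, four visual tokens, then one text token.
mask = [False, False, True, True, True, True, False]
print(variable_position_ids(mask, delta=0.25))
# -> [0.0, 1.0, 1.25, 1.5, 1.75, 2.0, 3.0]
```

With `delta=1.0` the same sequence would span positions 0 through 6, so a smaller step lets a long run of visual tokens occupy a narrower range of position indices.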
## Performance
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/z9KD8fzQ-pkblVkpOKW7J.png)
**General MLLM Benchmarks**
| Model | #Param | ChartQA | DocVQA | AI2D | InfoVQA | SQA | POPE | MMMU<sub>val</sub> | MMBench<sub>EN</sub> | SEED<sub>I</sub> | Avg |
|--------------------------|--------|---------|--------|-------|---------|-------|-------|--------------------|---------------------|------------------|-------|
| InternVL2-2B | 2.0B | 71.7 | 86.9 | 74.1 | 58.9 | 94.1 | 85.2 | 36.3 | 73.4 | 70.9 | 72.4 |
| DeepSeek-VL-1.3B | 2.0B | 47.4 | - | 51.5 | - | 68.4 | 85.9 | 33.8 | 66.4 | 66.0 | - |
| Qwen2-VL-2B | 2.0B | 73.5 | 90.1 | 74.7 | 65.5 | - | - | 41.1 | 74.9 | - | - |
| Aquila-VL-2B | 2.2B | 32.0 | 85.0 | 75.1 | 58.3 | 95.1 | 83.1 | 46.9 | 79.0 | 73.9 | 69.8 |
| MiniCPM-V-2 | 2.8B | 55.6 | 71.9 | 62.9 | - | 80.7 | 86.3 | 38.2 | 64.1 | 67.1 | - |
| Vintern-3B-beta | 3.7B | 68.3 | - | 69.1 | - | 75.0 | 87.4 | 46.7 | 70.6 | 70.0 | - |
| Llama 3.2 11B | 11B | 83.4 | 88.4 | 91.1 | - | - | - | 50.7 | 68.0 | - | - |
| Qwen2-VL-72B | 73B | 88.3 | 96.5 | 88.1 | 84.5 | 91.2 | 87.2 | 64.5 | 86.9 | 77.9 | 85.0 |
| GPT-4o | - | 85.7 | 92.8 | 84.7 | - | 90.1 | 97.2 | 69.1 | 82.1 | 76.7 | - |
| **InternVL2-V2PE-32K** | 2.0B | **76.4** | **83.9** | **73.2** | **55.9** | **94.9** | **88.8** | **36.6** | **73.5** | **71.2** | **72.5** |
**Long-Context MLLM Benchmarks**
| Model | #Param | MM-NIAH/Image | MM-NIAH/Text | MM-NIAH/Avg | Milebench/T | Milebench/S | Milebench/NI | Milebench/Avg | VideoMME | MVBench |
|--------------------------|--------|---------------|--------------|-------------|--------------|--------------|---------------|--------------|------------|------------|
| InternVL2-2B | 2.0B | 23.0 | 18.9 | 21.0 | 58.2 | 54.5 | 37.0 | 49.9 | - | - |
| Phi-3-Vision | 2.7B | - | - | - | 46.9 | 50.0 | - | - | - | - |
| OmChat | 3.9B | - | - | - | 51.4 | 52.0 | - | - | 45.9 | 50.2 |
| LongLLaVA | 9B | - | - | - | 47.3 | 46.8 | - | - | 43.7 | 49.1 |
| LongLLaVA | 13B | - | - | - | 52.7 | 52.1 | - | - | 51.6 | 54.6 |
| VILA | 13B | 14.5 | 40.5 | 27.5 | - | - | - | - | - | - |
| Gemini-1.5 | - | 28.5 | 82.1 | 55.2 | 50.2 | 58.3 | 97.9 | **68.8** | **69.6** | - |
| GPT-4V | - | - | 84.1 | - | 45.6 | 58.9 | **99.4** | 68.0 | 59.9 | 43.5 |
| GPT-4o | - | - | - | - | 56.2 | **63.5** | - | - | 64.7 | - |
| Claude3-Opus | - | - | - | - | 37.4 | 48.1 | 85.3 | 56.9 | 59.7 | - |
| **InternVL2-V2PE-32K** | 2.0B | **78.1** | **85.7** | **81.8** | **65.5** | 56.4 | 97.2 | 72.5 | 50.7 | **65.6** |
## Usage
Please refer to our [GitHub Repo](https://github.com/OpenGVLab/V2PE).
## License
This project is released under the MIT License.
## Citation
If you find this work helpful in your research, please consider citing:
```bibtex
@misc{ge2024v2peimprovingmultimodallongcontext,
title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
year={2024},
eprint={2412.09616},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.09616},
}
```