Commit fca291e
Parent(s): 81d64a9

Create README.md (#1)

- Create README.md (081f6f15a3ef411b267fceb4547f9ff6851e0597)

Co-authored-by: Lin <dreamerlin@users.noreply.huggingface.co>

README.md ADDED
@@ -0,0 +1,88 @@
---
license: mit
datasets:
- OpenGVLab/V2PE-Data
language:
- en
base_model:
- OpenGVLab/InternVL2-2B
new_version: OpenGVLab/V2PE
library_name: transformers
tags:
- V2PE
---

# V2PE

[\[⭐️Project Page\]](https://zzdhybthu.github.io/V2PE.github.io) [\[📜 ArXiv Paper\]](https://arxiv.org/abs/2412.09616) [\[📂 GitHub\]](https://github.com/OpenGVLab/V2PE) [\[📖 HF Datasets\]](https://huggingface.co/datasets/OpenGVLab/V2PE-Data)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/hLydYFXbs8--Th-tOcQIe.png)

## Introduction

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents.

To address this issue, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens.
Our experiments demonstrate that V2PE enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2-2B. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks.
Notably, when the training sequence length is increased to 256K tokens, the model is capable of processing multimodal sequences of up to 1M tokens, highlighting its potential for real-world long-context applications.

This repository contains the instruction-tuned V2PE-32K-InternVL-2B and V2PE-256K-InternVL-2B models, which have 1.8B activated parameters (3B in total) and are trained on [V2PE-Data](https://huggingface.co/datasets/OpenGVLab/V2PE-Data).
They are built upon [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B). For more details, please refer to our [paper](https://arxiv.org/abs/2412.09616).
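
To make the idea concrete, below is a minimal, illustrative sketch of position-index assignment with a smaller increment for visual tokens. The function name, the `is_visual` mask convention, and the fixed `delta` value are ours for illustration only; in V2PE the increment is variable rather than fixed, and the reference implementation lives in the [GitHub repo](https://github.com/OpenGVLab/V2PE).

```python
import torch

def v2pe_position_ids(is_visual: torch.Tensor, delta: float = 0.25) -> torch.Tensor:
    """Toy sketch of the V2PE idea: text tokens advance the position index by 1,
    while visual tokens advance it by a smaller increment `delta`, so long runs
    of image tokens consume far less of the positional range."""
    inc = torch.where(is_visual, torch.tensor(delta), torch.tensor(1.0))
    # Position of token i is the sum of the increments of all preceding tokens.
    return torch.cumsum(inc, dim=0) - inc

# Two text tokens, four visual tokens, one text token.
is_visual = torch.tensor([False, False, True, True, True, True, False])
print(v2pe_position_ids(is_visual))
# tensor([0.0000, 1.0000, 2.0000, 2.2500, 2.5000, 2.7500, 3.0000])
```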

## Performance

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/z9KD8fzQ-pkblVkpOKW7J.png)

**General MLLM Benchmarks**

| Model | #Param | ChartQA | DocVQA | AI2D | InfoVQA | SQA | POPE | MMMU<sub>val</sub> | MMBench<sub>EN</sub> | SEED<sub>I</sub> | Avg |
|------|--------|---------|--------|------|---------|-----|------|--------------------|----------------------|------------------|-----|
| InternVL2-2B | 2.0B | 71.7 | 86.9 | 74.1 | 58.9 | 94.1 | 85.2 | 36.3 | 73.4 | 70.9 | 72.4 |
| DeepSeek-VL-1.3B | 2.0B | 47.4 | - | 51.5 | - | 68.4 | 85.9 | 33.8 | 66.4 | 66.0 | - |
| Qwen2-VL-2B | 2.0B | 73.5 | 90.1 | 74.7 | 65.5 | - | - | 41.1 | 74.9 | - | - |
| Aquila-VL-2B | 2.2B | 32.0 | 85.0 | 75.1 | 58.3 | 95.1 | 83.1 | 46.9 | 79.0 | 73.9 | 69.8 |
| MiniCPM-V-2 | 2.8B | 55.6 | 71.9 | 62.9 | - | 80.7 | 86.3 | 38.2 | 64.1 | 67.1 | - |
| Vintern-3B-beta | 3.7B | 68.3 | - | 69.1 | - | 75.0 | 87.4 | 46.7 | 70.6 | 70.0 | - |
| Llama 3.2 11B | 11B | 83.4 | 88.4 | 91.1 | - | - | - | 50.7 | 68.0 | - | - |
| Qwen2-VL-72B | 73B | 88.3 | 96.5 | 88.1 | 84.5 | 91.2 | 87.2 | 64.5 | 86.9 | 77.9 | 85.0 |
| GPT-4o | - | 85.7 | 92.8 | 84.7 | - | 90.1 | 97.2 | 69.1 | 82.1 | 76.7 | - |
| **InternVL2-V2PE-32K** | 2.0B | **76.4** | **83.9** | **73.2** | **55.9** | **94.9** | **88.8** | **36.6** | **73.5** | **71.2** | **72.5** |

**Long-Context MLLM Benchmarks**

| Model | #Param | MM-NIAH/Image | MM-NIAH/Text | MM-NIAH/Avg | Milebench/T | Milebench/S | Milebench/NI | Milebench/Avg | VideoMME | MVBench |
|------|--------|---------------|--------------|-------------|-------------|-------------|--------------|---------------|----------|---------|
| InternVL2-2B | 2.0B | 23.0 | 18.9 | 21.0 | 58.2 | 54.5 | 37.0 | 49.9 | - | - |
| Phi-3-Vision | 2.7B | - | - | - | 46.9 | 50.0 | - | - | - | - |
| OmChat | 3.9B | - | - | - | 51.4 | 52.0 | - | - | 45.9 | 50.2 |
| LongLLaVA | 9B | - | - | - | 47.3 | 46.8 | - | - | 43.7 | 49.1 |
| LongLLaVA | 13B | - | - | - | 52.7 | 52.1 | - | - | 51.6 | 54.6 |
| VILA | 13B | 14.5 | 40.5 | 27.5 | - | - | - | - | - | - |
| Gemini-1.5 | - | 28.5 | 82.1 | 55.2 | 50.2 | 58.3 | 97.9 | **68.8** | **69.6** | - |
| GPT-4V | - | - | 84.1 | - | 45.6 | 58.9 | **99.4** | 68.0 | 59.9 | 43.5 |
| GPT-4o | - | - | - | - | 56.2 | **63.5** | - | - | 64.7 | - |
| Claude3-Opus | - | - | - | - | 37.4 | 48.1 | 85.3 | 56.9 | 59.7 | - |
| **InternVL2-V2PE-32K** | 2.0B | **78.1** | **85.7** | **81.8** | **65.5** | 56.4 | 97.2 | 72.5 | 50.7 | **65.6** |

## Usage

Please refer to our [GitHub Repo](https://github.com/OpenGVLab/V2PE).
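
As a quick start, the sketch below shows how such a checkpoint could be loaded with 🤗 Transformers. The repository ID is a placeholder, and we assume the model inherits the `trust_remote_code` loading path and `.chat()` interface of its [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B) base; treat the GitHub repo as the authoritative reference.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/V2PE-32K"  # placeholder: replace with the actual Hub ID of this model

# Load with trust_remote_code, as for the InternVL2-2B base model.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only chat, assuming the InternVL2-style `.chat()` API is exposed by the remote code.
generation_config = dict(max_new_tokens=512, do_sample=False)
question = "Hello, who are you?"
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```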

## License

This project is released under the MIT License.

## Citation

If you find this work helpful in your research, please consider citing:

```bibtex
@misc{ge2024v2peimprovingmultimodallongcontext,
      title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
      author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
      year={2024},
      eprint={2412.09616},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09616},
}
```