Transformers
Safetensors
English
V2PE
Inference Endpoints
Files changed (1) hide show
  1. README.md +88 -0
README.md ADDED
@@ -0,0 +1,88 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - OpenGVLab/V2PE-Data
5
+ language:
6
+ - en
7
+ base_model:
8
+ - OpenGVLab/InternVL2-2B
9
+ new_version: OpenGVLab/V2PE
10
+ library_name: transformers
11
+ tags:
12
+ - V2PE
13
+ ---
14
+ # V2PE
15
+
16
+ [\[⭐️Project Page\]](https://zzdhybthu.github.io/V2PE.github.io) [\[📜 ArXiv Paper\]](https://arxiv.org/abs/2412.09616) [\[📂 GitHub\]](https://github.com/OpenGVLab/V2PE) [\[📖 HF Datasets\]](https://huggingface.co/datasets/OpenGVLab/V2PE-Data)
17
+
18
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/hLydYFXbs8--Th-tOcQIe.png)
19
+
20
+ ## Introduction
21
+
22
+ Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents.
23
+
24
+ To address this issue, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens.
25
+ Our experiments demonstrate the effectiveness of V2PE to enhances VLMs' ability to effectively understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to finetune the open-source VLM, InternVL2-2B. The finetuned model achieves strong performance on both standard and long-context multimodal tasks.
26
+ Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications.
27
+
28
+ This repository contains the instruction-tuned V2PE-32K-InternVL-2B model and V2PE-256K-InternVL-2B model, which have 1.8B activated parameters (3B in total) and are trained on [V2PE-Data](https://huggingface.co/datasets/OpenGVLab/V2PE-Data).
29
+ It is built upon [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B). For more details, please refer to our [paper](https://arxiv.org/abs/2412.09616).
30
+
31
+ ## Performance
32
+
33
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/z9KD8fzQ-pkblVkpOKW7J.png)
34
+
35
+ **General MLLM Benchmarks**
36
+
37
+ | Model | #Param | ChartQA | DocVQA | AI2D | InfoVQA | SQA | POPE | MMMU<sub>val</sub> | MMBench<sub>EN</sub> | SEED<sub>I</sub> | Avg |
38
+ |--------------------------|--------|---------|--------|-------|---------|-------|-------|--------------------|---------------------|------------------|-------|
39
+ | InternVL2-2B | 2.0B | 71.7 | 86.9 | 74.1 | 58.9 | 94.1 | 85.2 | 36.3 | 73.4 | 70.9 | 72.4 |
40
+ | DeepSeek-VL-1.3B | 2.0B | 47.4 | - | 51.5 | - | 68.4 | 85.9 | 33.8 | 66.4 | 66.0 | - |
41
+ | Qwen2-VL-2B | 2.0B | 73.5 | 90.1 | 74.7 | 65.5 | - | - | 41.1 | 74.9 | - | - |
42
+ | Aquila-VL-2B | 2.2B | 32.0 | 85.0 | 75.1 | 58.3 | 95.1 | 83.1 | 46.9 | 79.0 | 73.9 | 69.8 |
43
+ | MiniCPM-V-2 | 2.8B | 55.6 | 71.9 | 62.9 | - | 80.7 | 86.3 | 38.2 | 64.1 | 67.1 | - |
44
+ | Vintern-3B-beta | 3.7B | 68.3 | - | 69.1 | - | 75.0 | 87.4 | 46.7 | 70.6 | 70.0 | - |
45
+ | Llama 3.2 11B | 11B | 83.4 | 88.4 | 91.1 | - | - | - | 50.7 | 68.0 | - | - |
46
+ | Qwen2-VL-72B | 73B | 88.3 | 96.5 | 88.1 | 84.5 | 91.2 | 87.2 | 64.5 | 86.9 | 77.9 | 85.0 |
47
+ | GPT-4o | - | 85.7 | 92.8 | 84.7 | - | 90.1 | 97.2 | 69.1 | 82.1 | 76.7 | - |
48
+ | **InternVL2-V2PE-32K** | 2.0B | **76.4** | **83.9** | **73.2** | **55.9** | **94.9** | **88.8** | **36.6** | **73.5** | **71.2** | **72.5** |
49
+
50
+ **Long-Context MLLM Benchmarks**
51
+
52
+ | Model | #Param | MM-NIAH/Image | MM-NIAH/Text | MM-NIAH/Avg | Milebench/T | Milebench/S | Milebench/NI | Milebench/Avg | VideoMME | MVBench |
53
+ |--------------------------|--------|---------------|--------------|-------------|--------------|--------------|---------------|--------------|------------|------------|
54
+ | InternVL2-2B | 2.0B | 23.0 | 18.9 | 21.0 | 58.2 | 54.5 | 37.0 | 49.9 | - | - |
55
+ | Phi-3-Vision | 2.7B | - | - | - | 46.9 | 50.0 | - | - | - | - |
56
+ | OmChat | 3.9B | - | - | - | 51.4 | 52.0 | - | - | 45.9 | 50.2 |
57
+ | LongLLaVA | 9B | - | - | - | 47.3 | 46.8 | - | - | 43.7 | 49.1 |
58
+ | LongLLaVA | 13B | - | - | - | 52.7 | 52.1 | - | - | 51.6 | 54.6 |
59
+ | VILA | 13B | 14.5 | 40.5 | 27.5 | - | - | - | - | - | - |
60
+ | Gemini-1.5 | - | 28.5 | 82.1 | 55.2 | 50.2 | 58.3 | 97.9 | **68.8** | **69.6** | - |
61
+ | GPT-4V | - | - | 84.1 | - | 45.6 | 58.9 | **99.4** | 68.0 | 59.9 | 43.5 |
62
+ | GPT-4o | - | - | - | - | 56.2 | **63.5** | - | - | 64.7 | - |
63
+ | Claude3-Opus | - | - | - | - | 37.4 | 48.1 | 85.3 | 56.9 | 59.7 | - |
64
+ | **InternVL2-V2PE-32K** | 2.0B | **78.1** | **85.7** | **81.8** | **65.5** | 56.4 | 97.2 | 72.5 | 50.7 | **65.6** |
65
+
66
+ ## Usage
67
+
68
+ Please refer to our [GitHub Repo](https://github.com/OpenGVLab/V2PE).
69
+
70
+ ## License
71
+
72
+ This project is released under the MIT License.
73
+
74
+ ## Citation
75
+
76
+ If you find this work helpful in your research, please consider citing:
77
+
78
+ ```bibtex
79
+ @misc{ge2024v2peimprovingmultimodallongcontext,
80
+ title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
81
+ author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
82
+ year={2024},
83
+ eprint={2412.09616},
84
+ archivePrefix={arXiv},
85
+ primaryClass={cs.CV},
86
+ url={https://arxiv.org/abs/2412.09616},
87
+ }
88
+ ```