---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-7B-Instruct
---
## POINTS-1-5-Qwen-2-5-7B-Chat

### Introduction
We are excited to release POINTS1.5, the latest update to the WePOINTS series. It is a much stronger model than POINTS, integrating recent advances in vision-language models with new techniques proposed by researchers from WeChat AI.

<p align="center">
🏠 <a href="https://github.com/WePOINTS/WePOINTS">GitHub</a>   |   📑 <a href="">Paper (coming soon)</a>
</p>

### What's new in POINTS1.5?

**Key Innovations**

1. **Native Dynamic High Resolution**: In line with the recent trend in vision-language models, we have replaced the original CLIP vision encoder with a NaViT-style vision encoder. The new encoder can process images at various resolutions without the need for splitting (an illustrative sketch follows this list).

2. **Bilingual Support**: Most of the pre-training and visual instruction tuning datasets in POINTS are in English. In this update, we have added support for Chinese, with plans to include more languages in the future. For the pre-training stage, we followed the strategy proposed in POINTS and created an additional 1 million Chinese pre-training samples. For the visual instruction tuning stage, we supplemented the original English datasets used in POINTS with a series of Chinese visual instruction tuning datasets sourced from the open-source community. We also collected images and generated the corresponding textual question-and-answer pairs using a combination of manual and automated methods. These visual instruction tuning datasets cover various domains, such as optical character recognition and general conversation.

3. **Quality Control**: We conducted a series of quality control checks on both the pre-training and visual instruction tuning datasets. For instance, we filtered the pre-training dataset by perplexity, following the strategy proposed in POINTS (a minimal filtering sketch follows this list). For the visual instruction tuning datasets, we applied a combination of filtering strategies, such as removing samples with grammatical errors.

4. **Model Soup**: In line with POINTS, we also applied model soup techniques to further enhance performance (a weight-averaging sketch follows this list).
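
To make the "no splitting" point in item 1 concrete, here is a toy sketch of how a NaViT-style encoder can tokenize an image at roughly its native resolution, producing a variable-length patch sequence instead of resizing every image to one fixed square size or cutting it into tiles. The patch size of 14 and the rounding scheme are assumptions chosen for illustration, not the exact POINTS1.5 preprocessing.

```python
from PIL import Image

PATCH_SIZE = 14  # assumed patch size for this illustration only


def native_resolution_patch_grid(image: Image.Image, patch_size: int = PATCH_SIZE):
    """Return (rows, cols, num_patches) for an image kept at its native resolution."""
    width, height = image.size
    # Round each side up to a whole number of patches instead of resampling the
    # image to a single fixed resolution or splitting it into fixed-size tiles.
    cols = (width + patch_size - 1) // patch_size
    rows = (height + patch_size - 1) // patch_size
    return rows, cols, rows * cols


# A wide document page and a small icon yield visual token sequences of different lengths.
print(native_resolution_patch_grid(Image.new('RGB', (1260, 420))))  # (30, 90, 2700)
print(native_resolution_patch_grid(Image.new('RGB', (112, 112))))   # (8, 8, 64)
```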
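
The perplexity filtering mentioned in item 3 can be pictured as follows: score the text of each pre-training sample with a reference language model and drop the samples whose perplexity falls in the worst tail. This is a minimal sketch of the general idea only; the scoring model (`gpt2`) and the 20% drop ratio are assumptions for illustration, not the actual pipeline or threshold used for POINTS1.5.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical scoring model; any small causal LM works for this illustration.
scorer_name = 'gpt2'
scorer = AutoModelForCausalLM.from_pretrained(scorer_name).eval()
scorer_tokenizer = AutoTokenizer.from_pretrained(scorer_name)


@torch.no_grad()
def perplexity(text: str) -> float:
    inputs = scorer_tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    # The language-modeling loss is the mean negative log-likelihood of the tokens,
    # so exp(loss) is the perplexity of the sample under the scoring model.
    loss = scorer(**inputs, labels=inputs['input_ids']).loss
    return float(torch.exp(loss))


def filter_by_perplexity(samples, drop_ratio=0.2):
    # Keep the lower-perplexity (more fluent) samples and drop the worst tail.
    ranked = sorted(samples, key=perplexity)
    return ranked[:int(len(ranked) * (1 - drop_ratio))]
```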
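
In its simplest uniform form, the model soup mentioned in item 4 averages the weights of several fine-tuned checkpoints that share one architecture into a single model. The sketch below shows uniform weight averaging; the checkpoint paths are hypothetical, and the actual checkpoint selection and averaging recipe used for POINTS1.5 are not specified here.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical fine-tuned checkpoints of the same architecture.
ckpt_paths = ['ckpt_run_a', 'ckpt_run_b', 'ckpt_run_c']
state_dicts = [
    AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32).state_dict()
    for path in ckpt_paths
]

# Uniform soup: element-wise mean of every floating-point parameter.
soup = {}
for name, tensor in state_dicts[0].items():
    if tensor.is_floating_point():
        soup[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
    else:
        soup[name] = tensor  # copy non-float buffers unchanged

souped_model = AutoModelForCausalLM.from_pretrained(ckpt_paths[0], torch_dtype=torch.float32)
souped_model.load_state_dict(soup)
```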
<div style="display: flex; justify-content: space-between; gap: 5px;">
<img src="https://github.com/user-attachments/assets/a2fd1f54-e36c-45ea-870e-b5be07310e29" alt="model development" style="width: 48%;"/>
<img src="https://github.com/user-attachments/assets/c1c5c55e-bcce-4187-b167-084868be99d8" alt="model architecture" style="width: 48%;"/>
</div>

### How to use POINTS1.5?
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from wepoints.utils.images import Qwen2ImageProcessorForPOINTSV15
import torch
from PIL import Image
import requests
from io import BytesIO


model_path = 'WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat'
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             torch_dtype=torch.float16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2ImageProcessorForPOINTSV15.from_pretrained(model_path)


# Download an example image and save it locally so it can be referenced by path.
image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
response = requests.get(image_url)
image_data = BytesIO(response.content)
pil_image = Image.open(image_data)
pil_image.save('image.jpg')
prompt = 'please describe the image in detail'

content = [
    dict(type='image', image='image.jpg'),
    dict(type='text', text=prompt)
]
messages = [
    {
        'role': 'user',
        'content': content
    }
]
generation_config = {
    'max_new_tokens': 1024,
    'temperature': 0.0,
    'top_p': 0.0,
    'num_beams': 1,
}
response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```
### Evaluation

| Benchmark | Qwen2-VL-7B | POINTS-7B | POINTS1.5-7B |
| :---------------: | :---------: | :-------: | :----------: |
| MMBench-TEST-avg | 81.0 | 78.0 | 80.7 |
| MMStar | 60.7 | 60.9 | 61.1 |
| MMMU | 53.7 | 51.4 | 53.8 |
| MathVista | 61.4 | 63.0 | 66.4 |
| HallucinationBench | 50.4 | 45.6 | 50.0 |
| AI2D | 83.0 | 81.2 | 81.4 |
| OCRBench | 84.3 | 71.7 | 82.3 |
| MMVet | 61.8 | 47.9 | 62.2 |
| Average | 67.0 | 62.5 | 67.4 |

### Citation

If you find our work helpful, feel free to cite us:

```
@article{liu2024points15,
  title={POINTS1.5: Building a Vision-Language Model towards Real World Applications},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Gao, Xinyu and Yu, Kavio and Yu, Yang and Zhou, Jie},
  journal={Coming soon},
  year={2024}
}

@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}
```