Yirany committed on
Commit
282b2eb
•
1 Parent(s): ab38e50

Update README.md

Files changed (1)
  1. README.md +282 -15
README.md CHANGED
@@ -10,38 +10,305 @@ datasets:
10
  [GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Demo](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5)
11
 
12
 
13
- ## MiniCPM-Llama3-V 2.5
14
 
15
  **MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
16
 
17
  - 🔥 **Leading Performance.**
18
- MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **It surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 with 8B parameters**, greatly outperforming other multimodal large models built on Llama 3.
19
 
20
  - 💪 **Strong OCR Capabilities.**
21
- MiniCPM-Llama3-V 2.5 can process images with any aspect ratio up to 1.8 million pixels, achieving an **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.
22
 
23
  - πŸ† **Trustworthy Behavior.**
24
- Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technology in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits trustworthy multimodal behavior. It achieves **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best level within the open-source community.
25
 
26
  - 🌏 **Multilingual Support.**
27
- Thanks to Llama 3’s robust multilingual capabilities and VisCPM's cross-lingual generalization technology, MiniCPM-Llama3-V 2.5 extends its foundational bilingual (Chinese-English) multimodal capabilities to support **30+ languages including German, French, Spanish, Italian, Russian etc.** We achieve this extension through only minimal instruction-tuning with translated multimodal data. [All Supported Languages](https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpm-llama-v-2-5_languages.md).
28
 
29
  - 🚀 **Efficient Deployment.**
30
- MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations** as acceleration techniques, achieving high-efficiency deployment on edge devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150-fold acceleration in multimodal large model edge-side image encoding** and a **3-fold increase in language decoding speed**.
31
 
32
  ### Evaluation <!-- omit in toc -->
33
 
34
- <div align="center">
35
- <img src=/openbmb/MiniCPM-Llama3-V-2_5/resolve/main/assets/MiniCPM-Llama3-V-2.5-peformance.png width=80% />
36
- </div>
37
-
38
  Results on TextVQA, DocVQA, OCRBench, OpenCompass MultiModal Avg, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench.
39
- <div align="center">
40
- <img src=/openbmb/MiniCPM-Llama3-V-2_5/resolve/main/assets/MiniCPM-Llama3-V-2.5-benchmark.png width=140% />
41
- </div>
42
 
43
 
44
- Results of LLaVA Bench in multiple languages.
45
  <div align="center">
46
  <img src="assets/llavabench_compare.png" width="80%" />
47
  </div>
@@ -73,7 +340,7 @@ We deploy MiniCPM-Llama3-V 2.5 on end devices. The demo video is the raw screen
73
 
74
 
75
  ## Demo
76
- Click here to try out the Demo of [MiniCPM-Llama3-V 2.5](http://120.92.209.146:8889).
77
 
78
  ## Deployment on Mobile Phone
79
  Coming soon.
 
10
  [GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Demo](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5)
11
 
12
 
13
+ ## News <!-- omit in toc -->
14
+
15
+ * [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations and multilingual capabilities 🌟📊🌍. Click [here](./docs/compare_with_phi-3_vision.md) to view more details.
16
+ * [2024.05.20] We open-source MiniCPM-Llama3-V 2.5! It has improved OCR capability and supports 30+ languages, making it the first end-side MLLM to achieve GPT-4V-level performance. We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](./finetune/readme.md). Try it now!
17
+
18
+
19
+ ## Model Summary
20
 
21
  **MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
22
 
23
  - 🔥 **Leading Performance.**
24
+ MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.
25
 
26
  - 💪 **Strong OCR Capabilities.**
27
+ MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, it now provides enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, with further strengthened instruction-following and complex-reasoning abilities for a better multimodal interaction experience (see the inference sketch after this list).
28
 
29
  - πŸ† **Trustworthy Behavior.**
30
+ Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%) and the best result within the open-source community.
31
 
32
  - 🌏 **Multilingual Support.**
33
+ Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages, including German, French, Spanish, Italian and Russian**. [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).
34
 
35
  - 🚀 **Efficient Deployment.**
36
+ MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on edge devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150-fold speedup in end-side multimodal image encoding** and a **3-fold increase in language decoding speed** (a quantized-loading sketch appears at the end of this card).
37
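To make these capabilities concrete, here is a minimal inference sketch. It assumes the `trust_remote_code` interface that MiniCPM-V model cards expose through Hugging Face `transformers` (an `AutoModel` whose remote code provides a `chat` helper); the image file name and prompts are illustrative placeholders, not part of this repository.

```python
# Minimal inference sketch (assumes the MiniCPM-V trust_remote_code interface;
# the image path and prompts are illustrative placeholders).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5'
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.float16).to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('receipt.jpg').convert('RGB')  # hypothetical test image

# Exercise the OCR / table-to-markdown capability described above.
msgs = [{'role': 'user', 'content': 'Transcribe any table in this image as Markdown.'}]
print(model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                 sampling=True, temperature=0.7))

# Multilingual support: the same call accepts prompts in other languages.
msgs = [{'role': 'user', 'content': 'Beschreibe dieses Bild.'}]
print(model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                 sampling=True, temperature=0.7))
```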
 
38
  ### Evaluation <!-- omit in toc -->
39

40
  Results on TextVQA, DocVQA, OCRBench, OpenCompass MultiModal Avg, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench.
41
+ <table style="margin: 0px auto;">
42
+ <thead>
43
+ <tr>
44
+ <th align="left">Model</th>
45
+ <th>Size</th>
46
+ <th>OCRBench</th>
47
+ <th>TextVQA val</th>
48
+ <th>DocVQA test</th>
49
+ <th>Open-Compass</th>
50
+ <th>MME</th>
51
+ <th>MMB test (en)</th>
52
+ <th>MMB test (cn)</th>
53
+ <th>MMMU val</th>
54
+ <th>Math-Vista</th>
55
+ <th>LLaVA Bench</th>
56
+ <th>RealWorld QA</th>
57
+ <th>Object HalBench</th>
58
+ </tr>
59
+ </thead>
60
+ <tbody align="center">
61
+ <tr>
62
+ <td colspan="14" align="left"><strong>Proprietary</strong></td>
63
+ </tr>
64
+ <tr>
65
+ <td nowrap="nowrap" align="left">Gemini Pro</td>
66
+ <td>-</td>
67
+ <td>680</td>
68
+ <td>74.6</td>
69
+ <td>88.1</td>
70
+ <td>62.9</td>
71
+ <td>2148.9</td>
72
+ <td>73.6</td>
73
+ <td>74.3</td>
74
+ <td>48.9</td>
75
+ <td>45.8</td>
76
+ <td>79.9</td>
77
+ <td>60.4</td>
78
+ <td>-</td>
79
+ </tr>
80
+ <tr>
81
+ <td nowrap="nowrap" align="left">GPT-4V (2023.11.06)</td>
82
+ <td>-</td>
83
+ <td>645</td>
84
+ <td>78.0</td>
85
+ <td>88.4</td>
86
+ <td>63.5</td>
87
+ <td>1771.5</td>
88
+ <td>77.0</td>
89
+ <td>74.4</td>
90
+ <td>53.8</td>
91
+ <td>47.8</td>
92
+ <td>93.1</td>
93
+ <td>63.0</td>
94
+ <td>86.4</td>
95
+ </tr>
96
+ <tr>
97
+ <td colspan="14" align="left"><strong>Open-source</strong></td>
98
+ </tr>
99
+ <tr>
100
+ <td nowrap="nowrap" align="left">Mini-Gemini</td>
101
+ <td>2.2B</td>
102
+ <td>-</td>
103
+ <td>56.2</td>
104
+ <td>34.2*</td>
105
+ <td>-</td>
106
+ <td>1653.0</td>
107
+ <td>-</td>
108
+ <td>-</td>
109
+ <td>31.7</td>
110
+ <td>-</td>
111
+ <td>-</td>
112
+ <td>-</td>
113
+ <td>-</td>
114
+ </tr>
115
+ <tr>
116
+ <td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
117
+ <td>9.6B</td>
118
+ <td>488</td>
119
+ <td>61.5</td>
120
+ <td>62.6</td>
121
+ <td>51.6</td>
122
+ <td>1860.0</td>
123
+ <td>61.8</td>
124
+ <td>56.3</td>
125
+ <td>37.0</td>
126
+ <td>33.8</td>
127
+ <td>67.7</td>
128
+ <td>49.3</td>
129
+ <td>56.2</td>
130
+ </tr>
131
+ <tr>
132
+ <td nowrap="nowrap" align="left">DeepSeek-VL-7B</td>
133
+ <td>7.3B</td>
134
+ <td>435</td>
135
+ <td>64.7*</td>
136
+ <td>47.0*</td>
137
+ <td>54.6</td>
138
+ <td>1765.4</td>
139
+ <td>73.8</td>
140
+ <td>71.4</td>
141
+ <td>38.3</td>
142
+ <td>36.8</td>
143
+ <td>77.8</td>
144
+ <td>54.2</td>
145
+ <td>-</td>
146
+ </tr>
147
+ <tr>
148
+ <td nowrap="nowrap" align="left">Yi-VL-34B</td>
149
+ <td>34B</td>
150
+ <td>290</td>
151
+ <td>43.4*</td>
152
+ <td>16.9*</td>
153
+ <td>52.2</td>
154
+ <td><strong>2050.2</strong></td>
155
+ <td>72.4</td>
156
+ <td>70.7</td>
157
+ <td>45.1</td>
158
+ <td>30.7</td>
159
+ <td>62.3</td>
160
+ <td>54.8</td>
161
+ <td>79.3</td>
162
+ </tr>
163
+ <tr>
164
+ <td nowrap="nowrap" align="left">CogVLM-Chat</td>
165
+ <td>17.4B</td>
166
+ <td>590</td>
167
+ <td>70.4</td>
168
+ <td>33.3*</td>
169
+ <td>54.2</td>
170
+ <td>1736.6</td>
171
+ <td>65.8</td>
172
+ <td>55.9</td>
173
+ <td>37.3</td>
174
+ <td>34.7</td>
175
+ <td>73.9</td>
176
+ <td>60.3</td>
177
+ <td>73.6</td>
178
+ </tr>
179
+ <tr>
180
+ <td nowrap="nowrap" align="left">TextMonkey</td>
181
+ <td>9.7B</td>
182
+ <td>558</td>
183
+ <td>64.3</td>
184
+ <td>66.7</td>
185
+ <td>-</td>
186
+ <td>-</td>
187
+ <td>-</td>
188
+ <td>-</td>
189
+ <td>-</td>
190
+ <td>-</td>
191
+ <td>-</td>
192
+ <td>-</td>
193
+ <td>-</td>
194
+ </tr>
195
+ <tr>
196
+ <td nowrap="nowrap" align="left">Idefics2</td>
197
+ <td>8.0B</td>
198
+ <td>-</td>
199
+ <td>73.0</td>
200
+ <td>74.0</td>
201
+ <td>57.2</td>
202
+ <td>1847.6</td>
203
+ <td>75.7</td>
204
+ <td>68.6</td>
205
+ <td>45.2</td>
206
+ <td>52.2</td>
207
+ <td>49.1</td>
208
+ <td>60.7</td>
209
+ <td>-</td>
210
+ </tr>
211
+ <tr>
212
+ <td nowrap="nowrap" align="left">Bunny-LLama-3-8B</td>
213
+ <td>8.4B</td>
214
+ <td>-</td>
215
+ <td>-</td>
216
+ <td>-</td>
217
+ <td>54.3</td>
218
+ <td>1920.3</td>
219
+ <td>77.0</td>
220
+ <td>73.9</td>
221
+ <td>41.3</td>
222
+ <td>31.5</td>
223
+ <td>61.2</td>
224
+ <td>58.8</td>
225
+ <td>-</td>
226
+ </tr>
227
+ <tr>
228
+ <td nowrap="nowrap" align="left">LLaVA-NeXT Llama-3-8B</td>
229
+ <td>8.4B</td>
230
+ <td>-</td>
231
+ <td>-</td>
232
+ <td>78.2</td>
233
+ <td>-</td>
234
+ <td>1971.5</td>
235
+ <td>-</td>
236
+ <td>-</td>
237
+ <td>41.7</td>
238
+ <td>37.5</td>
239
+ <td>80.1</td>
240
+ <td>60.0</td>
241
+ <td>-</td>
242
+ </tr>
243
+ <tr>
244
+ <td nowrap="nowrap" align="left">Phi-3-vision-128k-instruct</td>
245
+ <td>4.2B</td>
246
+ <td>639*</td>
247
+ <td>70.9</td>
248
+ <td>-</td>
249
+ <td>-</td>
250
+ <td>1537.5*</td>
251
+ <td>-</td>
252
+ <td>-</td>
253
+ <td>40.4</td>
254
+ <td>44.5</td>
255
+ <td>64.2*</td>
256
+ <td>58.8*</td>
257
+ <td>-</td>
258
+ </tr>
259
+ <tr style="background-color: #e6f2ff;">
260
+ <td nowrap="nowrap" align="left">MiniCPM-V 1.0</td>
261
+ <td>2.8B</td>
262
+ <td>366</td>
263
+ <td>60.6</td>
264
+ <td>38.2</td>
265
+ <td>47.5</td>
266
+ <td>1650.2</td>
267
+ <td>64.1</td>
268
+ <td>62.6</td>
269
+ <td>38.3</td>
270
+ <td>28.9</td>
271
+ <td>51.3</td>
272
+ <td>51.2</td>
273
+ <td>78.4</td>
274
+ </tr>
275
+ <tr style="background-color: #e6f2ff;">
276
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.0</td>
277
+ <td>2.8B</td>
278
+ <td>605</td>
279
+ <td>74.1</td>
280
+ <td>71.9</td>
281
+ <td>54.5</td>
282
+ <td>1808.6</td>
283
+ <td>69.1</td>
284
+ <td>66.5</td>
285
+ <td>38.2</td>
286
+ <td>38.7</td>
287
+ <td>69.2</td>
288
+ <td>55.8</td>
289
+ <td>85.5</td>
290
+ </tr>
291
+ <tr style="background-color: #e6f2ff;">
292
+ <td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td>
293
+ <td>8.5B</td>
294
+ <td><strong>725</strong></td>
295
+ <td><strong>76.6</strong></td>
296
+ <td><strong>84.8</strong></td>
297
+ <td><strong>65.1</strong></td>
298
+ <td>2024.6</td>
299
+ <td><strong>77.2</strong></td>
300
+ <td><strong>74.2</strong></td>
301
+ <td><strong>45.8</strong></td>
302
+ <td><strong>54.3</strong></td>
303
+ <td><strong>86.7</strong></td>
304
+ <td><strong>63.5</strong></td>
305
+ <td><strong>89.7</strong></td>
306
+ </tr>
307
+ </tbody>
308
+ </table>
309
 
310
 
311
+ Evaluation results on multilingual LLaVA Bench.
312
  <div align="center">
313
  <img src="assets/llavabench_compare.png" width="80%" />
314
  </div>
 
340
 
341
 
342
  ## Demo
343
+ Click here to try out the Demo of [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).
344
 
345
  ## Deployment on Mobile Phone
346
  Coming soon.
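
Until mobile instructions land, a hedged sketch for memory-constrained GPUs follows. The int4 checkpoint id below is an assumption based on the series' naming convention; verify that a quantized variant actually exists before relying on it.

```python
# Hedged low-memory loading sketch. The int4 repo id is an assumption from the
# series' naming convention; substitute the real quantized checkpoint if it differs.
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5-int4'  # assumed quantized variant
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# model.chat(...) is then used exactly as in the full-precision sketch above.
```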