KerwinJob committed
Commit 9544b88 • 1 Parent(s): 990f01b

Update README.md

Files changed (1): README.md (+51 -6)

README.md CHANGED
@@ -1,6 +1,6 @@
- # MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
+ # MAmmoTH-VL-8B

- [🏠 Homepage](https://neulab.github.io/Pangea/) | [🤖 MAmmoTH-VL-8B](https://huggingface.co/neulab/Pangea-7B) | [💻 Code](https://github.com/neulab/Pangea/tree/main) | [📄 Arxiv](https://arxiv.org/abs/2410.16153) | [📕 PDF](https://arxiv.org/pdf/2410.16153) | [🖥️ Demo](https://huggingface.co/spaces/neulab/Pangea)
+ [🏠 Homepage](https://mammoth-vl.github.io/) | [🤖 MAmmoTH-VL-8B](https://huggingface.co/MMSFT/MAmmoTH-VL-8B) | [💻 Code](https://github.com/MAmmoTH-VL/MAmmoTH-VL) | [📄 Arxiv](https://arxiv.org/abs/2410.16153) | [📕 PDF](https://arxiv.org/pdf/2410.16153) | [🖥️ Demo](https://huggingface.co/spaces/MMSFT/MAmmoTH-VL-8B)

  # Abstract
  Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales.
@@ -8,7 +8,8 @@ To address these challenges, we introduce a scalable and cost-effective method t


  # Performance
- ## Multi-Discipline Knowledge and Mathematical Reasoning Model Comparison
+
+ ## Multi-Discipline Knowledge and Mathematical Reasoning

  | Model | MMStar | MMMU | MMMU-Pro | SeedBench | MMBench | MMVet | MathVerse | MathVista |
  |-------|--------|------|----------|-----------|---------|-------|-----------|-----------|
@@ -16,20 +17,64 @@ To address these challenges, we introduce a scalable and cost-effective method t
  | Gemini-1.5-Pro | 59.1 | 65.8 | 44.4 | 76.0 | 73.9 | 64.0 | - | 63.9 |
  | Claude-3.5-Sonnet | 62.2 | 68.3 | 48.0 | 72.2 | 79.7 | 75.4 | - | 67.7 |
  | InternVL2-LLaMa3-76B | 67.1 | 58.2 | 38.0 | 77.6 | 86.5 | 64.4 | - | 65.5 |
- | Qwen2-VL-72B | 68.6 | 64.5 | 37.1 | 77.9 | 86.9 | 73.9 | 37.3 | 70.5 |
+ | Qwen2-VL-72B-Ins | 68.6 | 64.5 | 37.1 | 77.9 | 86.9 | 73.9 | 37.3 | 70.5 |
  | LLaVA-OV-72B (SI) | 65.2 | 57.4 | 26.0 | 77.6 | 86.6 | 60.0 | 37.7 | 66.5 |
  | LLaVA-OV-72B | 66.1 | 56.8 | 24.0 | 78.0 | 85.9 | 63.7 | 39.1 | 67.5 |
  | MiniCPM-V-2.6-8B | 57.5 | 49.8 | 21.7 | 74.0 | 81.5 | 60.0 | - | 60.6 |
  | InternLM-XComp-2.5-8B | 59.9 | 42.9 | - | 75.4 | 74.4 | 51.7 | 20.0 | 59.6 |
+ | Llama-3.2-11B-Vision-Ins | 49.8 | 50.7 | 23.7 | 72.7 | 73.2 | 57.6 | 23.6 | 51.5 |
  | InternVL-2-8B | 59.4 | 49.3 | 25.4 | 76.0 | 81.7 | 60.0 | 27.5 | 58.3 |
- | Qwen2-VL-7B | 60.7 | 52.1 | 26.9 | 74.3 | 83.0 | 62.0 | 28.2 | 58.2 |
+ | Qwen2-VL-7B-Ins | 60.7 | 52.1 | 26.9 | 74.3 | 83.0 | 62.0 | 28.2 | 58.2 |
  | Cambrian-1-8B | - | 42.7 | 14.7 | 73.3 | 74.6 | 48.0 | - | 49.0 |
+ | Llava-CoT-11B | 57.6 | 48.9 | 18.5 | 75.2 | 75.0 | 60.3 | 24.2 | 54.8 |
  | Molmo-8B-D | 50.5 | 45.3 | 18.9 | 74.1 | 73.6 | 58.0 | 21.5 | 51.6 |
  | LLaVA-OV-7B (SI) | 60.9 | 47.3 | 16.8 | 74.8 | 80.5 | 58.8 | 26.9 | 56.1 |
  | LLaVA-OV-7B | 61.7 | 48.8 | 18.7 | 75.4 | 80.8 | 58.6 | 26.2 | 63.2 |
  | MAmmoTH-VL-8B (SI) | 55.4 | 49.4 | 26.0 | 73.3 | 83.0 | 60.6 | 35.0 | 67.6 |
  | MAmmoTH-VL-8B | 63.0 | 50.8 | 25.3 | 76.0 | 83.4 | 62.3 | 34.2 | 67.6 |
- | Over Best Open-Source (7-8B Scale) | +1.3 | +2.0 | +7.1 | +0.6 | +2.6 | +3.7 | +8.1 | +4.4 |
+ | Over Best Open-Source (~10B Scale) | +1.3 | +1.9 | +7.1 | +0.6 | +2.6 | +2.0 | +8.1 | +4.4 |
+
+
+
+ ## Chart & Doc Understanding and Multimodal Interactions & Preferences
+
+ | Model | AI2D test | ChartQA test | InfoVQA test | DocVQA test | RealWorldQA test | WildVision 0617 | L-Wilder small |
+ |-------|-----------|--------------|--------------|-------------|------------------|-----------------|----------------|
+ | GPT-4o | 94.2 | 85.7 | 79.2 | 92.8 | 76.5 | 89.4 | 85.9 |
+ | Gemini-1.5-Pro | 94.4 | 87.2 | 81.0 | 93.1 | 70.4 | - | - |
+ | Claude-3.5-Sonnet | 94.7 | 90.8 | 49.7 | 95.2 | 59.9 | 50.0 | 83.1 |
+ | InternVL2-LLaMa3-76B | 88.4 | 88.4 | 82.0 | 94.1 | 72.7 | - | - |
+ | Qwen2-VL-72B-Ins | 88.1 | 88.3 | 84.5 | 96.5 | 77.8 | - | - |
+ | LLaVA-OV-72B (SI) | 85.1 | 84.9 | 74.6 | 91.8 | 73.8 | 49.5 | 72.9 |
+ | LLaVA-OV-72B | 85.6 | 83.7 | 74.9 | 91.3 | 71.9 | 52.3 | 72.0 |
+ | MiniCPM-V-2.6-7B | 82.1 | 82.4 | - | 90.8 | 65.0 | 11.7 | - |
+ | InternLM-XComp-2.5-7B | 81.5 | 82.2 | 70.0 | 90.9 | 67.8 | - | 61.4 |
+ | Llama-3.2-11B-Vision-Ins | 77.3 | 83.4 | 65.0 | 88.4 | 63.3 | 49.7 | 62.0 |
+ | InternVL-2-8B | 83.8 | 83.3 | 74.8 | 91.6 | 64.4 | 51.5 | 62.5 |
+ | Qwen2-VL-7B-Ins | 83.0 | 83.0 | 76.5 | 94.5 | 70.1 | 44.0 | 66.3 |
+ | Cambrian-1-8B | 73.3 | 73.3 | 41.6 | 77.8 | 64.2 | - | - |
+ | Llava-CoT-11B | - | 67.0 | 44.8 | - | - | - | 65.3 |
+ | Molmo-7B-D | 81.0 | 84.1 | 72.6 | 92.2 | 70.7 | 40.0 | - |
+ | LLaVA-OV-7B (SI) | 81.6 | 78.8 | 65.3 | 86.9 | 65.5 | 39.2 | 69.1 |
+ | LLaVA-OV-7B | 81.4 | 80.0 | 68.8 | 87.5 | 66.3 | 53.8 | 67.8 |
+ | MAmmoTH-VL-8B (SI) | 83.4 | 85.9 | 74.8 | 93.8 | 71.3 | 51.9 | 71.3 |
+ | MAmmoTH-VL-8B | 84.0 | 86.2 | 73.1 | 93.7 | 69.9 | 51.1 | 70.8 |
+ | Over Best Open-Source (~10B Scale) | +2.4 | +2.1 | +2.2 | +1.6 | +0.6 | -1.9 | +2.2 |
+
+ ## Multi-Image and Video
+
+ | Model | MuirBench test | MEGABench test | EgoSchema test | PerceptionTest test | SeedBench video | MLVU dev | MVBench test | VideoMME w/o subs |
+ |-------|----------------|----------------|----------------|---------------------|-----------------|----------|--------------|-------------------|
+ | GPT-4o | 68.0 | 54.2 | - | - | - | 64.6 | - | 71.9 |
+ | GPT-4V | 62.3 | - | - | - | 60.5 | 49.2 | 43.5 | 59.9 |
+ | LLaVA-OV-72B (SI) | 33.2 | - | 58.6 | 62.3 | 60.9 | 60.9 | 57.1 | 64.8 |
+ | LLaVA-OV-72B | 54.8 | 33.8 | 62.0 | 66.9 | 62.1 | 66.4 | 59.4 | 66.2 |
+ | InternVL-2-8B | 59.4 | 27.7 | 54.2 | 57.4 | 54.9 | 30.2 | 66.4 | 54.0 |
+ | Qwen2-VL-7B-Ins | 41.6 | 36.0 | 66.7 | 62.3 | 55.3 | 58.6 | 67.0 | 63.3 |
+ | LLaVA-OV-7B (SI) | 32.7 | 22.1 | 52.9 | 54.9 | 51.1 | 60.2 | 51.2 | 55.0 |
+ | LLaVA-OV-7B | 41.8 | 23.9 | 60.1 | 57.1 | 56.9 | 64.7 | 56.7 | 58.2 |
+ | MAmmoTH-VL-8B | 55.1 | 28.2 | 58.5 | 59.3 | 57.1 | 64.7 | 59.1 | 58.8 |
+ | Over Best Open-Source (~10B Scale) | +13.3 | +4.3 | -1.6 | +2.2 | +0.2 | +0.0 | +2.4 | +0.6 |


  ## Citing the Model