Shanshan Wang committed
Commit 3ff7e45 • 1 Parent(s): 8f8b827

updated readme

Files changed (1)
  1. README.md +30 -19
README.md CHANGED
@@ -13,6 +13,12 @@ thumbnail: >-
  pipeline_tag: text-generation
  ---
  # Model Card
+ [\[📜 H2OVL-Mississippi Paper\]](https://arxiv.org/abs/2410.13611)
+ [\[🤗 HF Demo\]](https://huggingface.co/spaces/h2oai/h2ovl-mississippi)
+ [\[🚀 Quick Start\]](#quick-start)
+
+
+
  The H2OVL-Mississippi-2B is a high-performing, general-purpose vision-language model developed by H2O.ai to handle a wide range of multimodal tasks. This model, with 2 billion parameters, excels in tasks such as image captioning, visual question answering (VQA), and document understanding, while maintaining efficiency for real-world applications.

  The Mississippi-2B model builds on the strong foundations of our H2O-Danube language models, now extended to integrate vision and language tasks. It competes with larger models across various benchmarks, offering a versatile and scalable solution for document AI, OCR, and multimodal reasoning.
@@ -30,7 +36,29 @@ The Mississippi-2B model builds on the strong foundations of our H2O-Danube lang
  - Optimized for Vision-Language Tasks: Achieves high performance across a wide range of applications, including document AI, OCR, and multimodal reasoning.
  - Comprehensive Dataset: Trained on 17M image-text pairs, ensuring broad coverage and strong task generalization.

- ## Usage
+
+ ## Benchmarks
+
+ ### Performance Comparison of Similar Sized Models Across Multiple Benchmarks - OpenVLM Leaderboard
+
+ | **Models**                 | **Params (B)** | **Avg. Score** | **MMBench** | **MMStar** | **MMMU<sub>VAL</sub>** | **Math Vista** | **Hallusion** | **AI2D<sub>TEST</sub>** | **OCRBench** | **MMVet** |
+ |----------------------------|----------------|----------------|-------------|------------|------------------------|----------------|---------------|-------------------------|--------------|-----------|
+ | Qwen2-VL-2B                | 2.1            | **57.2**       | **72.2**    | 47.5       | 42.2                   | 47.8           | **42.4**      | 74.7                    | **797**      | **51.5**  |
+ | **H2OVL-Mississippi-2B**   | 2.1            | 54.4           | 64.8        | 49.6       | 35.2                   | **56.8**       | 36.4          | 69.9                    | 782          | 44.7      |
+ | InternVL2-2B               | 2.1            | 53.9           | 69.6        | **49.8**   | 36.3                   | 46.0           | 38.0          | 74.1                    | 781          | 39.7      |
+ | Phi-3-Vision               | 4.2            | 53.6           | 65.2        | 47.7       | **46.1**               | 44.6           | 39.0          | **78.4**                | 637          | 44.1      |
+ | MiniMonkey                 | 2.2            | 52.7           | 68.9        | 48.1       | 35.7                   | 45.3           | 30.9          | 73.7                    | **794**      | 39.8      |
+ | MiniCPM-V-2                | 2.8            | 47.9           | 65.8        | 39.1       | 38.2                   | 39.8           | 36.1          | 62.9                    | 605          | 41.0      |
+ | InternVL2-1B               | 0.8            | 48.3           | 59.7        | 45.6       | 36.7                   | 39.4           | 34.3          | 63.8                    | 755          | 31.5      |
+ | PaliGemma-3B-mix-448       | 2.9            | 46.5           | 65.6        | 48.3       | 34.9                   | 28.7           | 32.2          | 68.3                    | 614          | 33.1      |
+ | **H2OVL-Mississippi-0.8B** | 0.8            | 43.5           | 47.7        | 39.1       | 34.0                   | 39.0           | 29.6          | 53.6                    | 751          | 30.0      |
+ | DeepSeek-VL-1.3B           | 2.0            | 39.6           | 63.8        | 39.9       | 33.8                   | 29.8           | 27.6          | 51.5                    | 413          | 29.2      |
+
+
+
+ ## Quick Start
+
+ We provide an example code to run h2ovl-mississippi-2b using `transformers`.

  ### Install dependencies:
  ```bash
@@ -42,7 +70,7 @@ If you have ampere GPUs, install flash-attention to speed up inference:
  pip install flash_attn
  ```

- ### Sample demo:
+ ### Inference with Transformers:

  ```python
  import torch
@@ -86,23 +114,6 @@ print(f'User: {question}\nAssistant: {response}')

  ```

- ## Benchmarks
-
- ### Performance Comparison of Similar Sized Models Across Multiple Benchmarks - OpenVLM Leaderboard
-
- | **Models**                 | **Params (B)** | **Avg. Score** | **MMBench** | **MMStar** | **MMMU<sub>VAL</sub>** | **Math Vista** | **Hallusion** | **AI2D<sub>TEST</sub>** | **OCRBench** | **MMVet** |
- |----------------------------|----------------|----------------|-------------|------------|------------------------|----------------|---------------|-------------------------|--------------|-----------|
- | Qwen2-VL-2B                | 2.1            | **57.2**       | **72.2**    | 47.5       | 42.2                   | 47.8           | **42.4**      | 74.7                    | **797**      | **51.5**  |
- | **H2OVL-Mississippi-2B**   | 2.1            | 54.4           | 64.8        | 49.6       | 35.2                   | **56.8**       | 36.4          | 69.9                    | 782          | 44.7      |
- | InternVL2-2B               | 2.1            | 53.9           | 69.6        | **49.8**   | 36.3                   | 46.0           | 38.0          | 74.1                    | 781          | 39.7      |
- | Phi-3-Vision               | 4.2            | 53.6           | 65.2        | 47.7       | **46.1**               | 44.6           | 39.0          | **78.4**                | 637          | 44.1      |
- | MiniMonkey                 | 2.2            | 52.7           | 68.9        | 48.1       | 35.7                   | 45.3           | 30.9          | 73.7                    | **794**      | 39.8      |
- | MiniCPM-V-2                | 2.8            | 47.9           | 65.8        | 39.1       | 38.2                   | 39.8           | 36.1          | 62.9                    | 605          | 41.0      |
- | InternVL2-1B               | 0.8            | 48.3           | 59.7        | 45.6       | 36.7                   | 39.4           | 34.3          | 63.8                    | 755          | 31.5      |
- | PaliGemma-3B-mix-448       | 2.9            | 46.5           | 65.6        | 48.3       | 34.9                   | 28.7           | 32.2          | 68.3                    | 614          | 33.1      |
- | **H2OVL-Mississippi-0.8B** | 0.8            | 43.5           | 47.7        | 39.1       | 34.0                   | 39.0           | 29.6          | 53.6                    | 751          | 30.0      |
- | DeepSeek-VL-1.3B           | 2.0            | 39.6           | 63.8        | 39.9       | 33.8                   | 29.8           | 27.6          | 51.5                    | 413          | 29.2      |
-

  ## Prompt Engineering for JSON Extraction

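
For readers who only see this truncated diff, the sketch below illustrates the kind of `transformers`-based inference that the new "Inference with Transformers" section points to. It is a minimal sketch, not the README's actual snippet: the diff shows only the first (`import torch`) and last (`print(...)`) lines of that block, so the repo id `h2oai/h2ovl-mississippi-2b`, the demo image path, the generation settings, and the `model.chat(...)` helper (supplied by the model repository's remote code) are assumptions here.

```python
# Minimal sketch of the "Inference with Transformers" flow referenced in the diff.
# Assumptions (not visible in the truncated diff): the Hub repo id, the image
# path, the generation settings, and the model.chat(...) helper shipped by the
# model repo's remote code.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "h2oai/h2ovl-mississippi-2b"  # assumed repo id

# trust_remote_code=True loads the model's custom vision-language classes from the Hub
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

generation_config = dict(max_new_tokens=512, do_sample=False)

# The chat() call and its argument order are assumptions; the authoritative
# version is the full snippet in the updated README.
image_file = "./demo.jpg"
question = "<image>\nDescribe the image."
response, history = model.chat(tokenizer, image_file, question, generation_config,
                               history=None, return_history=True)

print(f'User: {question}\nAssistant: {response}')
```

The only line above that is confirmed by the diff is the final `print`; everything else is a hedged reconstruction of a typical remote-code vision-language workflow.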