### Introduction
Paper: [Paper](https://arxiv.org/abs/2502.18411)

GitHub: [Github](https://github.com/PhoenixZ810/OmniAlign-V)

Project Page: [Page](https://phoenixz810.github.io/OmniAlign-V/)

SFT Dataset: [OmniAlign-V](https://huggingface.co/datasets/PhoenixZ/OmniAlign-V)

DPO Dataset: [OmniAlign-V-DPO](https://huggingface.co/datasets/PhoenixZ/OmniAlign-V-DPO)

MM-AlignBench: [MM-AlignBench](https://github.com/open-compass/VLMEvalKit)

Checkpoints: [LLaVANext-OA-7B](https://huggingface.co/PhoenixZ/LLaVANext-OmniAlign-7B), [LLaVANext-OA-32B](https://huggingface.co/PhoenixZ/LLaVANext-OmniAlign-32B), [LLaVANext-OA-32B-DPO](https://huggingface.co/PhoenixZ/LLaVANext-OmniAlign-32B-DPO)

This is the official repo of LLaVANext-OmniAlign (OA)-7B, introduced in *OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference*.

LLaVANext-OmniAlign-7B follows the [LLaVA-Next](https://github.com/LLaVA-VL/LLaVA-NeXT) architecture and uses [InternLM2.5-7B-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) as its language model.

The model is fine-tuned on a mixture of the LLaVANext-778k SFT data and the OmniAlign-V dataset (denoted OmniAlign-V_mix in the table below), as sketched in the snippet that follows.
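Since both datasets use the LLaVA conversation JSON format, mixing them amounts to concatenating the annotation files. A minimal sketch, assuming hypothetical filenames `llavanext_778k.json` and `omnialign_v.json` (use the actual annotation files shipped with each dataset):

```python
import json
import random

# Hypothetical filenames; substitute the real annotation files.
SOURCES = ["llavanext_778k.json", "omnialign_v.json"]

mixed = []
for path in SOURCES:
    with open(path, "r", encoding="utf-8") as f:
        # Each file is a list of {"id", "image", "conversations"}
        # records in the LLaVA SFT format.
        mixed.extend(json.load(f))

random.seed(0)
random.shuffle(mixed)  # interleave the two sources

with open("omnialign_v_mix.json", "w", encoding="utf-8") as f:
    json.dump(mixed, f, ensure_ascii=False)
```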

### Performance
By integrating the OmniAlign-V dataset in the Supervised Fine-Tuning (SFT) stage, we not only significantly improve the alignment of MLLMs with human preference, but also enhance their performance on common downstream tasks, especially MMVet and MMMU.

| Model | Data       | LLM   | MM-AlignBench          | WildVision            | MIA-Bench       | MMVet          | MMMU           | MMBenchV1.1    | AI2D           | OCRBench       |
|----------------|---------------------|----------------|---------------------------------|--------------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
| LLaVA          | LLaVANext-778k      | InternLM2.5-7B | 3.6 / -82.1                     | 18.4 / -55.1                   | 75.4                     | 41.2                    | 42.6                    | 73.6                    | 74.1                    | 39.7                    |
| LLaVA          | OmniAlign-V_mix | InternLM2.5-7B | 50.0 / +3.8                     | 28.2 / -34.6                   | 85.4                     | 43.5                    | 43.3                    | 73.7                    | 74.7                    | 41.3                    |
| *Δ*            |                     |                | +46.4 / +85.9                   | +9.8 / +20.5                   | +10.0                    | +2.3                    | +0.7                    | +0.1                    | +0.6                    | +1.6                    |
| LLaVANext      | LLaVANext-778k      | InternLM2.5-7B | 20.6 / -42.7                    | 23.4 / -45.0                   | 76.9                     | 41.8                    | 44.1                    | 75.1                    | 74.7                    | 56.2                    |
| LLaVANext      | OmniAlign-V_mix | InternLM2.5-7B | 57.1 / +11.1                    | 29.6 / -31.3                   | 86.7                     | 47.7                    | 46.8                    | 74.9                    | 77.5                    | 58.9                    |
| *Δ*            |                     |                | +36.5 / +53.8                   | +6.2 / +13.7                   | +9.8                     | +5.9                    | +2.7                    | -0.2                    | +2.8                    | +2.7                    |
| LLaVANext      | LLaVANext-778k      | Qwen2.5-32B    | 26.6 / -29.0                    | 25.2 / -41.3                   | 86.0                     | 47.7                    | 55.2                    | 79.3                    | 79.6                    | 55.9                    |
| LLaVANext      | OmniAlign-V_mix | Qwen2.5-32B    | 62.3 / +19.4                    | 40.2 / -14.9                   | 89.6                     | 56.9                    | 60.7                    | 80.6                    | 81.7                    | 55.9                    |
| *Δ*            |                     |                | +35.7 / +48.4                   | +15.0 / +26.4                  | +3.6                     | +9.2                    | +5.5                    | +1.3                    | +2.1                    | +0.0                    |

For MM-AlignBench and WildVision, results are reported as Winning Rate / Reward; *Δ* rows give the gains over the LLaVANext-778k baseline.
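For intuition, here is a sketch of how such scores are typically derived from pairwise judge verdicts, assuming WildVision-style weights (+100 / +50 / 0 / -50 / -100 for much better / better / tie / worse / much worse); the benchmarks' own scoring scripts are authoritative:

```python
from collections import Counter

# Assumed WildVision-style reward weights; check each benchmark's
# scoring script for the exact scheme it uses.
WEIGHTS = {"much_better": 100, "better": 50, "tie": 0,
           "worse": -50, "much_worse": -100}

def reward(judgments: list[str]) -> float:
    """Mean weighted score over per-sample pairwise judgments."""
    counts = Counter(judgments)
    total = sum(WEIGHTS[label] * n for label, n in counts.items())
    return total / len(judgments)

def winning_rate(judgments: list[str]) -> float:
    """Percentage of samples judged better or much better."""
    wins = sum(1 for j in judgments if j in ("better", "much_better"))
    return 100 * wins / len(judgments)
```
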
### How to use
Please refer to our [Github](https://github.com/PhoenixZ810/OmniAlign-V) for more details about training and evaluation.
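Since the checkpoint follows the LLaVA-NeXT codebase, it can be loaded with that repository's model builder. A minimal loading sketch, assuming the `llava` package from [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) is installed (API names follow that codebase and may change between versions):

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "PhoenixZ/LLaVANext-OmniAlign-7B"

# Returns the tokenizer, the multimodal model, the image processor,
# and the maximum context length.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,  # model_base: None when loading a full (non-LoRA) checkpoint
    get_model_name_from_path(model_path),
)
```

Prompt construction and generation then follow the standard LLaVA-NeXT inference examples; MM-AlignBench evaluation is available through [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).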