alexshengzhili committed on
Commit
8e30910
1 Parent(s): 2d96d08

Create README.md

  1. README.md +56 -0
README.md ADDED
@@ -0,0 +1,56 @@
+ # Model Name: llava-v1.5-13b-dpo
+
+ **Paper (arXiv):** [https://arxiv.org/abs/2402.10884](https://arxiv.org/abs/2402.10884)
+
+ **Developers:** Shengzhi Li (TIFIN), Rongyu Lin (KAUST), Shichao Pei (University of Massachusetts Boston)
+ **Affiliations:** TIFIN, KAUST, University of Massachusetts Boston
+ **Contact Information:** alex.li@tifin.com, rongyu.lin@kaust.edu.sa, shichao.pei@umb.edu
+
+ ## Overview
+ The llava-v1.5-13b-dpo model is designed to enhance the instruction-following capabilities of multi-modal large language models (MLLMs), particularly in scenarios where visual instruction tuning might degrade language proficiency. The model applies Direct Preference Optimization (DPO) with a curated 6K-entry VQA preference dataset and improves on the LLaVA-1.5-13b baseline on MM-Vet, LLaVA-Bench, MT-Bench, and AlpacaEval (see Evaluation).
+
+ ## Intended Use
+ - **Primary Applications:** This model is intended for tasks requiring the integration of text and image modalities, including but not limited to visual question answering (VQA), image captioning, and multi-modal instruction following.
+ - **Target Audience:** Researchers and practitioners in the fields of natural language processing, computer vision, and multi-modal AI.
+
+ ## Training Data
+ The llava-v1.5-13b-dpo model was trained on a lightweight (6K-entry) VQA preference dataset in which each answer was annotated on five fine-grained quality metrics. The dataset was designed to address the diversity and complexity gap typically observed in VQA datasets.
+
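For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss on a single preference pair. It is not the training code behind this model: the function name, the `beta` value, and the example log-probabilities are illustrative assumptions. In practice each log-probability would be the sum over the answer tokens, conditioned on the image and the question.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) answer pair.

    All inputs are sequence log-probabilities; beta=0.1 is an assumed value.
    """
    # How much more the policy likes each answer than the frozen reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical numbers: the policy has shifted probability toward the chosen answer.
loss_untrained = dpo_loss(-15.0, -15.0, -15.0, -15.0)
loss_aligned = dpo_loss(-10.0, -20.0, -15.0, -15.0)
print(loss_untrained, loss_aligned)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen answer's likelihood relative to the rejected one.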
+ ## Evaluation
+ The model improves over the Vicuna-1.5-13b and LLaVA-1.5-13b baselines on several benchmarks:
+ - **MT-Bench:** Achieved a score of 6.73, surpassing Vicuna's 6.57 and LLaVA's 5.99.
+ - **Visual Instruction Performance:** Gained 4.9 points on MM-Vet (36.3 to 41.2) and 6.0 points on LLaVA-Bench (73.1 to 79.1) relative to LLaVA-1.5-13b.
+
+ | Model Name               | MM-Vet | LLaVA-Bench | POPE  | MMBench | MT-Bench | AlpacaEval |
+ |--------------------------|--------|-------------|-------|---------|----------|------------|
+ | **Vicuna-1.5-13b [16]**  | -      | -           | -     | -       | 6.57     | 81.4       |
+ | **LLaVA-1.5-13b [10]**   | 36.3   | 73.1        | 0.859 | 67.4    | 5.99     | 79.3       |
+ | **LLaVA-RLHF-13b [23]**  | 37.2   | 76.8        | 0.869 | 60.1    | 6.18     | 81.0       |
+ | **Standard SFT**         | 36.5   | 63.7        | 0.850 | 65.4    | 5.01     | 50.2       |
+ | **SteerLM**              | 35.2   | 67.0        | 0.878 | 65.1    | 5.70     | 68.8       |
+ | **Rejection-sampling**   | 38.0   | 70.6        | 0.883 | 67.6    | 6.22     | 74.9       |
+ | **llava-v1.5-13b-dpo**   | 41.2   | 79.1        | 0.870 | 66.8    | 6.73     | 86.4       |
+
+ *The last four rows (Standard SFT, SteerLM, Rejection-sampling, and DPO) are the alignment methods we applied; DPO performed best overall.
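
The headline gains quoted in this section can be re-derived from the table. The short sketch below copies the relevant scores from the LLaVA-1.5-13b and llava-v1.5-13b-dpo rows; the dictionary layout and the `deltas` helper are our own illustrative names, not part of any released code.

```python
# Scores copied from the evaluation table above.
scores = {
    "LLaVA-1.5-13b": {"MM-Vet": 36.3, "LLaVA-Bench": 73.1, "MT-Bench": 5.99},
    "llava-v1.5-13b-dpo": {"MM-Vet": 41.2, "LLaVA-Bench": 79.1, "MT-Bench": 6.73},
}


def deltas(base, new):
    """Absolute score difference (new minus base) for each benchmark."""
    return {name: round(new[name] - base[name], 2) for name in base}


gains = deltas(scores["LLaVA-1.5-13b"], scores["llava-v1.5-13b-dpo"])
for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}")
```

These are absolute differences in benchmark score, not relative percentage changes.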
+ ## Ethical Considerations
+ This model was developed with a focus on mitigating modality conflict and catastrophic forgetting in MLLMs. Users are encouraged to consider the potential biases and limitations inherent in the training data and model outputs, especially when deploying the model in diverse and sensitive contexts.
+
+ ## Limitations
+ - The model's training dataset, while addressing key gaps in VQA datasets, is relatively small at 6K entries. This may limit the model's generalizability across broader or more diverse multi-modal tasks.
+ - Performance enhancements, particularly in language instruction capabilities post-visual tuning, are based on the current scope of evaluated benchmarks and datasets. The model's efficacy may vary in different or more challenging contexts.
+
+ ## Acknowledgments
+ This work was made possible through the contributions of Shengzhi Li, Rongyu Lin, and Shichao Pei, and supported by their respective institutions.
+
+ ## Citation
+ Please cite this work as:
+
+ ```bibtex
+ @misc{li2024multimodal,
+   title={Multi-modal preference alignment remedies regression of visual instruction tuning on language model},
+   author={Shengzhi Li and Rongyu Lin and Shichao Pei},
+   year={2024},
+   eprint={2402.10884},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }