---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- zh
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen1.5-7B-Chat
- google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Wings
- Multimodal-LLM
- Text-only-Forgetting
- NeurIPS2024
---

<p align="left">
    &nbsp;English&nbsp; | &nbsp; <a href="README_CN.md">中文</a>
</p>
<br>
<br>
<div align="center">

<picture>
  <img alt="Wings logo" src="https://raw.githubusercontent.com/AIDC-AI/Wings/main/assets/images/logo.png" width="550px">
</picture>
<br>
<br>

</div>

<div id="top"></div>  

<div align="center">
  <h3 align="center">Wings: A Versatile Multimodal LLM without Text-only Forgetting</h3>
</div>
<p align="center">
📝 <a href="https://arxiv.org/abs/2406.03496" target="_blank">Paper</a> • 🤗 <a href="https://huggingface.co/AIDC-AI/Wings-Qwen1_5-8B" target="_blank">Hugging Face</a>
</p> 

<p align="center">
    🚀 Ask questions or discuss ideas on <a href="https://github.com/AIDC-AI/Wings/discussions" target="_blank"> GitHub </a>
</p>

⚠️ We apologize for any inconvenience. Currently, Wings can only be loaded through the [raw method](https://github.com/AIDC-AI/Wings/blob/main/wings/utils.py#L81), but we are working on improving this.

We have released Wings<sub>base</sub>-Qwen1_5-8B, a version trained with the LLaVA-v1.5 pretraining and fine-tuning data.

<details>
<summary><b>Table of Contents</b></summary>

- [Why Wings?](#why-wings)
- [How to use](#how-to-use)
  - [Quick start](#quick-start)
  - [Citation](#citation)
- [License](#license)
- [Disclaimer](#disclaimer)
</details>

<br>

## Why Wings?

💡 TL;DR

- Wings is a brand-new, general-purpose Multimodal Large Language Model (MLLM). Its flexible multimodal structure acts like a pair of **wings that boost the model's multimodal capabilities** while minimizing text-only forgetting.

- **Any** MLLM architecture can adopt the Wings components.

Multimodal large language models (MLLMs), initialized from a trained LLM, first align images with text and then fine-tune on mixed multimodal inputs. However, the resulting MLLM **catastrophically forgets text-only instructions**, i.e., instructions that contain no images and that the initial LLM could already handle.

In this work, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention on multimodal instructions reveals that **text-only forgetting is related to the attention shift from pre-image to post-image text.** Based on this observation, we construct extra modules that act as boosted learners to compensate for the attention shift. The complementary visual and textual learners, **like "wings" on either side, are connected in parallel within each layer's attention block.** Initially, image and text inputs are aligned with the visual learners operating alongside the main attention, balancing the focus on visual elements. The textual learners are then integrated via attention-based routing, which blends the outputs of the visual and textual learners. We design **Low-Rank Residual Attention (LoRRA)** to keep the learners highly efficient.
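The PyTorch sketch below illustrates the general idea: two low-rank cross-attention learners (visual and textual) run in parallel with a layer's main attention output, and a lightweight router blends their contributions. All module names, dimensions, and the routing signal are simplifying assumptions for illustration and are not the reference implementation; in particular, the paper routes on attention weights, while this sketch uses a plain linear gate over hidden states.

```python
import torch
import torch.nn as nn


class LoRRALearner(nn.Module):
    """Illustrative low-rank cross-attention learner: queries come from the
    hidden states, keys/values from modality features (visual or textual)."""

    def __init__(self, hidden_dim: int, rank: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.k_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.v_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.o_proj = nn.Linear(rank, hidden_dim, bias=False)
        self.scale = rank ** -0.5

    def forward(self, hidden: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(hidden)                         # (B, T, r)
        k = self.k_proj(features)                       # (B, S, r)
        v = self.v_proj(features)                       # (B, S, r)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.o_proj(attn @ v)                    # (B, T, d)


class WingsBlockSketch(nn.Module):
    """Adds router-weighted visual/textual learner outputs to the main
    attention output of a transformer layer (simplified)."""

    def __init__(self, hidden_dim: int, rank: int = 64):
        super().__init__()
        self.visual_learner = LoRRALearner(hidden_dim, rank)
        self.textual_learner = LoRRALearner(hidden_dim, rank)
        self.router = nn.Linear(hidden_dim, 2)          # assumed routing signal

    def forward(self, attn_out, hidden, visual_feats, text_feats):
        vis = self.visual_learner(hidden, visual_feats)
        txt = self.textual_learner(hidden, text_feats)
        w = torch.softmax(self.router(hidden), dim=-1)  # (B, T, 2)
        return attn_out + w[..., 0:1] * vis + w[..., 1:2] * txt
```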

Our experimental results demonstrate that Wings outperforms equally scaled MLLMs on both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings shows superior performance across question-answering tasks ranging from text-only-rich to multimodal-rich.

<div align="center">

<picture>
  <img alt="Wings logo" src="https://raw.githubusercontent.com/AIDC-AI/Wings/main/assets/images/bench_example.png" width="800px">
</picture>
</br>

</div>

## How to use

### Quick start (see the [GitHub repository](https://github.com/AIDC-AI/Wings) for details)

+ Environment setup:
  
  ```bash
  conda create --name your_env_name python=3.10
  conda activate your_env_name
  pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
  pip install -r requirements.txt
  ```

+ Training:
  
  ```bash
  bash run/pretrain_base.sh
  # Set the path to the pretrained MLLM, then fine-tune
  bash run/finetune_base.sh
  ```
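
+ Inference (illustrative sketch): the snippet below is only a hedged outline of the raw loading flow referenced above; `load_model_and_processor` and its arguments are hypothetical placeholders, not the actual API in `wings/utils.py`.
  
  ```python
  import torch
  from PIL import Image

  # Hypothetical helper standing in for the raw loader in wings/utils.py;
  # check the repository for the real entry point and signature.
  from wings.utils import load_model_and_processor  # placeholder name

  model, tokenizer, image_processor = load_model_and_processor(
      "AIDC-AI/Wings-Qwen1_5-8B",        # Hugging Face repository id
      torch_dtype=torch.bfloat16,
      device="cuda",
  )

  image = Image.open("example.jpg").convert("RGB")
  pixel_values = image_processor(image, return_tensors="pt")["pixel_values"].to(model.device)
  inputs = tokenizer("Describe the image.", return_tensors="pt").to(model.device)

  with torch.no_grad():
      output_ids = model.generate(**inputs, pixel_values=pixel_values, max_new_tokens=128)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
  ```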

### Citation

+ If you find Wings useful, please cite the paper:
  
  ```bibtex
  @article{zhang_wings,
    author       = {Yi{-}Kai Zhang and
                    Shiyin Lu and
                    Yang Li and
                    Yanqing Ma and
                    Qing{-}Guo Chen and
                    Zhao Xu and
                    Weihua Luo and
                    Kaifu Zhang and
                    De{-}Chuan Zhan and
                    Han{-}Jia Ye},
    title        = {Wings: Learning Multimodal LLMs without Text-only Forgetting},
    journal      = {CoRR},
    volume       = {abs/2406.03496},
    year         = {2024}
  }
  ```

## License

This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) (SPDX-License-Identifier: Apache-2.0).

## Disclaimer

We used compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model is compliant. Due to the complexity of the data and the diversity of language-model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will address the matter promptly.