---
license: mit
---
# 🔥 SPHINX: A Mixer of Tasks, Domains, and Embeddings

Official implementation of ['SPHINX: A Mixer of Tasks, Domains, and Embeddings Advances Multi-modal Large Language Models'](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX).

Try out our [web demo 🚀](http://imagebind-llm.opengvlab.com/) here!

<p align="left">
   Github link: <a href="https://huggingface.co/Alpha-VLLM/SPHINX" target="_blank">Github</a> • 👋 join our <a href="https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/docs/wechat.md" target="_blank">WeChat</a>
</p>


## Introduction

We present SPHINX, a versatile multi-modal large language model (MLLM) with a mixer of training tasks, data domains, and visual embeddings. 

- **Task Mix.** For all-purpose capabilities, we mix a variety of vision-language tasks for mutual improvement: VQA, REC, REG, OCR, DET, POSE, REL DET, T2I, etc.

- **Embedding Mix.** We capture robust visual representations by fusing visual embeddings from distinct architectures, pre-training paradigms, and information granularities.

- **Domain Mix.** For data from real-world and synthetic domains, we mix the weights of two domain-specific models for complementarity (see the illustrative sketch below).

<p align="left">                                                                                                                                          
  <img src="figs/pipeline1.png"/ width="100%"> <br>
</p>
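To make the domain-mixing idea concrete, the sketch below interpolates two domain-specific checkpoints by weighted parameter averaging; the checkpoint file names and the 0.5 ratio are illustrative assumptions, not SPHINX's exact recipe.

```python
# Hedged sketch of domain mixing via weighted parameter averaging.
# The checkpoint file names and the mixing ratio are illustrative; the
# actual SPHINX recipe is described in the paper.
import torch

def mix_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two state dicts that share the same keys."""
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

sd_real = torch.load("real_domain_model.pth", map_location="cpu")
sd_synth = torch.load("synthetic_domain_model.pth", map_location="cpu")

mixed = mix_state_dicts(sd_real, sd_synth, alpha=0.5)
torch.save(mixed, "mixed_domain_model.pth")
```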

On top of SPHINX, we further propose to mix visual scales and sub-images to better capture fine-grained semantics in high-resolution images.
<p align="left">                                                                                                                                          
  <img src="figs/pipeline2.png"/ width="100%"> <br>
</p>


### Installation
SPHINX is built upon LLaMA2-Accessory. Please follow the instructions [here](https://llama2-accessory.readthedocs.io/en/latest/install.html) for environment setup.
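After installation, a quick sanity check (a minimal sketch, assuming the PyTorch dependency installed during the LLaMA2-Accessory setup and a CUDA-capable GPU) can confirm that the environment is ready:

```python
# Quick environment sanity check; assumes PyTorch was installed during the
# LLaMA2-Accessory setup linked above.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```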



## Inference
This section provides a step-by-step guide for hosting a local SPHINX demo. If you are already familiar with the LLaMA2-Accessory toolkit, note that hosting a SPHINX demo follows the same pipeline as hosting demos for the other models supported by LLaMA2-Accessory.


### Weights
We provide the beta-version checkpoints on [HuggingFace🤗](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/SPHINX). Please download them to your own machine. The file structure should appear as follows:
```
ckpt_path/
├── consolidated.00-of-02.model.pth
└── consolidated.01-of-02.model.pth
```
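If you prefer to script the download, the sketch below uses `snapshot_download` from `huggingface_hub`; the repository id and folder pattern mirror the link above and are assumptions that may need adjusting if the checkpoint layout changes.

```python
# Hedged sketch: fetch the beta SPHINX checkpoints with huggingface_hub.
# The repo_id and allow_patterns follow the HuggingFace link above; the files
# land under finetune/mm/SPHINX/ inside local_dir, so move or symlink them to
# match the ckpt_path/ layout shown above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Alpha-VLLM/LLaMA2-Accessory",
    allow_patterns="finetune/mm/SPHINX/*",
    local_dir="ckpt_path",
)
```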

### Host Local Demo
Please follow the instructions [here](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX#host-local-demo) to host the demo locally and interact with the model.
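For scripted single-turn inference as an alternative to the hosted demo, a rough sketch is given below; the `SPHINXModel` class name, the `from_pretrained`/`generate_response` signatures, and the example image path are assumptions based on the SPHINX package in LLaMA2-Accessory and should be verified against the linked instructions.

```python
# Hedged sketch of programmatic inference. The SPHINXModel class and the
# from_pretrained / generate_response signatures are assumptions; consult the
# linked LLaMA2-Accessory SPHINX instructions for the authoritative API.
from PIL import Image
from SPHINX import SPHINXModel  # provided by the LLaMA2-Accessory repository

model = SPHINXModel.from_pretrained(
    pretrained_path="ckpt_path",  # directory holding the consolidated shards
    with_visual=True,
)

image = Image.open("example.jpg")               # hypothetical example image
qas = [["What is shown in this image?", None]]  # single-turn question/answer pair

response = model.generate_response(
    qas, image, max_gen_len=512, temperature=0.9, top_p=0.5, seed=0
)
print(response)
```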



## Result

We provide a comprehensive evaluation of SPHINX and showcase results across multiple benchmarks. 

Our evaluation encompasses both **quantitative metrics** and **qualitative assessments**, providing a holistic understanding of our VLM's performance.

**Evaluation Prompt Design**
<p align="left">                                                                                                                                          
  <img src="figs/table1.png"/ width="100%"> <br>
</p>

* In evaluation, we prioritize aligning with each benchmark's desired output format.
* We employ distinct prompts tailored to benchmarks that require long answers, short answers, and multiple-choice responses (an illustrative sketch follows this list).
* For visual grounding tasks, we directly reuse the prompts employed during training to strengthen the model's performance on these particular challenges.
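To illustrate the per-format prompting described above, a minimal sketch follows; the template wording is hypothetical and not copied from the paper.

```python
# Hypothetical prompt templates for the three answer formats mentioned above;
# the exact wording used in SPHINX's evaluation may differ.
from typing import Optional

PROMPT_TEMPLATES = {
    "long_answer": "{question}\nAnswer the question in detail.",
    "short_answer": "{question}\nAnswer the question using a single word or phrase.",
    "multiple_choice": "{question}\nOptions:\n{options}\nAnswer with the option's letter directly.",
}

def build_prompt(kind: str, question: str, options: Optional[str] = None) -> str:
    """Format a benchmark question with the template for its answer format."""
    return PROMPT_TEMPLATES[kind].format(question=question, options=options or "")

# Example usage
print(build_prompt("short_answer", "What color is the bus?"))
```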

**Benchmarks on Multimodal Large Language Models**
<p align="left">                                                                                                                                          
  <img src="figs/table2.png"/ width="100%"> <br>
</p

* We test our model on recently proposed, VQA-based MLLM benchmarks that comprehensively evaluate the model's capabilities, including MME, SEED-Bench, POPE, LLaVA-Bench (In-the-Wild), MM-Vet, MathVista, MMBench, and CCBench.
* Long-SPHINX achieves new state-of-the-art results on 5 out of 9 benchmarks.

**Visual Question Answering**
<p align="left">                                                                                                                                          
  <img src="figs/table3.png"/ width="100%"> <br>
</p>

* We evaluate on general VQA benchmarks such as VQAv2, OK-VQA, GQA, VizWiz, ScienceQA, Visual Spatial Reasoning (VSR), and IconQA.
* Additionally, we conduct experiments on text-oriented VQA benchmarks such as TextVQA and OCR-VQA.
* Long-SPHINX achieves competitive results across all benchmarks. We observe that Long-SPHINX outperforms SPHINX on VQA datasets that demand fine-grained visual information, showcasing the effectiveness of our visual mixing approach for attaining high resolution without relying on a visual encoder trained specifically on high-resolution images.

**Visual Grounding**
<p align="left">                                                                                                                                          
  <img src="figs/table4.png"/ width="100%"> <br>
</p>

* Table 4 reports the results of SPHINX and baseline models on REC benchmarks.
* SPHINX exhibits robust performance in visual grounding tasks such as RefCOCO, RefCOCO+, and RefCOCOg, **surpassing other vision-language generalist models**.
* Notably, SPHINX outperforms the specialist model G-DINO-L by **more than 1.54%** in accuracy across all tasks within RefCOCO/RefCOCO+/RefCOCOg.


## Frequently Asked Questions (FAQ)

❓ Encountering issues or have further questions? Find answers to common inquiries [here](https://llama2-accessory.readthedocs.io/en/latest/faq.html). We're here to assist you!

## License

Llama 2 is licensed under the [LLAMA 2 Community License](LICENSE_llama2), Copyright (c) Meta Platforms, Inc. All Rights Reserved.