---
pipeline_tag: text-generation
---

## OmniLMM 12B

**OmniLMM-12B** is the most capable version of OmniLMM. The model is built on [EVA02-5B](https://github.com/baaivision/EVA/tree/master/EVA-CLIP) and [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), connected with a perceiver resampler layer, and trained on multimodal data in a curriculum fashion. The model has three notable features:

- 🔥 **Strong Performance.**

  OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, and SEED-Bench). The model also **supports OCR** and is endowed with **rich multimodal world knowledge**.

- 🏆 **Trustworthy Behavior.**

  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images (e.g., confidently describing objects that do not exist in the image). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using our recent [RLHF-V](https://rlhf-v.github.io/) technique) and is **ranked #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench).

- 🕹 **Real-time Multimodal Interaction.**

  We combine OmniLMM-12B and GPT-3.5 into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing** (a rough sketch of such a loop is given right after this list).

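The interactive assistant itself is not shipped in this repository. As a rough, non-authoritative sketch, the loop below shows one way such an assistant could be wired together, assuming the `OmniLMMChat` / `img2base64` helpers from the Inference section below and off-the-shelf `opencv-python`, `SpeechRecognition`, and `pyttsx3` packages for camera and speech I/O; the actual demo additionally involves GPT-3.5 and may differ substantially.

```python
# Illustrative sketch only: a camera + microphone assistant loop.
# Assumes the OmniLMMChat / img2base64 helpers shown in the Inference section,
# plus opencv-python, SpeechRecognition, and pyttsx3 for media I/O.
import json

import cv2
import pyttsx3
import speech_recognition as sr

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')
recognizer = sr.Recognizer()
tts = pyttsx3.init()
camera = cv2.VideoCapture(0)

try:
    while True:
        # 1. Listen for a spoken question.
        with sr.Microphone() as source:
            recognizer.adjust_for_ambient_noise(source)
            audio = recognizer.listen(source)
        question = recognizer.recognize_google(audio)

        # 2. Grab the current camera frame and encode it.
        ok, frame = camera.read()
        if not ok:
            break
        cv2.imwrite('frame.jpg', frame)
        im_64 = img2base64('frame.jpg')

        # 3. Ask the model about the current frame.
        msgs = [{"role": "user", "content": question}]
        answer = chat_model.process({"image": im_64, "question": json.dumps(msgs)})

        # 4. Speak the answer back to the user.
        tts.say(answer)
        tts.runAndWait()
finally:
    camera.release()
```
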
<table>
  <thead>
    <tr>
      <th align="left">Model</th>
      <th>Size</th>
      <th>MME</th>
      <th nowrap="nowrap">MMMU val</th>
      <th nowrap="nowrap">MMHal-Bench</th>
      <th nowrap="nowrap">SeedBench-I</th>
      <th nowrap="nowrap">LLaVA Bench W</th>
      <th>MathVista</th>
      <th nowrap="nowrap">MMB dev (en)</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr>
      <td align="left">GPT-4V †</td>
      <td>-</td>
      <td>1409</td>
      <td>56.8</td>
      <td>3.53 / 70.8</td>
      <td>71.6</td>
      <td>93.1</td>
      <td>47.8</td>
      <td>75.1</td>
    </tr>
    <tr>
      <td nowrap="nowrap" align="left">Qwen-VL-Plus †</td>
      <td>-</td>
      <td>1681</td>
      <td>45.2</td>
      <td>-</td>
      <td>65.7</td>
      <td>73.7</td>
      <td>36.0</td>
      <td>66.2</td>
    </tr>
    <tr>
      <td align="left">Yi-VL 6B</td>
      <td align="right">6.7B</td>
      <td>-</td>
      <td>39.1</td>
      <td>-</td>
      <td>66.1</td>
      <td>39.9</td>
      <td>28.0</td>
      <td>68.2</td>
    </tr>
    <tr>
      <td align="left">CogVLM</td>
      <td align="right">17.4B</td>
      <td>1438</td>
      <td>32.1</td>
      <td>2.68 / 52.1</td>
      <td>68.8</td>
      <td>73.9</td>
      <td>34.7</td>
      <td>63.7</td>
    </tr>
    <tr>
      <td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
      <td align="right">9.6B</td>
      <td>1488</td>
      <td>35.9</td>
      <td>2.93 / 59.4</td>
      <td>64.8</td>
      <td>67.7</td>
      <td>33.8</td>
      <td>60.6</td>
    </tr>
    <tr>
      <td align="left">LLaVA 1.5</td>
      <td align="right">13.6B</td>
      <td>1531</td>
      <td>36.4</td>
      <td>2.71 / 51.0</td>
      <td>68.1</td>
      <td>64.6</td>
      <td>26.4</td>
      <td>68.2</td>
    </tr>
    <tr>
      <td nowrap="nowrap" align="left"><b>OmniLMM-12B</b></td>
      <td align="right">11.6B</td>
      <td>1637</td>
      <td>40.7</td>
      <td>3.45 / 68.8</td>
      <td>71.1</td>
      <td>72.0</td>
      <td>34.9</td>
      <td>71.6</td>
    </tr>
  </tbody>
</table>

<small>†: closed-source models</small>

## Demo

Click here to try out the demo of [OmniLMM-12B](http://120.92.209.146:8081).

## Install

1. Clone this repository and navigate to the source folder

```bash
git clone https://github.com/OpenBMB/OmniLMM.git
cd OmniLMM
```

2. Create a conda environment

```bash
conda create -n OmniLMM python=3.10 -y
conda activate OmniLMM
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

## Inference

### Multi-turn Conversation

Please refer to the following code to run `OmniLMM`.

<div align="center">
<img src="assets/COCO_test2015_000000262144.jpg" width="660px">
</div>

##### OmniLMM-12B

```python
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')

im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')

# First round of chat
msgs = [{"role": "user", "content": "What are the people doing?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)

# Second round of chat:
# pass the history context of the multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Describe the image"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)
```

We can obtain the following results:

```
"The people in the image are playing baseball. One person is pitching a ball, another one is swinging a bat to hit it, and there's also an umpire present who appears to be watching the game closely."

"The image depicts a baseball game in progress. A pitcher is throwing the ball, while another player is swinging his bat to hit it. An umpire can be seen observing the play closely."
```
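
For longer conversations, the history bookkeeping shown above can be wrapped into a small helper. The sketch below is one possible way to do this on top of the `OmniLMMChat` interface; the `OmniLMMSession` class is purely illustrative and is not part of this repository.

```python
import json

from chat import OmniLMMChat, img2base64


class OmniLMMSession:
    """Hypothetical convenience wrapper that keeps the multi-turn history."""

    def __init__(self, model_path='openbmb/OmniLMM-12B'):
        self.chat_model = OmniLMMChat(model_path)
        self.msgs = []

    def ask(self, image_b64, question):
        # Append the new user turn, query the model, and record the answer
        # so the next call automatically carries the full conversation history.
        self.msgs.append({"role": "user", "content": question})
        inputs = {"image": image_b64, "question": json.dumps(self.msgs)}
        answer = self.chat_model.process(inputs)
        self.msgs.append({"role": "assistant", "content": answer})
        return answer


session = OmniLMMSession()
im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')
print(session.ask(im_64, "What are the people doing?"))
print(session.ask(im_64, "Describe the image"))
```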