---
pipeline_tag: text-generation
---
## OmniLMM-12B
**OmniLMM-12B** is the most capable version of the OmniLMM series. The model is built upon [EVA02-5B](https://github.com/baaivision/EVA/tree/master/EVA-CLIP) and [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), connected with a perceiver resampler layer (a minimal illustrative sketch of this connection follows the feature list below), and trained on multimodal data in a curriculum fashion. The model has three notable features:
- 🔥 **Strong Performance.**
OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, and SEED-Bench). The model also **supports OCR** and possesses **rich multimodal world knowledge**.
- 🏆 **Trustworthy Behavior.**
LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images (e.g., confidently describing objects that do not exist in the image). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using our recent [RLHF-V](https://rlhf-v.github.io/) technique) and is **ranked #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench).
- 🕹 **Real-time Multimodal Interaction.**
We combine OmniLMM-12B and GPT-3.5 into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing** (an illustrative interaction loop is sketched after the benchmark table below).
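The connection mentioned above can be pictured as a perceiver resampler that compresses the variable-length patch features from the vision encoder into a fixed number of query tokens, which are then projected into the LLM's embedding space. The following is a minimal, hypothetical PyTorch sketch of this idea; the module name, feature dimensions, and number of query tokens are illustrative assumptions, not the actual OmniLMM-12B implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Illustrative sketch: compress N image patch features into a fixed set of
    query tokens via cross-attention, then project them into the LLM embedding
    space. All dimensions below are assumptions, not the real configuration."""
    def __init__(self, vis_dim=1792, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, image_feats):              # image_feats: (B, N_patches, vis_dim)
        batch = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(out)                    # (B, num_queries, llm_dim)

# The resulting tokens would be prepended to the text embeddings of the LLM.
resampler = PerceiverResampler()
vis_tokens = resampler(torch.randn(1, 256, 1792))
print(vis_tokens.shape)                          # torch.Size([1, 64, 4096])
```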
<table>
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>MME</th>
<th nowrap="nowrap" >MMMU val</th>
<th nowrap="nowrap" >MMHal-Bench</th>
<th nowrap="nowrap" >SeedBench-I</th>
<th nowrap="nowrap" >LLaVA Bench W</th>
<th>MathVista</th>
<th nowrap="nowrap">MMB dev (en)</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left">GPT-4V †</td>
<td>-</td>
<td>1409</td>
<td>56.8</td>
<td>3.53 / 70.8</td>
<td>71.6 </td>
<td>93.1 </td>
<td>47.8 </td>
<td>75.1 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Plus †</td>
<td>-</td>
<td>1681</td>
<td>45.2</td>
<td>- </td>
<td>65.7 </td>
<td>73.7 </td>
<td>36.0 </td>
<td>66.2 </td>
</tr>
<tr>
<td align="left">Yi-VL 6B</td>
<td align="right">6.7B </td>
<td>- </td>
<td>39.1 </td>
<td>- </td>
<td>66.1 </td>
<td>39.9 </td>
<td>28.0 </td>
<td>68.2 </td>
</tr>
<tr>
<td align="left" >CogVLM</td>
<td align="right">17.4B</td>
<td>1438</td>
<td>32.1 </td>
<td>2.68 / 52.1 </td>
<td>68.8 </td>
<td>73.9 </td>
<td>34.7 </td>
<td>63.7 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right">9.6B</td>
<td>1488</td>
<td>35.9</td>
<td>2.93 / 59.4</td>
<td>64.8 </td>
<td>67.7 </td>
<td>33.8 </td>
<td>60.6 </td>
</tr>
<tr>
<td align="left" >LLaVA 1.5</td>
<td align="right">13.6B </td>
<td>1531 </td>
<td>36.4 </td>
<td>2.71 / 51.0 </td>
<td>68.1 </td>
<td>64.6 </td>
<td>26.4 </td>
<td>68.2 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
<td align="right">11.6B </td>
<td>1637 </td>
<td>40.7 </td>
<td>3.45 / 68.8 </td>
<td>71.1 </td>
<td>72.0 </td>
<td>34.9 </td>
<td>71.6 </td>
</tr>
</tbody>
</table>
<small>†: closed-source models</small>
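As context for the real-time assistant mentioned above, the interaction can be thought of as a loop that pairs the latest camera frame with each transcribed utterance, queries the model, and speaks the answer. The sketch below is purely illustrative: `capture_frame`, `listen_for_speech`, and `speak` are hypothetical stand-ins for camera capture, speech recognition, and text-to-speech, and the GPT-3.5 orchestration used in the actual assistant is omitted.

```python
import base64
import json
import time

def run_assistant(chat_model, capture_frame, listen_for_speech, speak, idle_sleep=0.2):
    """Hypothetical real-time loop around the OmniLMMChat interface shown below.
    capture_frame() -> JPEG bytes, listen_for_speech() -> str or None, speak(str) -> None."""
    history = []
    while True:
        utterance = listen_for_speech()
        if not utterance:
            time.sleep(idle_sleep)
            continue
        frame_b64 = base64.b64encode(capture_frame()).decode("utf-8")
        history.append({"role": "user", "content": utterance})
        answer = chat_model.process({"image": frame_b64, "question": json.dumps(history)})
        history.append({"role": "assistant", "content": answer})
        speak(answer)
```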
## Demo
Try out the online demo of [OmniLMM-12B](http://120.92.209.146:8081).
## Install
1. Clone this repository and navigate to the source folder
```bash
git clone https://github.com/OpenBMB/OmniLMM.git
cd OmniLMM
```
2. Create conda environment
```bash
conda create -n OmniLMM python=3.10 -y
conda activate OmniLMM
```
3. Install dependencies
```bash
pip install -r requirements.txt
```
## Inference
### Multi-turn Conversation
Please refer to the following code to run `OmniLMM-12B`.
<div align="center">
<img src="assets/COCO_test2015_000000262144.jpg" width="660px">
</div>
##### OmniLMM-12B
```python
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')

# Encode the input image as base64
im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')

# First round chat
msgs = [{"role": "user", "content": "What are the people doing?"}]
inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)
# Second round chat
# Pass the history of the multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Describe the image"})
inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)
```
We can obtain the following results:
```
"The people in the image are playing baseball. One person is pitching a ball, another one is swinging a bat to hit it, and there's also an umpire present who appears to be watching the game closely."
"The image depicts a baseball game in progress. A pitcher is throwing the ball, while another player is swinging his bat to hit it. An umpire can be seen observing the play closely."
```
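For longer conversations, the same pattern can be wrapped in a small helper that keeps the message history for you. This is a convenience sketch built on top of the `OmniLMMChat.process` interface shown above; `ConversationSession` is not part of the repository.

```python
import json

class ConversationSession:
    """Convenience wrapper (not part of the repository) that stores the
    multi-turn history and re-sends it with every new question."""
    def __init__(self, chat_model, image_b64):
        self.chat_model = chat_model
        self.image_b64 = image_b64
        self.msgs = []

    def ask(self, question):
        self.msgs.append({"role": "user", "content": question})
        answer = self.chat_model.process(
            {"image": self.image_b64, "question": json.dumps(self.msgs)})
        self.msgs.append({"role": "assistant", "content": answer})
        return answer

# Usage with the objects created above:
# session = ConversationSession(chat_model, im_64)
# print(session.ask("What are the people doing?"))
# print(session.ask("Describe the image"))
```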