---
pipeline_tag: text-generation
---


## OmniLMM 12B
**OmniLMM-12B** is the most capable version of the OmniLMM series. The model is built on [EVA02-5B](https://github.com/baaivision/EVA/tree/master/EVA-CLIP) and [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), connected with a perceiver resampler layer, and trained on multimodal data in a curriculum fashion (a minimal sketch of such a connector follows the feature list). The model has three notable features:

- 🔥 **Strong Performance.** 

  OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, and SEED-Bench). The model also **supports OCR** and possesses **rich multimodal world knowledge**.

- 🏆 **Trustworthy Behavior.** 

  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images (e.g., confidently describing objects that do not exist in the image). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using our recent [RLHF-V](https://rlhf-v.github.io/) technique) and is **ranked #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench).

- 🕹 **Real-time Multimodal Interaction.** 

  We combine OmniLMM-12B and GPT-3.5 into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.
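
For intuition, the block below is a minimal, self-contained sketch of a perceiver-resampler-style connector: a fixed set of learned query vectors cross-attends to the variable-length visual features and compresses them into a fixed number of tokens for the LLM. All dimensions, layer choices, and names here are illustrative assumptions, not OmniLMM-12B's actual implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Illustrative connector: learned queries cross-attend to vision features."""

    def __init__(self, vis_dim: int, llm_dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Fixed number of learned query tokens, regardless of input length
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.proj_in = nn.Linear(vis_dim, llm_dim)  # map vision width to LLM width
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_patches, vis_dim) from the vision encoder
        kv = self.proj_in(vis_feats)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)  # queries attend to all image patches
        return self.norm(out)          # (batch, num_queries, llm_dim) tokens for the LLM

# Example: compress 1024 patch features into 64 fixed tokens (sizes are made up)
resampler = PerceiverResampler(vis_dim=1792, llm_dim=4096)
tokens = resampler(torch.randn(2, 1024, 1792))  # -> torch.Size([2, 64, 4096])
```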


<table>
<thead>
  <tr>
    <th align="left">Model</th>
    <th>Size</th>
    <th>MME</th>
    <th nowrap="nowrap" >MMMU val</th>
    <th nowrap="nowrap" >MMHal-Bench</th>
    <th nowrap="nowrap" >SeedBench-I</th>
    <th nowrap="nowrap" >LLaVA Bench W</th>
    <th>MathVista</th>
    <th nowrap="nowrap">MMB dev (en)</th>
  </tr>
</thead>
<tbody align="center">
  <tr>
    <td align="left">GPT-4V †</td>
    <td>-</td>
    <td>1409</td>
    <td>56.8</td>
    <td>3.53 / 70.8</td>
    <td>71.6 </td>
    <td>93.1 </td>
    <td>47.8 </td>
    <td>75.1 </td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left">Qwen-VL-Plus †</td>
    <td>-</td>
    <td>1681</td>
    <td>45.2</td>
    <td>- </td>
    <td>65.7 </td>
    <td>73.7 </td>
    <td>36.0 </td>
    <td>66.2 </td>
  </tr>
  <tr>
    <td align="left">Yi-VL 6B</td>
    <td align="right">6.7B </td>
    <td>- </td>
    <td>39.1 </td>
    <td>- </td>
    <td>66.1 </td>
    <td>39.9 </td>
    <td>28.0 </td>
    <td>68.2 </td>
  </tr>
  <tr>
    <td align="left" >CogVLM</td>
    <td align="right">17.4B</td>
    <td>1438</td>
    <td>32.1 </td>
    <td>2.68 / 52.1 </td>
    <td>68.8 </td>
    <td>73.9 </td>
    <td>34.7 </td>
    <td>63.7 </td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
    <td align="right">9.6B</td>
    <td>1488</td>
    <td>35.9</td>
    <td>2.93 / 59.4</td>
    <td>64.8 </td>
    <td>67.7 </td>
    <td>33.8 </td>
    <td>60.6 </td>
  </tr>
  <tr>
    <td align="left" >LLaVA 1.5</td>
    <td align="right">13.6B </td>
    <td>1531 </td>
    <td>36.4 </td>
    <td>2.71 / 51.0 </td>
    <td>68.1 </td>
    <td>64.6 </td>
    <td>26.4 </td>
    <td>68.2 </td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
    <td align="right">11.6B </td>
    <td>1637 </td>
    <td>40.7 </td>
    <td>3.45 / 68.8 </td>
    <td>71.1 </td>
    <td>72.0 </td>
    <td>34.9 </td>
    <td>71.6 </td>
  </tr>
</tbody>
</table>
<small>†: closed-source models</small>

## Demo
Click the link to try out the demo of [OmniLMM-12B](http://120.92.209.146:8081).

## Install

1. Clone this repository and navigate to the source folder

```bash
git clone https://github.com/OpenBMB/OmniLMM.git
cd OmniLMM
```

2. Create conda environment

```bash
conda create -n OmniLMM python=3.10 -y
conda activate OmniLMM
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

## Inference



### Multi-turn Conversation
Please refer to the following code to run `OmniLMM`.

<div align="center">
<img src="assets/COCO_test2015_000000262144.jpg" width="660px">
</div>

##### OmniLMM-12B
```python
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')

im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')

# First round chat 
msgs = [{"role": "user", "content": "What are the people doing?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)

# Second round chat 
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Describe the image"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)
```
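
The example above assumes the `chat` helper module from the repository. If you only need the image payload, `img2base64` presumably just base64-encodes the raw image bytes; a minimal stand-in under that assumption could be:

```python
import base64

def img2base64(path: str) -> str:
    # Hypothetical stand-in for the repository helper: read the image file
    # and return its raw bytes as a base64 string, the format the
    # OmniLMMChat example above passes in the "image" field.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```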

We can obtain the following results:
```
"The people in the image are playing baseball. One person is pitching a ball, another one is swinging a bat to hit it, and there's also an umpire present who appears to be watching the game closely."

"The image depicts a baseball game in progress. A pitcher is throwing the ball, while another player is swinging his bat to hit it. An umpire can be seen observing the play closely."
```