---
pipeline_tag: text-generation
---

## OmniLMM 12B
**OmniLMM-12B** is the most capable version of the OmniLMM series. The model is built on [EVA02-5B](https://github.com/baaivision/EVA/tree/master/EVA-CLIP) and [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), connected by a perceiver resampler layer (sketched after the feature list below), and trained on multimodal data in a curriculum fashion. The model has three notable features:

- 🔥 **Strong Performance.**

  OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.). The model also **supports OCR** and is endowed with **rich multimodal world knowledge**.

- 🏆 **Trustworthy Behavior.**

  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in the image (e.g., confidently describing objects that do not exist in the image). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using our recent [RLHF-V](https://rlhf-v.github.io/) technique), and it **ranks #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench).

- 🕹 **Real-time Multimodal Interaction.**

  We combine OmniLMM-12B and GPT-3.5 into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.

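The perceiver resampler mentioned above compresses the variable-length patch features produced by the vision encoder into a fixed number of learned query tokens via cross-attention; these tokens are then fed to the language model as visual context. The following is a minimal, illustrative PyTorch sketch of this idea only, not the actual OmniLMM implementation; the dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Illustrative sketch of a perceiver resampler: a fixed set of learned
    queries cross-attends to the image patch features, so the language model
    always receives the same number of visual tokens. Dimensions and names
    are assumptions, not the actual OmniLMM implementation."""

    def __init__(self, vision_dim=1792, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.proj_in = nn.Linear(vision_dim, llm_dim)   # map vision width to LLM width
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, vision_feats):                    # (B, N_patches, vision_dim)
        kv = self.proj_in(vision_feats)                 # (B, N_patches, llm_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)                   # queries attend to patch features
        return self.norm(out)                           # (B, num_queries, llm_dim)

# Example: 1025 patch features are compressed to 64 visual tokens for the LLM
tokens = PerceiverResampler()(torch.randn(1, 1025, 1792))
print(tokens.shape)  # torch.Size([1, 64, 4096])
```
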
<table>
<thead>
  <tr>
    <th align="left">Model</th>
    <th>Size</th>
    <th>MME</th>
    <th nowrap="nowrap">MMMU val</th>
    <th nowrap="nowrap">MMHal-Bench</th>
    <th nowrap="nowrap">SeedBench-I</th>
    <th nowrap="nowrap">LLaVA Bench W</th>
    <th>MathVista</th>
    <th nowrap="nowrap">MMB dev (en)</th>
  </tr>
</thead>
<tbody align="center">
  <tr>
    <td align="left">GPT-4V †</td>
    <td>-</td>
    <td>1409</td>
    <td>56.8</td>
    <td>3.53 / 70.8</td>
    <td>71.6</td>
    <td>93.1</td>
    <td>47.8</td>
    <td>75.1</td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left">Qwen-VL-Plus †</td>
    <td>-</td>
    <td>1681</td>
    <td>45.2</td>
    <td>-</td>
    <td>65.7</td>
    <td>73.7</td>
    <td>36.0</td>
    <td>66.2</td>
  </tr>
  <tr>
    <td align="left">Yi-VL 6B</td>
    <td align="right">6.7B</td>
    <td>-</td>
    <td>39.1</td>
    <td>-</td>
    <td>66.1</td>
    <td>39.9</td>
    <td>28.0</td>
    <td>68.2</td>
  </tr>
  <tr>
    <td align="left">CogVLM</td>
    <td align="right">17.4B</td>
    <td>1438</td>
    <td>32.1</td>
    <td>2.68 / 52.1</td>
    <td>68.8</td>
    <td>73.9</td>
    <td>34.7</td>
    <td>63.7</td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
    <td align="right">9.6B</td>
    <td>1488</td>
    <td>35.9</td>
    <td>2.93 / 59.4</td>
    <td>64.8</td>
    <td>67.7</td>
    <td>33.8</td>
    <td>60.6</td>
  </tr>
  <tr>
    <td align="left">LLaVA 1.5</td>
    <td align="right">13.6B</td>
    <td>1531</td>
    <td>36.4</td>
    <td>2.71 / 51.0</td>
    <td>68.1</td>
    <td>64.6</td>
    <td>26.4</td>
    <td>68.2</td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left"><b>OmniLMM-12B</b></td>
    <td align="right">11.6B</td>
    <td>1637</td>
    <td>40.7</td>
    <td>3.45 / 68.8</td>
    <td>71.1</td>
    <td>72.0</td>
    <td>34.9</td>
    <td>71.6</td>
  </tr>
</tbody>
</table>
<small>†: closed-source models</small>

## Demo
Try out the demo of [OmniLMM-12B](http://120.92.209.146:8081).

## Install

1. Clone this repository and navigate to the source folder

```bash
git clone https://github.com/OpenBMB/OmniLMM.git
cd OmniLMM
```

2. Create a conda environment

```bash
conda create -n OmniLMM python=3.10 -y
conda activate OmniLMM
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

## Inference

### Multi-turn Conversation
Please refer to the following code to run `OmniLMM`.

<div align="center">
<img src="assets/COCO_test2015_000000262144.jpg" width="660px">
</div>

##### OmniLMM-12B
```python
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')

# Encode the input image as base64
im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')

# First round of chat
msgs = [{"role": "user", "content": "What are the people doing?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)

# Second round of chat:
# pass the history of the multi-turn conversation back to the model
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Describe the image"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)
```

We can obtain the following results:
```
"The people in the image are playing baseball. One person is pitching a ball, another one is swinging a bat to hit it, and there's also an umpire present who appears to be watching the game closely."

"The image depicts a baseball game in progress. A pitcher is throwing the ball, while another player is swinging his bat to hit it. An umpire can be seen observing the play closely."
```
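
For longer conversations, the history bookkeeping above can be wrapped in a small helper. The sketch below is not part of the repository; the `Conversation` class is hypothetical and simply repackages the `OmniLMMChat` calls and message format shown in the example above.

```python
import json

from chat import OmniLMMChat, img2base64


class Conversation:
    """Hypothetical helper that keeps multi-turn history for one image."""

    def __init__(self, model_path, image_path):
        self.chat_model = OmniLMMChat(model_path)
        self.im_64 = img2base64(image_path)
        self.msgs = []

    def ask(self, question):
        # Append the new user turn, query the model, and record its answer
        # so the next call sees the full conversation history.
        self.msgs.append({"role": "user", "content": question})
        inputs = {"image": self.im_64, "question": json.dumps(self.msgs)}
        answer = self.chat_model.process(inputs)
        self.msgs.append({"role": "assistant", "content": answer})
        return answer


conv = Conversation('openbmb/OmniLMM-12B', './data/COCO_test2015_000000262144.jpg')
print(conv.ask("What are the people doing?"))
print(conv.ask("Describe the image"))
```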