File size: 3,132 Bytes
1f418ae
 
c3df38e
 
 
 
 
6e25aec
1f418ae
c3df38e
1f418ae
c3df38e
 
 
1f418ae
 
e685ba6
1f418ae
c3df38e
1f418ae
e685ba6
 
c3df38e
1f418ae
b623960
1f418ae
c3df38e
1f418ae
c3df38e
1f418ae
c3df38e
 
 
 
 
1f418ae
c3df38e
 
 
 
1f418ae
 
c3df38e
 
 
1f418ae
c3df38e
 
 
 
1f418ae
c3df38e
 
 
1f418ae
c3df38e
 
 
1f418ae
c3df38e
 
1f418ae
c3df38e
 
 
 
1f418ae
7039575
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
---
tags:
- Mantis
- VLM
- LMM
- Multimodal LLM
- bakllava
base_model: llava-hf/bakLlava-v1-hf
model-index:
- name: Mantis-bakllava-7b
  results: []
license: apache-2.0
language:
- en
---

# Mantis: Interleaved Multi-Image Instruction Tuning (Deprecated)

**Mantis** is a multimodal conversational AI model that can chat with users about images and text. It's optimized for multi-image reasoning, where interleaved text and images can be used to generate responses.

**Note that this is an older version of Mantis**, please refer to our newest version at [mantis-Siglip-llama3](https://huggingface.co/TIGER-Lab/Mantis-8B-siglip-llama3). The newer version improves significantly over both multi-image and single-image tasks.

Mantis is trained on the newly curated dataset **Mantis-Instruct**, a large-scale multi-image QA dataset that covers various multi-image reasoning tasks.

|[Demo](https://huggingface.co/spaces/TIGER-Lab/Mantis) | [Github](https://github.com/TIGER-AI-Lab/Mantis) |  [Models](https://huggingface.co/collections/TIGER-Lab/mantis-6619b0834594c878cdb1d6e4) |  

![Mantis](https://raw.githubusercontent.com/TIGER-AI-Lab/Mantis/main/docs/assets/images/overall_barchart.jpeg)

## Inference

You can install Mantis's GitHub codes as a Python package
```bash
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
```
then run inference with codes here: [examples/run_mantis.py](https://github.com/TIGER-AI-Lab/Mantis/blob/main/examples/run_mantis_hf.py)

```python
from mantis.models.mllava import chat_mllava
from PIL import Image
import torch


image1 = "image1.jpg"
image2 = "image2.jpg"
images = [Image.open(image1), Image.open(image2)]

# load processor and model
from mantis.models.mllava import MLlavaProcessor, LlavaForConditionalGeneration
processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-bakllava-7b")
model = LlavaForConditionalGeneration.from_pretrained("TIGER-Lab/Mantis-bakllava-7b", device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")

# chat
text = "<image> <image> What's the difference between these two images? Please describe as much as you can."
response, history = chat_mllava(text, images, model, processor)

print("USER: ", text)
print("ASSISTANT: ", response)
# The image on the right has a larger number of wallets displayed compared to the image on the left. The wallets in the right image are arranged in a grid pattern, while the wallets in the left image are displayed in a more scattered manner. The wallets in the right image have various colors, including red, purple, and brown, while the wallets in the left image are primarily brown.

text = "How many items are there in image 1 and image 2 respectively?"
response, history = chat_mllava(text, images, model, processor, history=history)

print("USER: ", text)
print("ASSISTANT: ", response)
# There are two items in image 1 and four items in image 2.
```

Or, you can run the model without relying on the mantis codes, using pure hugging face transformers. See [examples/run_mantis_hf.py](https://github.com/TIGER-AI-Lab/Mantis/blob/main/examples/run_mantis_hf.py) for details.