DongfuJiang committed on
Commit
8b56eb5
1 Parent(s): 0ddead0

Update README.md

Files changed (1)
  1. README.md +39 -33
README.md CHANGED
@@ -3,55 +3,61 @@ tags:
  - generated_from_trainer
  base_model: llava-hf/llava-1.5-7b-hf
  model-index:
- - name: llava_1.5_7b_v2_4096
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # llava_1.5_7b_v2_4096

- This model is a fine-tuned version of [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) on an unknown dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 1
- - eval_batch_size: 1
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 16
- - total_train_batch_size: 128
- - total_eval_batch_size: 8
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.03
- - num_epochs: 1.0
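
For orientation, the hyperparameters removed above map roughly onto a standard `transformers.TrainingArguments` configuration, as in the sketch below. This is an illustrative reconstruction, not the authors' training script (which is not part of either card); the 8-GPU distributed setup comes from the launch command rather than from these arguments.

```python
# Illustrative sketch only: the old card's hyperparameters expressed as
# transformers.TrainingArguments. This is not the authors' training code.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llava_1.5_7b_v2_4096",
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # train_batch_size: 1
    per_device_eval_batch_size=1,    # eval_batch_size: 1
    gradient_accumulation_steps=16,  # 1 per device x 8 GPUs x 16 steps = 128 total
    num_train_epochs=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    seed=42,
    # The listed Adam betas=(0.9, 0.999) and epsilon=1e-08 are the Trainer defaults.
)
```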

- ### Training results

- ### Framework versions

- - Transformers 4.39.2
- - Pytorch 2.2.1
- - Datasets 2.17.1
- - Tokenizers 0.15.2

  - generated_from_trainer
  base_model: llava-hf/llava-1.5-7b-hf
  model-index:
+ - name: Mantis-llava-7b
  results: []
  ---

+ # Mantis: Interleaved Multi-Image Instruction Tuning

+ **Mantis** is a multimodal conversational AI model that can chat with users about images and text. It's optimized for multi-image reasoning, where interleaved text and images can be used to generate responses.

+ Mantis is trained on the newly curated dataset **Mantis-Instruct**, a large-scale multi-image QA dataset that covers various multi-image reasoning tasks.

+ | [Demo](https://huggingface.co/spaces/TIGER-Lab/Mantis) | [Blog](https://tiger-ai-lab.github.io/Blog/mantis) | [Github](https://github.com/TIGER-AI-Lab/Mantis) | [Models](https://huggingface.co/collections/TIGER-Lab/mantis-6619b0834594c878cdb1d6e4) |

+ ![Mantis](./overall_barchart.jpeg)

+ ## Inference

+ You can install the Mantis code from GitHub as a Python package:
+ ```bash
+ pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
+ ```
+ then run inference with the example script here: [examples/run_mantis.py](https://github.com/TIGER-AI-Lab/Mantis/blob/main/examples/run_mantis_hf.py)

+ Alternatively, you can run the model without the Mantis package, using pure Hugging Face Transformers. See [examples/run_mantis_hf.py](https://github.com/TIGER-AI-Lab/Mantis/blob/main/examples/run_mantis_hf.py) for details; a minimal sketch of that route is also included after the code example below.

+ ```python
+ from mantis.models.mllava import chat_mllava
+ from PIL import Image
+ import torch

+ image1 = "image1.jpg"
+ image2 = "image2.jpg"
+ images = [Image.open(image1), Image.open(image2)]

+ # load processor and model
+ from mantis.models.mllava import MLlavaProcessor, LlavaForConditionalGeneration
+ processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-bakllava-7b")
+ model = LlavaForConditionalGeneration.from_pretrained("TIGER-Lab/Mantis-bakllava-7b", device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")

+ # chat
+ text = "<image> <image> What's the difference between these two images? Please describe as much as you can."
+ response, history = chat_mllava(text, images, model, processor)

+ print("USER: ", text)
+ print("ASSISTANT: ", response)
+ # The image on the right has a larger number of wallets displayed compared to the image on the left. The wallets in the right image are arranged in a grid pattern, while the wallets in the left image are displayed in a more scattered manner. The wallets in the right image have various colors, including red, purple, and brown, while the wallets in the left image are primarily brown.

+ text = "How many items are there in image 1 and image 2 respectively?"
+ response, history = chat_mllava(text, images, model, processor, history=history)

+ print("USER: ", text)
+ print("ASSISTANT: ", response)
+ # There are two items in image 1 and four items in image 2.
+ ```
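
For the pure-transformers route mentioned above, a minimal sketch is given below. It assumes the checkpoint loads with the stock `AutoProcessor` and `LlavaForConditionalGeneration` classes and that a llava-1.5-style `USER:`/`ASSISTANT:` prompt with one `<image>` placeholder per image applies; the linked `run_mantis_hf.py` is the authoritative version, so prefer it where the two differ.

```python
# Sketch only (not the official example): plain Hugging Face Transformers,
# assuming the checkpoint works with the stock Llava classes.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "TIGER-Lab/Mantis-bakllava-7b"  # same checkpoint as the example above
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
# One <image> placeholder per image; prompt format assumed to follow llava-1.5.
prompt = "USER: <image> <image>\nWhat's the difference between these two images? ASSISTANT:"

inputs = processor(text=prompt, images=images, return_tensors="pt")
inputs = inputs.to(model.device, torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# The decoded string contains the prompt followed by the model's answer.
print(processor.decode(output[0], skip_special_tokens=True))
```

Multi-turn chat along this route would require rebuilding the prompt with the running conversation yourself; the `chat_mllava` helper in the example above handles that bookkeeping for you.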

+ ## Training
+ Training code will be released soon.