AI4VR committed
Commit f2c582f • 1 Parent(s): acc59ed

Update README.md

Files changed (1): README.md (+106, −3)

README.md CHANGED

---
license: cc-by-4.0
---

# Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions

📖 [**Paper**](https://arxiv.org/abs/2402.11530) | 🏠 [**Code**](https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark) | 📖 [**Data**](https://huggingface.co/datasets/BAAI/Multimodal-Robustness-Benchmark)

## Overview

MMR provides a comprehensive suite for evaluating the understanding capabilities of Multimodal Large Language Models (MLLMs) and their robustness in handling negative questions after correctly interpreting visual content. The MMR benchmark includes:

1. **Multimodal Robustness (MMR) Benchmark and Targeted Evaluation Metrics:**
   - Comprises 12 categories of paired positive and negative questions; an illustrative sketch of such a pair follows this list.
   - Each question is meticulously annotated by experts to ensure scientific validity and accuracy.

2. **Specially Designed Training Set:**
   - Contains paired positive and negative visual question-answer samples to enhance robustness.

3. **Combined Dataset and Models:**
   - The new dataset merges the proposed dataset with existing ones.
   - Trained models include Bunny-MMR-3B, Bunny-MMR-4B, and Bunny-MMR-8B.
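
To make the pairing concrete, below is a purely illustrative sketch of what a paired positive/negative entry and a simple pair-level robustness check could look like. The field names, example questions, and the scoring rule are assumptions made for illustration only; they are not the MMR benchmark's actual schema or its official evaluation metrics (see the paper and code repository for those).

```python
# Illustrative only: the schema and scoring rule below are assumptions,
# not the MMR benchmark's actual data format or official metrics.
sample_pair = {
    "image": "path/to/image.jpg",
    "positive_question": "Is the cat sitting on the sofa?",  # consistent with the image
    "negative_question": "Is the dog sitting on the sofa?",  # leading/misleading premise
    "positive_answer": "Yes",
    "negative_answer": "No",
}

def is_correct(prediction: str, reference: str) -> bool:
    # Naive string match for illustration; real evaluation is more careful.
    return prediction.strip().lower().startswith(reference.strip().lower())

def pair_robustness(results: list[dict]) -> float:
    # Count a pair as robust only if BOTH the positive and the negative
    # (leading) question are answered correctly.
    robust = sum(
        is_correct(r["positive_prediction"], r["positive_answer"])
        and is_correct(r["negative_prediction"], r["negative_answer"])
        for r in results
    )
    return robust / max(len(results), 1)
```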

In this repository, we provide Bunny-MMR-8B, which is built upon [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) and [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). More details about this model can be found on [GitHub](https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark).

## Key Features

- **Rigorous Testing:**
  - Extensive testing on leading MLLMs shows that while these models can correctly interpret visual content, they exhibit significant vulnerabilities when faced with leading questions.

- **Enhanced Robustness:**
  - The targeted training significantly improves the MLLMs' ability to handle negative questions effectively.

# Quickstart

Here is a code snippet showing how to use the model with transformers.

Before running the snippet, install the following dependencies:

```shell
pip install torch transformers accelerate pillow
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device; the float16 weights below are intended for GPU, so consider
# torch_dtype=torch.float32 if you stay on CPU
torch.set_default_device('cpu')  # or 'cuda'

# create model
model = AutoModelForCausalLM.from_pretrained(
    'AI4VR/Bunny-MMR-8B',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'AI4VR/Bunny-MMR-8B',
    trust_remote_code=True)

# text prompt (replace with your own question)
prompt = 'text prompt'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
# tokenize around the <image> placeholder: -200 marks where the image features
# are inserted, and [1:] drops the BOS token added to the second chunk
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0)

# image; sample images can be found in the images folder
image = Image.open('path/to/image')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
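
Building on the snippet above, one simple way to probe robustness to leading questions is to ask the model a paired positive and negative question about the same image and compare the answers. The helper below only wraps the prompt construction and generation steps already shown; the example questions are placeholders, not items from the MMR benchmark.

```python
# Reuses `model`, `tokenizer`, and `image_tensor` from the snippet above.
def ask(image_tensor, question, max_new_tokens=100):
    # same conversation template as in the quickstart snippet
    text = (
        "A chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the user's questions. "
        f"USER: <image>\n{question} ASSISTANT:"
    )
    chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    ids = torch.tensor(chunks[0] + [-200] + chunks[1][1:], dtype=torch.long).unsqueeze(0)
    out = model.generate(ids, images=image_tensor, max_new_tokens=max_new_tokens, use_cache=True)[0]
    return tokenizer.decode(out[ids.shape[1]:], skip_special_tokens=True).strip()

# Placeholder paired probe: a positive question and a leading counterpart.
print(ask(image_tensor, "What is the main object in the image?"))
print(ask(image_tensor, "Why is the image completely empty?"))  # leading premise
```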

## Citation
If you find this repository helpful, please cite the paper below.

```bibtex
@article{he2024bunny,
  title={Efficient Multimodal Learning from Data-centric Perspective},
  author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2402.11530},
  year={2024}
}
```

## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under [CC BY 4.0](./LICENSE).