AI4VR committed
Commit f2c582f • 1 Parent(s): acc59ed

Update README.md

Files changed (1): README.md (+106, −3)

README.md CHANGED

---
license: cc-by-4.0
---

# Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions

📖 [**Paper**](https://arxiv.org/abs/2402.11530) | 🏠 [**Code**](https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark) | 📖 [**Data**](https://huggingface.co/datasets/BAAI/Multimodal-Robustness-Benchmark)

## Overview

MMR provides a comprehensive suite for evaluating the understanding capabilities of Multimodal Large Language Models (MLLMs) and their robustness in handling negative questions after correctly interpreting visual content. The MMR benchmark includes:

1. **Multimodal Robustness (MMR) Benchmark and Targeted Evaluation Metrics:**
   - Comprises 12 categories of paired positive and negative questions; an illustrative sketch of such a pair follows this list.
   - Each question is meticulously annotated by experts to ensure scientific validity and accuracy.

2. **Specially Designed Training Set:**
   - Contains paired positive and negative visual question-answer samples to enhance robustness.

3. **Combined Dataset and Models:**
   - The new dataset merges the proposed dataset with existing ones.
   - Trained models include Bunny-MMR-3B, Bunny-MMR-4B, and Bunny-MMR-8B.
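
To make the pairing concrete, below is a purely illustrative sketch of what a paired positive/negative entry and a simple pair-level robustness check could look like. The field names, example questions, and the scoring rule are assumptions made for illustration only; they are not the MMR benchmark's actual schema or its official evaluation metrics (see the paper and code repository for those).

```python
# Illustrative only: the schema and scoring rule below are assumptions,
# not the MMR benchmark's actual data format or official metrics.
sample_pair = {
    "image": "path/to/image.jpg",
    "positive_question": "Is the cat sitting on the sofa?",  # consistent with the image
    "negative_question": "Is the dog sitting on the sofa?",  # leading/misleading premise
    "positive_answer": "Yes",
    "negative_answer": "No",
}

def is_correct(prediction: str, reference: str) -> bool:
    # Naive string match for illustration; real evaluation is more careful.
    return prediction.strip().lower().startswith(reference.strip().lower())

def pair_robustness(results: list[dict]) -> float:
    # Count a pair as robust only if BOTH the positive and the negative
    # (leading) question are answered correctly.
    robust = sum(
        is_correct(r["positive_prediction"], r["positive_answer"])
        and is_correct(r["negative_prediction"], r["negative_answer"])
        for r in results
    )
    return robust / max(len(results), 1)
```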

In this repository, we provide Bunny-MMR-8B, which is built upon [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) and [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). More details about this model can be found on [GitHub](https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark).

## Key Features

- **Rigorous Testing:**
  - Extensive testing on leading MLLMs shows that while these models can correctly interpret visual content, they exhibit significant vulnerabilities when faced with leading questions.

- **Enhanced Robustness:**
  - The targeted training significantly improves the MLLMs' ability to handle negative questions effectively.

# Quickstart

Here is a code snippet showing how to use the model with transformers.

Before running the snippet, install the following dependencies:

```shell
pip install torch transformers accelerate pillow
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device; the float16 weights below are intended for GPU, so consider
# torch_dtype=torch.float32 if you stay on CPU
torch.set_default_device('cpu')  # or 'cuda'

# create model
model = AutoModelForCausalLM.from_pretrained(
    'AI4VR/Bunny-MMR-8B',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'AI4VR/Bunny-MMR-8B',
    trust_remote_code=True)

# text prompt (replace with your own question)
prompt = 'text prompt'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
# tokenize around the <image> placeholder: -200 marks where the image features
# are inserted, and [1:] drops the BOS token added to the second chunk
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0)

# image; sample images can be found in the images folder
image = Image.open('path/to/image')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
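
Building on the snippet above, one simple way to probe robustness to leading questions is to ask the model a paired positive and negative question about the same image and compare the answers. The helper below only wraps the prompt construction and generation steps already shown; the example questions are placeholders, not items from the MMR benchmark.

```python
# Reuses `model`, `tokenizer`, and `image_tensor` from the snippet above.
def ask(image_tensor, question, max_new_tokens=100):
    # same conversation template as in the quickstart snippet
    text = (
        "A chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the user's questions. "
        f"USER: <image>\n{question} ASSISTANT:"
    )
    chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    ids = torch.tensor(chunks[0] + [-200] + chunks[1][1:], dtype=torch.long).unsqueeze(0)
    out = model.generate(ids, images=image_tensor, max_new_tokens=max_new_tokens, use_cache=True)[0]
    return tokenizer.decode(out[ids.shape[1]:], skip_special_tokens=True).strip()

# Placeholder paired probe: a positive question and a leading counterpart.
print(ask(image_tensor, "What is the main object in the image?"))
print(ask(image_tensor, "Why is the image completely empty?"))  # leading premise
```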

## Citation
If you find this repository helpful, please cite the paper below.

```bibtex
@article{he2024bunny,
  title={Efficient Multimodal Learning from Data-centric Perspective},
  author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2402.11530},
  year={2024}
}
```

## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under [CC BY 4.0](./LICENSE).