Mirage-in-the-Eyes: Hallucination Attack on Multi-modal Large Language Models with Only Attention Sink
NOTE: To prevent potential harm, we release our source code only upon request, and for research purposes only.
Overview
Fusing visual understanding into language generation, Multi-modal Large Language Models (MLLMs) are revolutionizing vision-language applications. Yet these models are often plagued by hallucination: generating objects, attributes, and relationships that do not match the visual content. In this work, we examine the internal attention mechanisms of MLLMs to reveal the underlying causes of hallucination and expose inherent vulnerabilities in the instruction-tuning process.
We propose a novel hallucination attack against MLLMs that exploits attention-sink behaviors to trigger hallucinated content with minimal image-text relevance, posing a significant threat to critical downstream applications. Unlike previous adversarial methods that rely on fixed patterns, our approach generates dynamic, effective, and highly transferable visual adversarial inputs without sacrificing the quality of model responses. Extensive experiments on six prominent MLLMs demonstrate the efficacy of our attack in compromising black-box MLLMs even under extensive defensive mechanisms, as well as promising results against cutting-edge commercial APIs such as GPT-4o and Gemini 1.5.
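To make the setting concrete, the sketch below shows the general shape of an attention-guided adversarial perturbation: a PGD-style update that keeps the image within a small L∞ budget while optimizing an attention-based objective. It is an illustration only, assuming a generic PyTorch MLLM interface; the `attention_sink_loss` objective, the `model(images=..., prompts=..., output_attentions=True)` call, and all hyperparameters are hypothetical placeholders, not the released implementation in `attack.py`.

```python
# Illustrative sketch only: a PGD-style image perturbation guided by an
# attention-based objective. The loss and the model interface below are
# hypothetical placeholders, not the released Mirage-in-the-Eyes attack.
import torch

def attention_sink_loss(attentions):
    # Hypothetical objective: encourage generation-time attention to pile up
    # on a "sink" position instead of the image tokens. (The actual objective
    # used by Mirage-in-the-Eyes may differ.)
    last_layer = attentions[-1]                  # [batch, heads, query, key]
    sink_mass = last_layer[..., :1].sum(dim=-1)  # attention mass on the first (sink) token
    return -sink_mass.mean()                     # minimizing this maximizes sink attention

def pgd_attack(model, image, prompt, eps=2 / 255, alpha=0.5 / 255, steps=50):
    """Perturb `image` within an L-infinity ball of radius `eps`."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        outputs = model(images=adv, prompts=prompt, output_attentions=True)  # assumed interface
        loss = attention_sink_loss(outputs.attentions)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()               # descend the loss
            adv = image + (adv - image).clamp(-eps, eps)  # project back into the eps-ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()
```

The `--eps 2` flag in the attack command below plays the role of this perturbation budget; how it is scaled (e.g., relative to 0-255 pixel values) is an implementation detail of the released code.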
Usage
We provide the code for the hallucination attack, MLLM response generation, and GPT-4-assisted evaluation in this repository. The main implementation of Mirage-in-the-Eyes is in `attack.py`.
The recommended usage is as follows:
- Prepare the environment:

  ```bash
  cd Mirage-in-the-Eyes
  conda create -n mllm python==3.9.20
  conda activate mllm
  pip install -r requirements.txt
  python -m pip install -e transformers-4.29.2
  ```

- Prepare the Hallubench dataset according to the official repository.

- Set up the model path configs in `Mirage-in-the-Eyes/minigpt4/configs` and `Mirage-in-the-Eyes/minigpt4/models`.

- Run our hallucination attack:

  ```bash
  CUDA_VISIBLE_DEVICES=GPU_ID python attack.py --model MODEL_NAME --gpu-id GPU_ID --data-path /path/to/hallubench --images-path /path/to/images --save-path /path/to/adv_images --generation-mode greedy --eps 2
  ```
- Generate MLLM responses with adversarial visual inputs:

  ```bash
  CUDA_VISIBLE_DEVICES=GPU_ID python generate.py --model MODEL_NAME --gpu-id GPU_ID --data-path /path/to/hallubench --images-path /path/to/images --response-path /path/to/response.json
  ```
- Evaluate MLLM responses for hallucinations (an illustrative sketch of the GPT-4-assisted judging step follows this list):

  ```bash
  cd eval
  python json_eval.py --json-file /path/to/response.json --bench-path /path/to/hallubench --log-path /path/to/log
  ```
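For reference, the sketch below illustrates what a GPT-4-assisted hallucination check can look like: a judge model compares each generated response against ground-truth annotations and flags unsupported content. It is a minimal example assuming the `openai>=1.0` Python client and a simple record schema with `annotations` and `response` fields; the actual prompt, scoring scheme, and file format used by `eval/json_eval.py` may differ.

```python
# Illustrative sketch of GPT-4-assisted hallucination judging; the prompt,
# scoring, and record schema are hypothetical and do not reproduce
# eval/json_eval.py.
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are shown ground-truth image annotations and a model response.\n"
    "List any objects, attributes, or relations in the response that are not\n"
    "supported by the annotations, then output the count on the last line.\n\n"
    "Annotations: {annotations}\n\nResponse: {response}"
)

def judge_response(annotations: str, response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(annotations=annotations, response=response),
        }],
        temperature=0,
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    # Responses produced by generate.py; the field names below are assumptions.
    with open("/path/to/response.json") as f:
        records = json.load(f)
    for rec in records[:3]:  # judge a few samples as a smoke test
        print(judge_response(rec.get("annotations", ""), rec.get("response", "")))
```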
Acknowledgement
This repo is based on the MLLM codebase of OPERA. We sincerely thank the contributors for their valuable work.