arxiv:2606.12716

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

Published on Jun 10

Authors:

Abstract

AI peer-review systems face significant vulnerabilities to cross-modal attacks targeting both textual and visual elements of scientific papers, necessitating specialized benchmarks and defensive mechanisms.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.

View arXiv page View PDF Add to collection

Community

rellabear

about 23 hours ago

Large language models (LLMs) and multimodal LLMs (MLLMs) are being rapidly integrated into scientific peer review, with major conferences such as AAAI, ICML, and NeurIPS now piloting AI-generated reviews. Yet this creates a new attack surface: an adversary can manipulate a submission to corrupt the AI reviewer's judgment. This threat is distinct from standard jailbreaking as the goal is not a general safety violation but a domain-specific, targeted failure such as inflating a paper's score. Existing AI reviewer robustness studies mainly focus on text, while the figures, tables, and charts modalities also carry core scientific evidence to human review process. To address this gap, we introduce PaperGuard, the first comprehensive benchmark to systematically attack and defend multimodal AI peer review: a multimodal peer-review dataset of 1,136 papers spanning AI/ML and broader scientific domains; a unified attack suite combining black-box prompt injection with white-box gradient-based perturbations on both text (GCG) and figures (PGD); and a lightweight, practical defense based on chunk-based embedding search that localizes hidden malicious instructions within long documents. Our experiments reveal pervasive vulnerability: black-box prompt injection achieves up to 80% attack success rate against powerful proprietary models, and stronger models are often more susceptible due to their superior instruction-following. Imperceptible perturbations to a single figure inflate review scores by up to +14.11 points, confirming the insufficiency of text-only safeguards, and these visual attacks transfer across model architectures and scales without gradient access to the target. Our defense approach reaches 95.0% detection accuracy, and detects all real-world hidden prompt injections found in the wild.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2606.12716

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.12716 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.12716 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.