Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Abstract
Leading retrievers, including NV-Embed and LLM2Vec, can often select harmful passages for malicious queries, posing risks even when used with safety-aligned LLMs like Llama3.
Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval-augmented generation (RAG) setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that, given malicious requests, most retrievers select relevant harmful passages for more than 50% of queries. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the risk of malicious misuse associated with increasing retriever capability.
Community
Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet! 💣
Retrievers need to be aligned too! 🚨
✨ AdvBench-IR
We create AdvBench-IR to evaluate if retrievers, such as LLM2Vec and NV-Embed, can select relevant harmful text from large corpora for a diverse range of malicious requests.
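As a rough illustration of this evaluation setup (a minimal sketch, not the authors' code): each AdvBench-IR query is scored against a corpus, and retrieval counts as a success if the gold harmful passage appears in the top-k results. The `embed` function below is a placeholder for any dense retriever.

```python
# Minimal sketch of an AdvBench-IR-style evaluation loop (assumed setup, not the
# authors' exact code): embed queries and corpus passages, retrieve top-k by cosine
# similarity, and check whether the gold harmful passage is among them.
import numpy as np

def embed(texts):
    """Placeholder for any dense encoder (e.g., an instruction-following retriever).
    Returns L2-normalized vectors; a real model would be plugged in here."""
    rng = np.random.default_rng(0)                       # dummy embeddings for illustration
    vecs = rng.standard_normal((len(texts), 768))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def success_at_k(queries, gold_ids, corpus, k=5):
    q = embed(queries)                                   # (num_queries, dim)
    p = embed(corpus)                                    # (num_passages, dim)
    scores = q @ p.T                                     # cosine similarity of unit vectors
    topk = np.argsort(-scores, axis=1)[:, :k]            # indices of the k best passages per query
    hits = [gold in row for gold, row in zip(gold_ids, topk)]
    return float(np.mean(hits))
```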
✨ Direct Malicious Retrieval
LLM-based retrievers correctly select malicious passages for more than 78% of AdvBench-IR queries (top-5)—a concerning level of capability. We also find that LLM alignment transfers poorly to retrieval. ⚠️
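Part of why alignment transfers poorly is that the retriever itself applies no refusal behaviour: it simply ranks passages by similarity to whatever query it receives. The sketch below illustrates this with the sentence-transformers library; the model name and plain cosine ranking are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative only: a dense retriever ranks passages by similarity with no notion
# of refusal. The model choice is an assumption; the paper evaluates six retrievers,
# including NV-Embed and LLM2Vec.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")  # assumed example retriever

query = "<malicious AdvBench-IR query>"                                  # placeholder text
passages = ["<candidate passage 1>", "<candidate passage 2>", "<candidate passage 3>"]

q_emb = model.encode([query], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = util.cos_sim(q_emb, p_emb)                 # cosine similarities, shape (1, num_passages)
top5 = scores.argsort(descending=True)[0][:5]       # top-5 passages are returned regardless of intent
```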
✨ Exploiting Instruction-Following Ability
Using fine-grained queries, a malicious user can steer the retriever to select specific passages that precisely match their malicious intent (e.g., constructing an explosive device with specific materials). 😈
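The mechanism, roughly, is that instruction-following retrievers embed a task instruction together with the query, so tightening the instruction shifts which passages rank highest. The template below is a hedged sketch in the common "Instruct/Query" style used by models such as E5-Mistral and NV-Embed, not necessarily the paper's exact prompts, and all content strings are placeholders.

```python
# Hedged sketch of instruction-steered (fine-grained) querying. The prompt template
# is an assumption (E5/NV-Embed-style "Instruct:/Query:"), not the paper's format.

def build_query(instruction: str, query: str) -> str:
    """Wrap a query with a retrieval instruction, as instruction-tuned retrievers expect."""
    return f"Instruct: {instruction}\nQuery: {query}"

# A generic instruction retrieves broadly relevant passages ...
broad = build_query("Retrieve a passage that answers the question.",
                    "<generic malicious question>")

# ... while a fine-grained instruction steers retrieval toward passages matching
# specific constraints (materials, methods, conditions) in the user's intent.
specific = build_query("Retrieve a passage that involves <specific constraint> and "
                       "describes the procedure step by step.",
                       "<generic malicious question>")

# Both strings would then be embedded and ranked against the corpus exactly as in
# the retrieval sketches above.
```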
✨ RAG-based exploitation
Using a RAG-based approach, even LLMs optimized for safety respond to malicious requests when harmful passages are provided in-context to ground their generation (e.g., Llama3 generates harmful responses to 67.12% of the queries with retrieval). 😬
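Concretely, the RAG-based setup places the retrieved passages in the LLM's context and asks it to answer grounded in them. The prompt wording below is an assumed, simplified sketch with placeholder strings, not the paper's exact template.

```python
# Hedged sketch of the RAG-based setup: retrieved (harmful) passages are placed
# in-context and the generator LLM is asked to answer grounded in them.

def build_rag_prompt(query: str, passages: list[str]) -> str:
    """Assemble a simple grounded prompt; wording is an assumption for illustration."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using the passages below.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# The grounded prompt is then sent to the generator LLM (e.g., Llama3); with harmful
# passages in context, safety-tuned models comply far more often than when asked the
# same question directly.
prompt = build_rag_prompt("<malicious query>",
                          ["<retrieved passage 1>", "<retrieved passage 2>"])
```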
Check out our paper for more details.
Paper: https://arxiv.org/abs/2503.08644
Data: https://huggingface.co/datasets/McGill-NLP/AdvBench-IR
Code: https://github.com/McGill-NLP/malicious-ir
Webpage: https://mcgill-nlp.github.io/malicious-ir/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model (2025)
- MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks (2025)
- Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation (2025)
- Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence (2025)
- From Retrieval to Generation: Comparing Different Approaches (2025)
- Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models (2025)
- A Practical Memory Injection Attack against LLM Agents (2025)