AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
Abstract
LLM agents have demonstrated remarkable performance across various applications, largely owing to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically employ a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, this reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red-teaming approach, AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we formulate trigger generation as a constrained optimization problem that maps triggered instances to a unique embedding space, ensuring that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. Meanwhile, benign instructions without the trigger maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: a RAG-based autonomous driving agent, a knowledge-intensive QA agent, and a healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate above 80% with minimal impact on benign performance (below 1%) at a poison rate of less than 0.1%.
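To make the retrieval-poisoning mechanism concrete, below is a minimal, self-contained sketch of the idea behind the constrained trigger optimization. The toy `embed` function, the candidate token vocabulary, and the greedy search are illustrative stand-ins under simplified assumptions, not the paper's actual retriever encoder or optimization procedure:

```python
# Sketch: pick trigger tokens so that triggered queries embed into a compact
# cluster that is well separated from benign queries, which is what makes the
# poisoned demonstrations the nearest neighbors under RAG retrieval.
# Everything below (embedder, queries, vocabulary) is a toy stand-in.
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy deterministic bag-of-words embedder (stand-in for a dense retriever encoder)."""
    v = np.zeros(DIM)
    for tok in text.split():
        r = np.random.default_rng(abs(hash(tok)) % (2**32))
        v += r.normal(size=DIM)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

benign_queries = [f"drive forward and stop at the light, case {i}" for i in range(50)]
benign_embs = np.vstack([embed(q) for q in benign_queries])

# Hypothetical candidate vocabulary for trigger tokens.
candidate_tokens = ["alpha", "zenith", "quartz", "nimbus", "vortex", "ember"]

def score(trigger_tokens):
    """Higher is better: triggered queries should form a compact cluster far from benign ones."""
    trig = " ".join(trigger_tokens)
    trig_embs = np.vstack([embed(q + " " + trig) for q in benign_queries])
    center = trig_embs.mean(axis=0)
    compactness = -np.mean(np.linalg.norm(trig_embs - center, axis=1))  # uniqueness/compactness term
    separation = -float(np.max(benign_embs @ center))                   # stay away from benign embeddings
    return compactness + separation

# Greedy search over a length-2 trigger (a simplified stand-in for the
# constrained optimization against the real retriever).
trigger = []
for _ in range(2):
    best = max(candidate_tokens, key=lambda t: score(trigger + [t]))
    trigger.append(best)

print("optimized trigger:", " ".join(trigger))
```

In the paper, an analogous objective is optimized against the actual retriever's embeddings, so that queries carrying the trigger retrieve the injected malicious demonstrations while trigger-free queries behave normally.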
Community
AgentPoison is the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, AgentPoison incorporates a constrained trigger optimization algorithm that maps triggered instances into a unique and compact embedding space, simultaneously achieving a high attack success rate (ASR) when the trigger is present and high benign utility when it is not.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: a RAG-based autonomous driving agent, a knowledge-intensive QA agent, and a healthcare EHRAgent.
Specifically,
- On each agent, AgentPoison achieves an average attack success rate of ≥ 80% with minimal impact on benign performance (≤ 1%) at a poison rate of < 0.1%!
- Even when we inject only a single poisoning instance with a single-token trigger, AgentPoison achieves a high ASR (≥ 60%)!
- AgentPoison achieves high attack transferability across different RAG retrievers and high resilience against various perturbations and defenses!
Our project is fully open-sourced! For more details, please refer to:
Project page: https://billchan226.github.io/AgentPoison.html
Code repo: https://github.com/BillChan226/AgentPoison
Dataset: https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K
Hi @Zhaorun, congrats on this work!
I see the model repository is currently empty; are you planning to upload the weights?
Here's how to do that: https://huggingface.co/docs/hub/models-uploading
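For instance, a minimal sketch with `huggingface_hub` (the repo id and local folder path below are placeholders, not your actual repo):

```python
# Upload a local checkpoint folder to a model repo on the Hub.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="your-username/AgentPoison", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./checkpoints",            # local directory containing the weights
    repo_id="your-username/AgentPoison",
    repo_type="model",
)
```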
Also, I see the data is currently available on Google Drive. Would you be interested in making it available on the Hub so that people can load it in 2 lines of code?
Here are some useful guides for that:
- loading and pushing a dataset to the hub: https://huggingface.co/docs/datasets/loading
- linking the dataset to the paper page: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper
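Roughly something like this (the file name and repo id here are placeholders):

```python
# Push the dataset to the Hub once, then anyone can load it in two lines.
from datasets import load_dataset

ds = load_dataset("json", data_files="agentpoison_data.json")   # load the local files
ds.push_to_hub("your-username/AgentPoison-dataset")             # upload to the Hub

# Afterwards:
# from datasets import load_dataset
# ds = load_dataset("your-username/AgentPoison-dataset")
```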
Let me know if you need any help!
Cheers,
Niels
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Phantom: General Trigger Attacks on Retrieval Augmented Language Generation (2024)
- Exploring Backdoor Attacks against Large Language Model-based Decision Making (2024)
- BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models (2024)
- "Glue pizza and eat rocks"-- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models (2024)
- Corpus Poisoning via Approximate Greedy Gradient Descent (2024)