AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
Abstract
LLM agents have demonstrated remarkable performance across various applications, largely owing to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically employ a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, this reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red-teaming approach, AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we formulate trigger generation as a constrained optimization problem that maps triggered instances to a unique embedding space, ensuring that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. Meanwhile, benign instructions without the trigger maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: a RAG-based autonomous driving agent, a knowledge-intensive QA agent, and a healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate above 80% with minimal impact on benign performance (below 1%) at a poison rate of less than 0.1%.
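To make the retrieval-poisoning mechanism concrete, below is a minimal, self-contained sketch of the idea behind the constrained trigger optimization. The toy `embed` function, the candidate token vocabulary, and the greedy search are illustrative stand-ins under simplified assumptions, not the paper's actual retriever encoder or optimization procedure:

```python
# Sketch: pick trigger tokens so that triggered queries embed into a compact
# cluster that is well separated from benign queries, which is what makes the
# poisoned demonstrations the nearest neighbors under RAG retrieval.
# Everything below (embedder, queries, vocabulary) is a toy stand-in.
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy deterministic bag-of-words embedder (stand-in for a dense retriever encoder)."""
    v = np.zeros(DIM)
    for tok in text.split():
        r = np.random.default_rng(abs(hash(tok)) % (2**32))
        v += r.normal(size=DIM)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

benign_queries = [f"drive forward and stop at the light, case {i}" for i in range(50)]
benign_embs = np.vstack([embed(q) for q in benign_queries])

# Hypothetical candidate vocabulary for trigger tokens.
candidate_tokens = ["alpha", "zenith", "quartz", "nimbus", "vortex", "ember"]

def score(trigger_tokens):
    """Higher is better: triggered queries should form a compact cluster far from benign ones."""
    trig = " ".join(trigger_tokens)
    trig_embs = np.vstack([embed(q + " " + trig) for q in benign_queries])
    center = trig_embs.mean(axis=0)
    compactness = -np.mean(np.linalg.norm(trig_embs - center, axis=1))  # uniqueness/compactness term
    separation = -float(np.max(benign_embs @ center))                   # stay away from benign embeddings
    return compactness + separation

# Greedy search over a length-2 trigger (a simplified stand-in for the
# constrained optimization against the real retriever).
trigger = []
for _ in range(2):
    best = max(candidate_tokens, key=lambda t: score(trigger + [t]))
    trigger.append(best)

print("optimized trigger:", " ".join(trigger))
```

In the paper, an analogous objective is optimized against the actual retriever's embeddings, so that queries carrying the trigger retrieve the injected malicious demonstrations while trigger-free queries behave normally.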
Community
AgentPoison is the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, AgentPoison incorporates a constrained trigger optimization algorithm that maps triggered instances into a unique and compact embedding space, simultaneously achieving a high attack success rate (ASR) when the trigger is present and high benign utility when it is not.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: a RAG-based autonomous driving agent, a knowledge-intensive QA agent, and a healthcare EHRAgent.
Specifically,
- On each agent, AgentPoison achieves an average attack success rate of ≥ 80% with minimal impact on benign performance (≤ 1%) at a poison rate of < 0.1%!
- Even when we inject only a single poisoning instance with a single-token trigger, AgentPoison achieves a high ASR (≥ 60%)!
- AgentPoison achieves high attack transferability across different RAG retrievers and high resilience against various perturbations and defenses!
Our project is fully open-sourced! For more details, please refer to:
Project page: https://billchan226.github.io/AgentPoison.html
Code repo: https://github.com/BillChan226/AgentPoison
Dataset: https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K
Hi @Zhaorun, congrats on this work!
I see the model repository is currently empty; are you planning to upload the weights?
Here's how to do that: https://huggingface.co/docs/hub/models-uploading
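For instance, a minimal sketch with `huggingface_hub` (the repo id and local folder path below are placeholders, not your actual repo):

```python
# Upload a local checkpoint folder to a model repo on the Hub.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="your-username/AgentPoison", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./checkpoints",            # local directory containing the weights
    repo_id="your-username/AgentPoison",
    repo_type="model",
)
```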
Also, I see the data is currently available on Google Drive. Would you be interested in making it available on the Hub so that people can load it in 2 lines of code?
Here are some useful guides for that:
- loading and pushing a dataset to the hub: https://huggingface.co/docs/datasets/loading
- linking the dataset to the paper page: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper
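Roughly something like this (the file name and repo id here are placeholders):

```python
# Push the dataset to the Hub once, then anyone can load it in two lines.
from datasets import load_dataset

ds = load_dataset("json", data_files="agentpoison_data.json")   # load the local files
ds.push_to_hub("your-username/AgentPoison-dataset")             # upload to the Hub

# Afterwards:
# from datasets import load_dataset
# ds = load_dataset("your-username/AgentPoison-dataset")
```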
Let me know if you need any help!
Cheers,
Niels
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Phantom: General Trigger Attacks on Retrieval Augmented Language Generation (2024)
- Exploring Backdoor Attacks against Large Language Model-based Decision Making (2024)
- BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models (2024)
- "Glue pizza and eat rocks"-- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models (2024)
- Corpus Poisoning via Approximate Greedy Gradient Descent (2024)