The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Abstract
Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.
Community
Every now and again a paper comes out and I just want to slap myself in the face for not thinking of it myself.
They should explicitly add a mark to the position encoder for each level.
If you want a simplified overview of the paper, I have a plain english rewrite of it here: https://www.aimodels.fyi/papers/arxiv/instruction-hierarchy-training-llms-to-prioritize-privileged
Was that AI generated? If so, amazing!
Bookmarked
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Automatic and Universal Prompt Injection Attacks against Large Language Models (2024)
- Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? (2024)
- Defending LLMs against Jailbreaking Attacks via Backtranslation (2024)
- InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents (2024)
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Fortifying LLMs: A Deep Dive into Instruction Hierarchies for Enhanced Security
Links π:
π Subscribe: https://www.youtube.com/@Arxflix
π Twitter: https://x.com/arxflix
π LMNT (Partner): https://lmnt.com/
Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper