Papers
arxiv:2404.13208

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Published on Apr 19
· Featured in Daily Papers on Apr 23
Authors:
,
,
,

Abstract

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.

Community

Every now and again a paper comes out and I just want to slap myself in the face for not thinking of it myself.

They should explicitly add a mark to the position encoder for each level.

If you want a simplified overview of the paper, I have a plain english rewrite of it here: https://www.aimodels.fyi/papers/arxiv/instruction-hierarchy-training-llms-to-prioritize-privileged

·

Was that AI generated? If so, amazing!

Bookmarked

This comment has been hidden

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2404.13208 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2404.13208 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2404.13208 in a Space README.md to link it from this page.

Collections including this paper 13