arxiv:2404.13208

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Published on Apr 19, 2024

· Submitted by

akhaliq on Apr 23, 2024

#3 Paper of the day

Authors:

Eric Wallace ,

,

,

Lilian Weng ,

,

Alex Beutel

Abstract

A hierarchical instruction framework for LLMs increases robustness against malicious prompts by selectively prioritizing system instructions over user inputs.

AI-generated summary

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.

View arXiv page View PDF Add to collection

Community

Apr 23, 2024

Every now and again a paper comes out and I just want to slap myself in the face for not thinking of it myself.

arivero

Apr 23, 2024

They should explicitly add a mark to the position encoder for each level.

Apr 23, 2024

If you want a simplified overview of the paper, I have a plain english rewrite of it here: https://www.aimodels.fyi/papers/arxiv/instruction-hierarchy-training-llms-to-prioritize-privileged

·

Apr 24, 2024

Was that AI generated? If so, amazing!

Bookmarked

Apr 24, 2024

This comment has been hidden

Apr 24, 2024

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Jun 9, 2024

Fortifying LLMs: A Deep Dive into Instruction Hierarchies for Enhanced Security

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2404.13208 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2404.13208 in a Space README.md to link it from this page.

Collections including this paper 16