arxiv:2310.13548

Towards Understanding Sycophancy in Language Models

Published on Oct 20, 2023

Submitted by akhaliq on Oct 23, 2023

Authors: Meg Tong, Newton Cheng, Esin Durmus, Shauna Kravec, Nicholas Schiefer, Ethan Perez

Abstract

Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgements are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgements favoring sycophantic responses.


Community


NigelNeil

Feb 23

Hi, just a typo. The text above (currently) says "... prefer convincingly-written sycophantic responses over correct ones a negligible fraction of the time." The actual text of the abstract makes more sense: "... prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time."
