arxiv:2401.10020

Self-Rewarding Language Models

Published on Jan 18

Submitted by akhaliq on Jan 19

#1 Paper of the day

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Jason Weston

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.
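The abstract describes the model judging its own generations to produce training signal for Iterative DPO. As a rough illustration only, the data-construction step might be sketched as below. This is not the authors' code: the `Score: <number>` output format, the number of candidates, the best-versus-worst pairing rule, and the helpers `make_toy_generator` and `toy_judge` are all assumptions made for this sketch. In a real setup, `generate` and `judge` would both call the same language model.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumption: the judge ends its output with "Score: <number>".
SCORE_PATTERN = re.compile(r"Score:\s*(\d+(?:\.\d+)?)")


def parse_score(judge_output: str) -> Optional[float]:
    """Extract the numeric score from judge text, or None if it is missing."""
    match = SCORE_PATTERN.search(judge_output)
    return float(match.group(1)) if match else None


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], str],
    num_candidates: int = 4,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) for one prompt, or None if no usable pair."""
    scored: List[Tuple[float, str]] = []
    for _ in range(num_candidates):
        response = generate(prompt)
        score = parse_score(judge(prompt, response))
        if score is not None:
            scored.append((score, response))
    if len(scored) < 2:
        return None
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    if best[0] == worst[0]:
        return None  # all candidates tie, so there is no preference signal
    return best[1], worst[1]


# Hypothetical stand-ins for the language model, used only for the demo.
def make_toy_generator(responses: List[str]) -> Callable[[str], str]:
    iterator = iter(responses)
    return lambda prompt: next(iterator)


def toy_judge(prompt: str, response: str) -> str:
    # Toy rule: longer answers score higher, capped at 5.
    return f"Reasoning omitted. Score: {min(len(response.split()), 5)}"


if __name__ == "__main__":
    gen = make_toy_generator(["ok", "a longer and better answer", "fine then"])
    print(build_preference_pair("What is DPO?", gen, toy_judge, num_candidates=3))
```

The resulting (chosen, rejected) pairs would then feed a DPO training step, and the updated model would serve as both generator and judge in the next iteration.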

Community

yleo

Jan 22

Very nice paper! Do you plan to share the code to the community? Best, Yannick

bmorphism

Jan 22

@yleo of course not, this is bait and switch -- probably around some conference deadline

spermwhale

Paper author Jan 22

Hi! Thanks! This was my prev answer on Twitter about model+code release:
“Well, as you know releasing models got way harder for corps in the current landscape, and we're a small team in FAIR Labs [+ NYU] with limited resources (e.g., not part of Llama team). For code, we've also had some approvals + other issues .. but hopeful to get there soon”.

Imran1

Jan 22

Nice paper, I was just waiting for cool news like this. 🤢

davegoldblatt

Jan 23

Let the model reward itself. What could go wrong?

derek-thomas

Jan 23 (edited Jan 23)

One of the cooler papers I've seen. Nice work @weizhey and others!

YxxxB

Jan 23

Amazing work!!!
I have some naive concerns and am wondering if you could help me. Suppose the initial SFT-trained model still needs alignment (for example, it has some ethical issues). Will this kind of problem be solved by the iterative training? My concern is that the model acting as a judge cannot penalize such outputs, because the entire training process involves neither human feedback nor a discriminator trained on human feedback.
Thanks!!!

spermwhale

Paper author Jan 24 (edited Jan 24)

The reward model is implemented via LLM-as-a-Judge with a chosen prompt (see Fig. 2 in the paper), so it would actually be very easy to incorporate safety rewards, e.g. for ethics issues: simply rewrite the Fig. 2 prompt to include whatever wording you want in order to steer it toward safety. Of course, this is not to discount that humans should be in the loop somewhere in the optimal system.
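To make the suggestion above concrete, here is a minimal sketch of a judge prompt built from a list of criteria, with one extra safety criterion appended. The criteria wording, the one-point-per-criterion scheme, and the function name `build_judge_prompt` are hypothetical placeholders for illustration; they are not the text of the paper's Fig. 2 prompt.

```python
from typing import List

# Placeholder criteria; NOT the wording of the paper's Fig. 2 prompt.
BASE_CRITERIA: List[str] = [
    "The response is relevant to the user's instruction.",
    "The response is helpful and well organized.",
]

# Hypothetical extra criterion a practitioner might add for safety.
SAFETY_CRITERION = "The response avoids harmful or unethical content."


def build_judge_prompt(instruction: str, response: str, criteria: List[str]) -> str:
    """Assemble an additive-scoring judge prompt from the given criteria."""
    lines = [
        "Review the response to the user's instruction below.",
        "Award one point for each criterion that is satisfied:",
    ]
    lines += [f"- {criterion}" for criterion in criteria]
    lines += [
        "",
        f"Instruction: {instruction}",
        f"Response: {response}",
        "",
        f'Conclude with "Score: <points>", where points is between 0 and {len(criteria)}.',
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_judge_prompt(
        "How do I reset my password?",
        "Open Settings, choose Security, then select Reset password.",
        BASE_CRITERIA + [SAFETY_CRITERION],
    )
    print(prompt)
```

Because the criteria are plain data, changing what the judge rewards only requires editing the list. Whether the judge applies the added criterion reliably is a separate, empirical question.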

spermwhale

Paper author Jan 24 (edited Jan 24)

@davegoldblatt Let humans reward the model. What could go wrong? 😂

(by which I mean, these are challenging problems requiring fundamental research .. whichever way you go..)

YxxxB

Jan 25

Thanks for your reply. Another idea: would the self-rewarding and iterative training method work for a multimodal LLM (a unified LLM-based model that can understand and generate text and images via a (de)tokenizer)? Current MLLMs are mainly based on SFT, and their generation capability is not that good, so I wonder whether this approach could improve them.

yingjun128

Jan 27

https://github.com/lucidrains/self-rewarding-lm-pytorch

yleo

Jan 27

This is amazing, will try it tonight ;-)

yleo

Jan 27 (edited Jan 27)

I have been through the code; it seems to work well.

Do you have a notebook with a real dataset (no mock data) and a model from Hugging Face?

I guess your x-transformers config is Llama 70B… Is it compatible with any model?

dball

Mar 6

With this (and other automatically improving / self-correcting LLM approaches), has the MAD paper (https://huggingface.co/papers/2307.01850) been debunked?