Self-Rewarding Language Models

Published on Jan 18
· Featured in Daily Papers on Jan 19


We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which can be bottlenecked by human performance level; moreover, these separate, frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve on both axes.
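The self-rewarding loop described above can be sketched in a few lines. This is a minimal, illustrative sketch of the data flow only: the paper actually uses Llama 2 70B, a scoring prompt (Fig. 2), and DPO training, while here a toy stand-in model and a toy scoring heuristic take their place, so every function name below is an assumption rather than the paper's code.

```python
# Sketch of one Self-Rewarding iteration: generate candidates, self-score
# them LLM-as-a-Judge style, and keep best/worst pairs for DPO training.
# Toy stand-ins only; not the paper's implementation.

def generate_candidates(model, prompt, n=4):
    """Sample n candidate responses from the current model (stubbed)."""
    return [model(prompt, seed=i) for i in range(n)]

def judge_score(model, prompt, response):
    """LLM-as-a-Judge: the same model scores its own response 0-5.

    In the paper this is done with a scoring prompt; here, a toy
    heuristic (longer response = higher score) stands in.
    """
    return min(5, len(response.split()))

def build_preference_pairs(model, prompts):
    """For each prompt, keep the best- and worst-scored responses as a DPO pair."""
    pairs = []
    for p in prompts:
        cands = sorted(generate_candidates(model, p),
                       key=lambda r: judge_score(model, p, r))
        # Only form a pair if the judge actually distinguishes the responses.
        if judge_score(model, p, cands[-1]) > judge_score(model, p, cands[0]):
            pairs.append({"prompt": p, "chosen": cands[-1], "rejected": cands[0]})
    return pairs

# Toy "model": returns responses of varying length depending on the seed.
toy_model = lambda prompt, seed=0: " ".join(["word"] * (seed + 1))
pairs = build_preference_pairs(toy_model, ["Explain DPO."])
# Each pair would feed a DPO update; iterating gives M0 -> M1 -> M2 -> M3.
```

In the real pipeline, the resulting preference pairs are exactly the (prompt, chosen, rejected) triples a DPO trainer consumes, and the updated model is reused as both generator and judge in the next iteration.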


Very nice paper! Do you plan to share the code to the community? Best, Yannick

@yleo of course not, this is a bait and switch -- probably timed around some conference deadline

Paper author

Hi! Thanks! This was my prev answer on Twitter about model+code release:
“Well, as you know releasing models got way harder for corps in the current landscape, and we're a small team in FAIR Labs [+ NYU] with limited resources (e.g., not part of Llama team). For code, we've also had some approvals + other issues .. but hopeful to get there soon”.

Nice paper, I was just waiting for cool news like this. 🤢


Let the model reward itself. What could go wrong?

One of the cooler papers I've seen. Nice work @weizhey and others!

Amazing work!!!
I have some naive concerns and I am wondering if you could help me. If the initial SFT-trained model still needs alignment (for example, it has some ethical issues), will that kind of problem be solved by the iterative training? My concern is that the model acting as a Judge cannot penalize such output, because there is no human feedback involved in the entire training process, nor a discriminator trained on human feedback.

Paper author
edited Jan 24

The reward model is implemented via LLM-as-a-Judge with a chosen prompt (see Fig. 2 in the paper), so it would actually be very easy to incorporate safety rewards, e.g. for ethics issues: simply rewrite the Fig. 2 prompt to include the wording you want, aimed at making it safe. Of course, that's not to discount that humans should be in the loop somewhere in the optimal system.
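To make the suggestion concrete, here is one way such a judge prompt could look with a safety criterion folded into the rubric. The wording below is hypothetical and only in the spirit of the paper's Fig. 2 additive scoring prompt, not the actual prompt text.

```python
# Hypothetical judge prompt with a safety criterion added to the rubric.
# Illustrative wording only -- not the paper's Fig. 2 prompt verbatim.
JUDGE_PROMPT = """Review the user's question and the response below.
Award points additively (0-5 total):
- Add 1 point if the response is relevant to the question.
- Add 1 point if it addresses a substantial portion of the question.
- Add 1 point if it answers the question in a useful way.
- Add 1 point if it is clearly written from an AI Assistant's perspective.
- Add 1 point if the response is safe: no harmful, unethical,
  or deceptive content.

Question: {question}
Response: {response}

Score (0-5):"""

filled = JUDGE_PROMPT.format(
    question="How do magnets work?",
    response="Magnetic fields arise from moving electric charges.",
)
```

The model's numeric answer to the filled prompt then serves as the reward, so any criterion written into the rubric (safety included) directly shapes which responses are preferred during training.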

Paper author
edited Jan 24

@davegoldblatt Let humans reward the model. What could go wrong? 😂

(by which I mean, these are challenging problems requiring fundamental research .. whichever way you go..)

Thanks for your reply. Another idea: would the self-reward and iterative training method work for a multimodal LLM (a unified model based on an LLM that can understand and generate text and images via a (de)tokenizer)? Current MLLMs are mainly based on SFT and their generation capability is not so good, so I wonder if this could be an approach to improve MLLMs.

This is amazing, will try it tonight ;-)

I have been through the code, seems to work well.

Do you have a notebook with a real dataset (no mocks) and a model from Hugging Face?

I guess your x-transformers config is for Llama 70B… Is it compatible with any model?

With this (and other automatically improving / self-correcting LLM approaches), has the MAD paper been debunked?

