Deep RL Course documentation




Reinforcement learning from human feedback (RLHF) is a methodology for integrating human data labels into an RL-based optimization process. It is motivated by the challenge of modeling human preferences.

For many questions, even if you could write down an equation for a single ideal answer, humans differ in their preferences.

Updating models based on measured human data is one way to try to alleviate these inherently human ML problems.
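To make "updating models based on measured data" concrete, here is a minimal sketch of one common way human labels can be turned into a learning signal: fitting a reward model to pairwise preferences. It assumes a Bradley-Terry style preference model and a linear reward over two made-up features. Neither assumption comes from this page, and the function names (`fit_reward_model`, `reward`) and the toy data are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Linear reward: a weighted sum of the response's features."""
    return sum(w * f for w, f in zip(weights, features))


def fit_reward_model(preferences, n_features, lr=0.5, epochs=200):
    """Fit linear reward weights from (preferred, rejected) feature pairs."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in preferences:
            # Assumed Bradley-Terry model:
            # P(preferred beats rejected) = sigmoid(r_preferred - r_rejected)
            p = sigmoid(reward(weights, preferred) - reward(weights, rejected))
            # Gradient ascent on the log-likelihood of the human label.
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return weights


# Hypothetical toy data: each response is described by two invented features,
# (helpfulness, verbosity). Labelers preferred the first item of each pair.
preferences = [
    ((0.9, 0.2), (0.3, 0.8)),
    ((0.7, 0.1), (0.4, 0.9)),
    ((0.8, 0.5), (0.2, 0.4)),
]
weights = fit_reward_model(preferences, n_features=2)
```

In a full RLHF pipeline, a learned reward like this would then score the outputs of a policy during RL optimization. Here the sketch stops at the reward model, which is the step where the human labels enter.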

Start Learning about RLHF

To start learning about RLHF:

  1. Read this introduction: Illustrating Reinforcement Learning from Human Feedback (RLHF).

  2. Watch the recording of the live session we held some weeks ago, where Nathan covered the basics of Reinforcement Learning from Human Feedback (RLHF) and how this technology is being used to enable state-of-the-art ML tools like ChatGPT. Most of the talk is an overview of the interconnected ML models. It covers the basics of Natural Language Processing and RL, and how RLHF is used on large language models. It concludes with open questions in RLHF.

  3. Read other blogs on this topic, such as Closed-API vs Open-source continues: RLHF, ChatGPT, data moats. Let us know if there are more you like!

Additional readings

Note: this list is copied from the Illustrating RLHF blog post above. Here is a list of the most prevalent papers on RLHF to date. The field was recently popularized with the emergence of DeepRL (around 2017) and has grown into a broader study of the applications of LLMs from many large technology companies. Here are some papers on RLHF that pre-date the LM focus:

And here is a snapshot of the growing set of papers that show RLHF’s performance for LMs:


This section was written by Nathan Lambert