Jaward posted an update Mar 11
Some papers deserve a standing ovation after reading, and "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" is one such paper:

One major drawback of LLMs is the lack of precise control over their behavior, which makes it very difficult to align them with desired outcomes. The existing method to mitigate this involves gathering human preference labels over model generations and fine-tuning the unsupervised LLM to align with those preferences - this is known as Reinforcement Learning from Human Feedback (RLHF).

RLHF is an incredibly complex, often unstable, and computationally costly procedure. It involves first fitting a reward model that captures human preferences, then fine-tuning the language model with RL to maximize the estimated reward while keeping it close to the original (reference) model.
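As a rough sketch, the RL stage optimizes a KL-constrained reward objective of this form (writing r_φ for the learned reward, π_ref for the frozen reference model, and β for the KL penalty strength, following the paper's notation):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```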

This paper introduces a new algorithm called Direct Preference Optimization (DPO) that simplifies the whole process. In short, it directly optimizes the LM without explicit reward modeling or reinforcement learning. This is achieved by leveraging a mapping between reward functions and optimal policies, allowing the constrained reward maximization problem to be optimized exactly with a single stage of policy training.
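Concretely, the KL-constrained problem above has a closed-form optimal policy, which can be inverted to express the reward in terms of the policy; plugging that reparameterization into the Bradley-Terry preference model makes the partition function Z(x) cancel, leaving a simple maximum-likelihood loss over preference pairs (y_w preferred over y_l):

```latex
r(x, y) \;=\; \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
\;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```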

DPO’s genius lies in how its simple loss directly increases the relative log probability of preferred over dispreferred responses.
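Here is a minimal PyTorch sketch of that loss, assuming you already have per-response summed log-probabilities from the trainable policy and the frozen reference model (the function and variable names are illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss sketch.

    Each argument is a tensor of per-example summed log-probabilities
    log pi(y|x) for the preferred (chosen) or dispreferred (rejected)
    response, under the trainable policy or the frozen reference model.
    """
    # Log-ratios log(pi_theta / pi_ref) for preferred and dispreferred responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # -log sigmoid(beta * (chosen - rejected)): minimizing this pushes the
    # relative log probability of the preferred response up and that of the
    # dispreferred response down, scaled by beta.
    loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
    return loss.mean()

# Toy usage with made-up log-probabilities
policy_chosen = torch.tensor([-12.3, -8.1])
policy_rejected = torch.tensor([-11.9, -9.4])
ref_chosen = torch.tensor([-12.5, -8.0])
ref_rejected = torch.tensor([-11.5, -9.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```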

The amazing thing about this paper is how self-contained it is - from clearly stating the problem to explicitly explaining the underlying theory, backed with mathematical proofs. It’s just genius.

In my opinion, every academic research paper should follow this approach. It won a 2023 NeurIPS Outstanding Paper Award (category: Outstanding Main Track Runner-Ups).