arxiv:2407.14622

BOND: Aligning LLMs with Best-of-N Distillation

Published on Jul 19 · Submitted by piergs on Jul 23
Abstract

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling, which selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms, improving results on several benchmarks.
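
For intuition, the Best-of-N strategy that BOND aims to distill can be sketched in a few lines. This is only an illustrative sketch: `generate` and `reward_fn` are hypothetical stand-ins for a policy sampler and a learned reward model, neither of which is specified on this page.

```python
# Minimal sketch of Best-of-N sampling, the inference-time strategy BOND distills.
# `generate` and `reward_fn` are hypothetical placeholders for a policy sampler
# and a learned reward model.

def best_of_n(prompt, generate, reward_fn, n=16):
    """Draw n candidate generations and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    rewards = [reward_fn(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: rewards[i])]
```

The cost of this strategy is N forward generations per query, which is the inference-time overhead BOND seeks to avoid.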

Community

Paper author · Paper submitter

We present J-BOND 🕴️, a novel alignment method that steers the LLM towards the Best-of-N distribution via online distillation. This lets the aligned policy inherit the strong properties of Best-of-N sampling while requiring only a single sample at inference time.

To achieve this, J-BOND minimizes the Jeffreys divergence between the training policy and the Best-of-N distribution, trading off mode-covering (forward KL) and mode-seeking (backward KL) behavior to get the best of both divergences. Moreover, it uses an iterative distillation scheme that distills the Best-of-N version of an Exponential Moving Average (EMA) anchor policy, which keeps sample complexity low and optimization stable while the policy continuously improves.
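
As a toy illustration of these two ingredients (not the authors' implementation), the Jeffreys divergence and the moving-anchor update could be written as follows. Here distributions are explicit probability vectors over a small discrete candidate set, which is only tractable in this toy setting; the actual algorithm estimates these quantities from samples.

```python
import math

# Toy sketch: Jeffreys divergence between the policy and a Best-of-N target,
# plus the EMA update of the anchor policy. Illustrative only.

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys(policy, bon_target, beta=0.5):
    """Weighted Jeffreys divergence: forward KL (mode-covering) plus backward KL (mode-seeking)."""
    return (1 - beta) * kl(bon_target, policy) + beta * kl(policy, bon_target)

def ema_update(anchor, policy, decay=0.99):
    """Moving anchor: parameters slowly track the policy, so each iteration
    distills the Best-of-N version of a gradually improving reference."""
    return [decay * a + (1 - decay) * p for a, p in zip(anchor, policy)]
```
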
We validate our design choices and overall approach on an abstractive summarization task and on fine-tuning Gemma. Aligning Gemma policies with J-BOND outperforms standard RLHF baselines, with improvements on several benchmarks.



Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 2