arxiv:2406.15193

Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Published on Jun 21
Submitted by hungchiayu on Jun 24

Abstract

The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most such methods fail to strike the right balance between exploration and exploitation of reward, often because these two aspects are conflated in their formulation, and thus fail to give well-aligned responses. To remedy this, we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions, and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decoding-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
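A minimal sketch of the evolutionary recipe described above, assuming hypothetical `generate`, `mutate_instruction`, and `reward` callables (none of these names come from the paper's code): exploration decodes from mutated instructions, and exploitation periodically replaces poorly-rewarded candidates with well-rewarded ones.

```python
import random
from typing import Callable, List


def evolve_responses(
    instruction: str,
    generate: Callable[[str], str],            # LLM decoding function (assumed interface)
    mutate_instruction: Callable[[str], str],  # instruction paraphraser/perturber (assumed)
    reward: Callable[[str, str], float],       # reward model scorer (assumed)
    population_size: int = 8,
    generations: int = 4,
    keep_top: int = 4,
) -> str:
    """Sketch of evolutionary decoding-time alignment:
    exploration = decode from mutated instructions,
    exploitation = periodically replace poorly-rewarded generations
    with well-rewarded ones."""
    # Initial population: responses decoded from mutated variants of the instruction.
    population: List[str] = [
        generate(mutate_instruction(instruction)) for _ in range(population_size)
    ]

    for _ in range(generations):
        # Exploitation: rank candidates by reward and keep only the best ones.
        population.sort(key=lambda r: reward(instruction, r), reverse=True)
        survivors = population[:keep_top]

        # Exploration: refill the population by decoding from freshly mutated
        # instructions, optionally seeded with a surviving draft.
        children: List[str] = []
        while len(survivors) + len(children) < population_size:
            seed = random.choice(survivors)
            prompt = mutate_instruction(instruction) + "\n\nDraft to improve:\n" + seed
            children.append(generate(prompt))

        population = survivors + children

    # Return the highest-reward response for the original instruction.
    return max(population, key=lambda r: reward(instruction, r))
```

The hyperparameters (population size, number of generations, number of survivors) are illustrative placeholders, not values taken from the paper.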

Community

Paper author and paper submitter · edited Jun 24

Excited to introduce our decoding-time alignment technique, Darwin! Darwin uses a reward-guided tree search framework to align LLMs with an off-the-shelf reward model from RewardBench.

This strategy outperforms other decoding-time alignment techniques and achieves performance comparable to preference optimization techniques on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench.

More details in the paper and website!

https://darwin-alignment.github.io
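For readers who want to try the reward-guidance part, below is a small, hedged example of scoring candidate responses with an off-the-shelf sequence-classification reward model via Hugging Face transformers. The checkpoint name is only an illustrative choice and is not necessarily the reward model used in Darwin.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example reward model checkpoint; swap in any sequence-classification-style
# reward model (e.g. one listed on RewardBench) with the same interface.
MODEL_ID = "OpenAssistant/reward-model-deberta-v3-large-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
reward_model.eval()


def score(instruction: str, response: str) -> float:
    """Return a scalar reward for (instruction, response); higher is better."""
    inputs = tokenizer(instruction, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    return logits[0, 0].item()


# Example: pick the better of two candidate responses for the same instruction.
candidates = ["Sure, here is a detailed, step-by-step answer ...", "I don't know."]
best = max(candidates, key=lambda r: score("Explain decoding-time alignment.", r))
```

A scorer like this can be plugged in as the `reward` callable in the evolutionary sketch above.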

