Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF Paper • 2401.16335 • Published Jan 29 • 1
Towards Efficient and Exact Optimization of Language Model Alignment Paper • 2402.00856 • Published Feb 1
Preference-free Alignment Learning with Regularized Relevance Reward Paper • 2402.03469 • Published Feb 2
Teaching Large Language Models to Reason with Reinforcement Learning Paper • 2403.04642 • Published Mar 7 • 46
RewardBench: Evaluating Reward Models for Language Modeling Paper • 2403.13787 • Published Mar 20 • 21
PERL: Parameter Efficient Reinforcement Learning from Human Feedback Paper • 2403.10704 • Published Mar 15 • 57
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL Paper • 2403.03950 • Published Mar 6 • 13
In deep reinforcement learning, a pruned network is a good network Paper • 2402.12479 • Published Feb 19 • 17
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences Paper • 2404.03715 • Published Apr 4 • 60
Offline Regularised Reinforcement Learning for Large Language Models Alignment Paper • 2405.19107 • Published May 29 • 13
Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs Paper • 2406.08657 • Published Jun 12 • 9
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM Paper • 2406.12168 • Published Jun 18 • 7
THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation Paper • 2406.10996 • Published Jun 16 • 32
Understanding Reference Policies in Direct Preference Optimization Paper • 2407.13709 • Published Jul 18 • 16
Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration Paper • 2410.18076 • Published Oct 23 • 4