WPO: Enhancing RLHF with Weighted Preference Optimization Paper • 2406.11827 • Published 12 days ago • 13
Bootstrapping Language Models with DPO Implicit Rewards Paper • 2406.09760 • Published 16 days ago • 36
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM Paper • 2406.12168 • Published 12 days ago • 7
Understanding and Diagnosing Deep Reinforcement Learning Paper • 2406.16979 • Published 6 days ago • 8