Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study Paper • 2404.10719 • Published Apr 16, 2024
On Designing Effective RL Reward at Training Time for LLM Reasoning Paper • 2410.15115 • Published Oct 19, 2024