DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails • Paper 2502.05163 • Published Feb 7, 2025
PILAF: Optimal Human Preference Sampling for Reward Modeling • Paper 2502.04270 • Published Feb 6, 2025
Teaching Large Language Models to Reason with Reinforcement Learning • Paper 2403.04642 • Published Mar 7, 2024