Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs Paper • 2402.14740 • Published Feb 22