rlhf - a hqfx Collection

rlhf

updated Sep 21

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Paper • 2406.02900 • Published Jun 5 • 11

Building Math Agents with Multi-Turn Iterative Preference Learning

Paper • 2409.02392 • Published Sep 4 • 14