Collections including paper arxiv:2403.10704

- RLHF Workflow: From Reward Modeling to Online RLHF
  Paper • 2405.07863 • Published • 62
- Understanding and Diagnosing Deep Reinforcement Learning
  Paper • 2406.16979 • Published • 8
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
  Paper • 2404.03715 • Published • 58
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
  Paper • 2407.00617 • Published • 7

- PERL: Parameter Efficient Reinforcement Learning from Human Feedback
  Paper • 2403.10704 • Published • 56
- ReFT: Representation Finetuning for Language Models
  Paper • 2404.03592 • Published • 77
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
  Paper • 2404.07973 • Published • 29
- Zephyr: Direct Distillation of LM Alignment
  Paper • 2310.16944 • Published • 120

- PERL: Parameter Efficient Reinforcement Learning from Human Feedback
  Paper • 2403.10704 • Published • 56
- HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
  Paper • 2403.13447 • Published • 17
- Self-Discover: Large Language Models Self-Compose Reasoning Structures
  Paper • 2402.03620 • Published • 107
- RAFT: Adapting Language Model to Domain Specific RAG
  Paper • 2403.10131 • Published • 65

- PERL: Parameter Efficient Reinforcement Learning from Human Feedback
  Paper • 2403.10704 • Published • 56
- WARM: On the Benefits of Weight Averaged Reward Models
  Paper • 2401.12187 • Published • 17
- RewardBench: Evaluating Reward Models for Language Modeling
  Paper • 2403.13787 • Published • 19
- DreamReward: Text-to-3D Generation with Human Preference
  Paper • 2403.14613 • Published • 33