arXiv:2407.01470

DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Published on Jul 1 · Submitted by hank0316 on Jul 2

Abstract

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors, and reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences that require expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. Our experiments demonstrate that DogeRM enhances performance across different benchmarks, and we provide a detailed analysis showcasing the effects of model merging, highlighting its great potential for facilitating model alignment.
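
As a rough sketch of what "model merging" means here (linear interpolation with a mixing coefficient λ is an assumption for illustration; the paper's exact merging scheme may differ), the merged reward model's parameters can be viewed as a weighted combination of the two source models:

```latex
% Sketch only: assumed linear interpolation of the parameters shared by the
% general reward model (RM) and the domain-specific SFT model.
\theta_{\text{DogeRM}} = \lambda\,\theta_{\text{RM}} + (1-\lambda)\,\theta_{\text{SFT}},
\qquad \lambda \in [0, 1].
```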

Community

Paper author and submitter:

Our DogeRM framework merges the transformer layers and input embeddings of a general reward model with those of a domain-specific SFT language model. We conducted experiments in the math and coding domains, and the results demonstrate the potential of our method across various benchmarks, including RewardBench, Auto-J Eval, and Best-of-N sampling on GSM8K/MBPP.
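
Below is a minimal sketch of the merging step described above, written against the Hugging Face transformers API. The model paths, the interpolation weight ALPHA, and the choice to leave the reward head untouched are illustrative assumptions, not the paper's exact configuration:

```python
# Hedged sketch of weight-space merging between a general reward model and a
# domain-specific SFT model. Paths and ALPHA are placeholders/assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

ALPHA = 0.5  # assumed interpolation weight between the two models

# General-purpose reward model (sequence classifier with a scalar head) and a
# domain-specific SFT language model sharing the same backbone architecture.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/general-reward-model", num_labels=1
)
sft_model = AutoModelForCausalLM.from_pretrained("path/to/domain-sft-model")

rm_state = reward_model.state_dict()
sft_state = sft_model.state_dict()

with torch.no_grad():
    for name, rm_param in rm_state.items():
        # Interpolate only parameters present in both models with matching
        # shapes, i.e. the shared transformer layers and input embeddings.
        if name in sft_state and rm_param.shape == sft_state[name].shape:
            rm_state[name] = ALPHA * rm_param + (1.0 - ALPHA) * sft_state[name]

reward_model.load_state_dict(rm_state)
reward_model.save_pretrained("path/to/merged-reward-model")
```

Because the reward head exists only in the reward model (and the LM head only in the SFT model), the name-and-shape check naturally restricts the merge to the shared backbone parameters.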
