arxiv:2410.01257

HelpSteer2-Preference: Complementing Ratings with Preferences

Published on Oct 2 · Submitted by akhaliq on Oct 3

Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Oleksii Kuchaiev

Abstract

Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at https://huggingface.co/datasets/nvidia/HelpSteer2 and openly release the trained Reward Model at https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward
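To make the two paradigms named in the abstract concrete, here is a minimal Python sketch of the textbook per-example objectives. It is not the paper's training code or its combined approach. The function names and the scalar inputs are illustrative assumptions; in practice the rewards would come from a model's scalar output head.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Needs preference data: two responses to one prompt, one marked as preferred.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def regression_loss(predicted_rating: float, human_rating: float) -> float:
    """Squared error against an absolute rating (hypothetical scalar scale).

    Needs rating data: one response with a human-assigned score.
    """
    return (predicted_rating - human_rating) ** 2


# Hypothetical scores, for illustration only.
bt = bradley_terry_loss(reward_chosen=1.2, reward_rejected=0.4)
mse = regression_loss(predicted_rating=3.5, human_rating=4.0)
print(f"Bradley-Terry loss: {bt:.4f}, regression loss: {mse:.4f}")
```

The different function signatures mirror the data mismatch the abstract describes: the pairwise loss consumes a preference between two responses, while the regression loss consumes an absolute rating for a single response.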

Community

akhaliq

Paper submitter Oct 3

https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward

zhilinw

Paper author Oct 3

Dataset (CC-BY-4.0 Licensed): https://huggingface.co/datasets/nvidia/HelpSteer2

Using this reward model for RLHF (specifically, REINFORCE), we were able to align a Llama-3.1-70B-Instruct model to reach an AlpacaEval 2 LC of 57.6, an Arena Hard score of 85.0, and a GPT-4-Turbo MT-Bench score of 8.98. These three benchmarks are known to be predictive of LMSys Chatbot Arena Elo.

As of 1 Oct 2024, this aligned model is #1 on all three automatic alignment benchmarks, edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

See details in the HelpSteer2-Preference paper at https://arxiv.org/abs/2410.01257. As a preview, this model can correctly answer the question "How many r in strawberry?" without specialized prompting or additional reasoning tokens:

A sweet question!
Let’s count the “R”s in “strawberry”:
1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y
There are **3 “R”s** in the word “strawberry”.
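
The count in the reply above can be checked mechanically. A minimal Python sketch, using 1-based positions to match the numbered list (the variable names are illustrative):

```python
word = "strawberry"

# 1-based positions of every "r", matching the numbering in the reply above.
positions = [i for i, ch in enumerate(word, start=1) if ch.lower() == "r"]

print(positions)       # [3, 8, 9]
print(len(positions))  # 3
```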