license: mit
datasets:
- llm-blender/Unified-Feedback
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
Introduction
This reward model is fine-tuned from mistralai/Mistral-7B-Instruct-v0.2 on the 'llm-blender/Unified-Feedback' dataset. It achieves an accuracy of 0.7740 on the test sets, which makes it a good proxy reward model for human preferences, and it can be used for aligning LLMs.
The Unified-Feedback dataset contains diverse preference data from prior open-source datasets including:
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
- argilla/ultrafeedback-binarized-preferences-cleaned
- berkeley-nest/Nectar
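For readers unfamiliar with how a preference accuracy such as the 0.7740 above is usually measured, the following is a minimal sketch. It assumes the common definition (the fraction of test pairs in which the chosen response receives a higher reward than the rejected one); the card itself does not spell out the metric, and the function name and toy scores are hypothetical.

```python
from typing import Sequence


def pairwise_accuracy(chosen_rewards: Sequence[float],
                      rejected_rewards: Sequence[float]) -> float:
    """Fraction of pairs where the chosen response gets the strictly higher reward.

    Assumed definition for illustration; not taken from the model card.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("chosen and rejected rewards must have the same length")
    if len(chosen_rewards) == 0:
        raise ValueError("need at least one preference pair")
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)


# Toy scores, made up for illustration only (not model outputs).
chosen = [1.2, 0.3, -0.5, 2.0]
rejected = [0.7, 0.9, -1.1, 1.5]
print(pairwise_accuracy(chosen, rejected))  # 3 of the 4 pairs are ranked correctly
```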
Training Code and Blog
The training script has been merged into https://github.com/WeiXiongUST/RLHF-Reward-Modeling, which is based on the trl package. In addition, this blog introduces some background on reward modeling and shares experience from the experiments.
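The card does not state the training objective. As a rough sketch, assuming the standard pairwise Bradley-Terry loss that is commonly used for reward models (including in trl's reward trainer), the per-pair loss is -log(sigmoid(r_chosen - r_rejected)). The function name below is hypothetical, and this pure-Python version is for illustration only, not the repository's implementation.

```python
import math


def pairwise_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected)).

    Assumed objective for illustration; computed in a numerically stable way.
    """
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the chosen response is scored further above the rejected one.
for chosen, rejected in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 3.0)]:
    print(chosen, rejected, round(pairwise_loss(chosen, rejected), 4))
```

With equal rewards the loss is log(2), and a correctly ordered pair with a large margin drives it toward zero.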
Evaluation
We evaluate this reward model on the reward model benchmark. The results show that it is close to the current best 7B reward model and outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.
Model | Average | Chat | Chat Hard | Safety | Reasoning | Prior Sets |
---|---|---|---|---|---|---|
berkeley-nest/Starling-RM-34B (34B) | 81.5 | 96.9 | 59 | 89.9 | 90.3 | 71.4 |
Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback (Ours, 7B) | 78.75 | 97.84 | 52.85 | 85.94 | 87.02 | 73.92 |
berkeley-nest/Starling-RM-7B-alpha (7B) | 74.6 | 98 | 43.4 | 88.6 | 74.6 | 68.6 |
openbmb/UltraRM-13b (13B) | 71.3 | 96.1 | 55.3 | 45.8 | 82 | 77.2 |
IDEA-CCNL/Ziya-LLaMA-7B-Reward (7B) | 66 | 88 | 41.3 | 62.5 | 73.7 | 64.6 |
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 (1.4B) | 65.1 | 88.5 | 47.9 | 62.1 | 61.4 | 65.8 |
OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 (7B) | 64 | 94.4 | 36.6 | 59.4 | 70 | 59.4 |
llm-blender/PairRM-hf (0.4B) | 60.9 | 90.2 | 53 | 31.5 | 60 | 69.6 |
Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback',
    num_labels=1, torch_dtype=torch.float16,
    device_map=0,
)
message = [
{'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her?"},
{'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"}
]
message_template = tokenizer.apply_chat_template(message, tokenize=False)
# it will look like this: "<s><s> [INST] I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her? [/INST]Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?</s>"
kwargs = {"padding": 'max_length', "truncation": True, "return_tensors": "pt"}
tokens = tokenizer.encode_plus(message_template, **kwargs)
with torch.no_grad():
    # keep the batch dimension: the model expects inputs of shape (batch, seq_len)
    reward_tensor = reward_model(tokens["input_ids"].to(reward_model.device), attention_mask=tokens["attention_mask"].to(reward_model.device)).logits.reshape(-1)
reward = reward_tensor.cpu().detach().item()
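One simple way to use such scores is best-of-n selection: score several candidate responses to the same prompt and keep the highest-scoring one. The sketch below is a hypothetical helper, not part of the model's API. `best_of_n` and `score_fn` are assumed names, and `score_fn` stands for any function that maps a chat-format message list (as in the usage snippet) to a scalar reward, for example the scoring steps above wrapped in a function.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def best_of_n(prompt: str,
              candidates: List[str],
              score_fn: Callable[[List[Message]], float]) -> str:
    """Return the candidate response that score_fn rates highest for the prompt."""
    if not candidates:
        raise ValueError("need at least one candidate response")

    def score(response: str) -> float:
        # Build the same user/assistant message format used in the usage snippet.
        messages = [
            {'role': 'user', 'content': prompt},
            {'role': 'assistant', 'content': response},
        ]
        return score_fn(messages)

    return max(candidates, key=score)


# Stand-in scorer for demonstration only: it rewards longer answers.
# In practice, replace it with a function that runs the reward model.
def toy_score(messages: List[Message]) -> float:
    return float(len(messages[-1]['content']))


print(best_of_n("Say hello.", ["Hi", "Hello there!", "Hey"], toy_score))
```

Passing the scorer in as a function keeps the selection logic independent of how the reward is computed, so it can be tried out without loading the 7B model.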
Citation
This reward model is used as the gold reward model in the following research: https://arxiv.org/abs/2406.10216. If you find this model helpful for your research, please cite:
@article{yang2024regularizing,
title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
journal={arXiv preprint arXiv:2406.10216},
year={2024}
}