---
license: mit
datasets:
- llm-blender/Unified-Feedback
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
---
## Introduction
This reward model is fine-tuned from [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on the [llm-blender/Unified-Feedback](https://huggingface.co/datasets/llm-blender/Unified-Feedback) dataset.
It achieves an accuracy of **0.7740** on the test sets, making it a good proxy reward model for modeling human preferences that can be used for aligning LLMs.
The Unified-Feedback dataset contains diverse preference data from prior open-source datasets including:
* openai/summarize_from_feedback
* openai/webgpt_comparisons
* Dahoas/instruct-synthetic-prompt-responses
* Anthropic/hh-rlhf
* lmsys/chatbot_arena_conversations
* openbmb/UltraFeedback
* argilla/ultrafeedback-binarized-preferences-cleaned
* berkeley-nest/Nectar
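
The test-set accuracy quoted above is presumably a pairwise accuracy: the fraction of preference pairs in which the model gives the chosen response a higher reward than the rejected one. The card does not spell out the exact evaluation code, so the following is only a minimal sketch under that assumption. The function name `pairwise_accuracy` and the reward values are made up for illustration, and ties are counted as errors here.

```python
from typing import Sequence


def pairwise_accuracy(chosen_rewards: Sequence[float],
                      rejected_rewards: Sequence[float]) -> float:
    """Fraction of pairs where the chosen response gets the strictly higher reward."""
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("both sequences must have the same length")
    if len(chosen_rewards) == 0:
        raise ValueError("at least one preference pair is required")
    # A pair counts as correct only if chosen > rejected (ties count as errors).
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)


# Made-up rewards for four preference pairs (illustrative values only).
chosen = [1.3, 0.2, -0.5, 2.1]
rejected = [0.4, 0.9, -1.0, 1.7]
print(pairwise_accuracy(chosen, rejected))  # 3 of 4 pairs are ranked correctly
```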
## Training Code and Blog
Our training script has been merged into https://github.com/WeiXiongUST/RLHF-Reward-Modeling, which is based on the [trl](https://github.com/huggingface/trl) package. In addition, this [blog](https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0?pvs=4) introduces the basics of reward modeling and shares experience from our experiments.
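
For readers unfamiliar with pairwise reward modeling, the objective commonly used for this kind of training is the Bradley-Terry loss, `-log(sigmoid(r_chosen - r_rejected))`, averaged over preference pairs. The sketch below illustrates that loss in plain Python; it is an assumption about the general technique, not a copy of the linked training script, and the helper names `log_sigmoid` and `pairwise_loss` are hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Identity: log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_loss(chosen_rewards: Sequence[float],
                  rejected_rewards: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over all pairs."""
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("both sequences must have the same length")
    if len(chosen_rewards) == 0:
        raise ValueError("at least one preference pair is required")
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)


# The loss shrinks as the chosen reward moves above the rejected reward.
print(pairwise_loss([2.0], [0.0]))  # chosen ranked higher: small loss
print(pairwise_loss([0.0], [2.0]))  # chosen ranked lower: large loss
```

When both rewards are equal, the loss is `log(2)`, which corresponds to the model assigning a 50% preference probability to either response.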
## Evaluation
We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench). The results show that this model is close to the **current best 7B reward model** and outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.
| Model | Average | Chat | Chat Hard | Safety | Reasoning | Prior Sets |
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
| berkeley-nest/Starling-RM-34B (34B) | 81.5 | 96.9 | 59 | 89.9 | 90.3 | 71.4 |
| **Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback** (Ours, 7B) | 78.75 | 97.84 | 52.85 | 85.94 | 87.02 | 73.92 |
| berkeley-nest/Starling-RM-7B-alpha (7B) | 74.6 | 98 | 43.4 | 88.6 | 74.6 | 68.6 |
| openbmb/UltraRM-13b (13B) | 71.3 | 96.1 | 55.3 | 45.8 | 82 | 77.2 |
| IDEA-CCNL/Ziya-LLaMA-7B-Reward (7B) | 66 | 88 | 41.3 | 62.5 | 73.7 | 64.6 |
| OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 (1.4B) | 65.1 | 88.5 | 47.9 | 62.1 | 61.4 | 65.8 |
| OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 (7B) | 64 | 94.4 | 36.6 | 59.4 | 70 | 59.4 |
| llm-blender/PairRM-hf (0.4B) | 60.9 | 90.2 | 53 | 31.5 | 60 | 69.6 |
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback',
    num_labels=1, torch_dtype=torch.float16,
    device_map=0,
)
message = [
{'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her?"},
{'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"}
]
message_template = tokenizer.apply_chat_template(message, tokenize=False)
# it will look like this: "<s><s> [INST] I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her? [/INST]Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?</s>"
kwargs = {"padding": 'max_length', "truncation": True, "return_tensors": "pt"}
tokens = tokenizer.encode_plus(message_template, **kwargs)
with torch.no_grad():
    # input_ids and attention_mask keep their batch dimension: shape (1, seq_len)
    reward_tensor = reward_model(
        tokens["input_ids"].to(reward_model.device),
        attention_mask=tokens["attention_mask"].to(reward_model.device),
    ).logits.reshape(-1)
    reward = reward_tensor.cpu().detach().item()
```
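
One common way to use a scalar reward like this is to rank several candidate responses to the same prompt and keep the best one (best-of-n selection). The sketch below shows only the ranking logic. `rank_responses` and `score_fn` are hypothetical names: `score_fn` stands for any function that wraps the reward model and returns one reward per (prompt, response) pair. `toy_score` is a stand-in that counts words so the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rank_responses(prompt: str,
                   responses: Sequence[str],
                   score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Return (response, reward) pairs sorted from highest to lowest reward."""
    scored = [(response, score_fn(prompt, response)) for response in responses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(prompt: str, response: str) -> float:
    # Stand-in scorer for illustration only: counts the words in the response.
    # In practice this would call the reward model and return its scalar reward.
    return float(len(response.split()))


candidates = [
    "No.",
    "Sorry, I can't help with that.",
    "Maybe find a babysitter?",
]
ranked = rank_responses("Can you impersonate me?", candidates, toy_score)
best_response, best_reward = ranked[0]
print(best_response, best_reward)
```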
## Citation
This reward model is used as the gold reward model in the research paper https://arxiv.org/abs/2406.10216. If you find this model helpful for your research, please cite:
```
@article{yang2024regularizing,
title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
journal={arXiv preprint arXiv:2406.10216},
year={2024}
}
```